Friday, 15 February 2008

Migrating Plone Site via JSON files #2

I sometimes wonder if I am particularly stupid, or if everybody goes through the same process as I do. Every programming problem I try to solve I basically approach with an idea of how I want to do it. Then I make a first version. No more planning than that.

If I have made something similar before, the solution is very close to my initial idea. If it is a new area of expertise, the solution often ends up very different from what I had visualized. I find a problem, I fix it and refactor the code. I find a new problem, I fix it and refactor the code.

It works for me, and I find that it usually produces quality code in the end. Especially since I am pretty ruthless with the refactoring.

But I keep wondering if I would code faster if I did more upfront design.

Export is done



I think that I am finally finished with the export part of this json migrator. Well ok. This is a customer project and I am focusing on a specific subset of the content types. So it is not a general tool, but that is ok for now. I keep refactoring and using general methods to solve the problems.

I can export folders, files, documents and images. Events should be easy too. I just don't need them yet. I can also export some custom types. Each custom type needs about 15 lines of code.

The major thing I have not looked at yet is discussions/talkbacks.

Recap of the data structure



The (simplified) content object now looks like this.


class Document:
    getId = 'my-document'
    meta_type = 'Document'
    Title = 'My Title'
    Description = 'An example content object'
    content_type = 'text/html'
    text = 'The text in the content object'


A file like this


class Image:
    getId = 'this-is-an-image.jpg'
    meta_type = 'Portal Image'
    Title = 'This is an image'
    Description = 'An example image object'
    content_type = 'image/jpg'
    data = 'dagfhh3gh3gf3hgf'  # mysterious data that is a jpeg image :-s


The document type was easy to save as json. I just mapped every attribute to a key/val in a dict.


document = {
    u'getId': u'my-document',
    u'meta_type': u'Document',
    u'Title': u'My Title',
    u'Description': u'An example content object',
    u'getRawText': u'The text in the content object',
    u'content_type': u'text/html',
}
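
A minimal sketch of that mapping, assuming Python 3 and the standard json module (in 2008 it would more likely have been simplejson). The export_attributes helper and the EXPORTED_ATTRIBUTES list are hypothetical names for illustration, not the actual migrator code. On a real Plone object several of these names are methods rather than plain attributes, and the sketch keeps the name text instead of renaming it to getRawText.

```python
import json

# Hypothetical list of attributes to export; not the migrator's actual list.
EXPORTED_ATTRIBUTES = ['getId', 'meta_type', 'Title', 'Description',
                       'content_type', 'text']


class Document:
    """Simplified stand-in for a Plone content object."""
    getId = 'my-document'
    meta_type = 'Document'
    Title = 'My Title'
    Description = 'An example content object'
    content_type = 'text/html'
    text = 'The text in the content object'


def export_attributes(obj, names=EXPORTED_ATTRIBUTES):
    """Map each named attribute that obj has to a key/value pair in a dict."""
    return {name: getattr(obj, name) for name in names if hasattr(obj, name)}


record = export_attributes(Document())
json_text = json.dumps(record, indent=2, sort_keys=True)
print(json_text)
```

Because the helper only looks up names from a list, a new content type mostly needs its own list of attribute names.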


The paths chosen, with many backtracks



A minor problem was that the large amount of html in some of the documents made the json files harder to read in the text editor. Just like xml files with cdata sections.

A bigger problem was the image/file content type. I first tried to save the "data" attribute as a value in the dict. But the json writer did not like that. It turned out that there were some characters that it could not escape. It probably needs to be utf-8 or something like that. I did not dig too deep into the reason, but chose to encode the data.

The data is a string, so it is pretty simple to just use base64:


>>> 'this will be encoded'.encode('base64')
'dGhpcyB3aWxsIGJlIGVuY29kZWQ=\n'
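
The one-liner above relies on the base64 string codec in Python 2. As a sketch, assuming Python 3, the same round trip with the standard base64 module could look like this. The raw bytes are made-up sample data standing in for image data.

```python
import base64
import json

raw = b'this will be encoded'

# b64encode returns bytes; decode to str so the json writer accepts it.
# Unlike the Python 2 codec, it does not append a trailing newline.
encoded = base64.b64encode(raw).decode('ascii')

# The encoded text is plain ASCII, so it serializes without escaping problems.
json_text = json.dumps({'data': encoded})

# Reading it back: parse the json, then decode the base64 text to bytes.
restored = base64.b64decode(json.loads(json_text)['data'])
```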


But large image files did not exactly improve the visual signal-to-noise ratio in the json files either. I tried for about 15 minutes to find a library that would linebreak the encoded data to at least improve the readability a bit, but finally realised that I had fallen in love with the idea of json purity and needed to let go.

It is like that sometimes. You get an idea and have some resistance to letting it go. Authors have an expression called 'kill your darlings'. Same thing.


Base64 encoded data is also about a third larger than the original, and when the first 19 MByte (.bmp) files turned up in my export dir I was sure I was on the wrong track again.

The huge .bmp image file was just a portrait of some random person. There was absolutely no reason for it to take up more than 1/2 MB as a reasonably compressed jpeg file.

There is one thing I have learned through the years. Users do not do what they should. They do what they can. And if they can upload a 12 MB bmp image, they will. So I expect this to be a normal use case on many sites.

A 19 MByte file would need to be loaded into RAM and then converted back into the binary file, which would be about 12 MBytes. That means I would use about 30 MBytes of RAM just for this single json file.
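
The overhead follows from how base64 works: every 3 bytes of input become 4 characters of output. A small sketch of the arithmetic, assuming Python 3. The base64_length helper is a hypothetical name, and the sizes are illustrative, not measurements from the export.

```python
import base64


def base64_length(raw_length):
    """Length of the base64 text for raw_length bytes (padded, no line breaks)."""
    return 4 * ((raw_length + 2) // 3)


# Check the formula against the real encoder on a small buffer.
sample = bytes(1000)
assert len(base64.b64encode(sample)) == base64_length(len(sample))

# While decoding, the encoded text and the decoded bytes are both in RAM,
# so the peak is roughly the sum of the two sizes.
raw_size = 3 * 1024 * 1024
encoded_size = base64_length(raw_size)
peak_size = raw_size + encoded_size
print(raw_size, encoded_size, peak_size)
```

This ignores the extra copies a json parser may make while reading the file, so the real peak can be higher.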

With a Plone site of several thousand objects it can get ugly real fast.

Besides that, an object made with AT (Archetypes) can easily have several image fields, each with a multi-MB image.

My idea of using just json for export did not look too good.

Then I played with the idea of keeping the data in the data field and then swapping it out with a filename when I wrote it out to disk. So this structure:


document = {
    ...
    u'data': u'dagfhh3gh3gf3hgf'  # jpeg image
}

Would be changed to this when saved:

document = {
    ...
    u'data': u'file.1.json.1.bin'  # jpeg image file name
}


And it would be saved as two separate files on the filesystem.


file.1.json # the json file
file.1.json.1.bin # the data file


That was pretty clean, and I liked that it was easy to see which data file belonged to which json file. The json file was kind of a metadata file for the data file.
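
A sketch of how the swap could work, assuming Python 3 and that the binary content is held as bytes under the data key. The function name save_record and its parameters are hypothetical; only the file naming follows the scheme shown above.

```python
import json
import os


def save_record(record, directory, index):
    """Write record as file.<index>.json, with binary data in a side file."""
    json_name = 'file.%d.json' % index
    to_save = dict(record)  # shallow copy so the caller's dict keeps its data

    data = to_save.get('data')
    if isinstance(data, bytes):
        data_name = json_name + '.1.bin'
        with open(os.path.join(directory, data_name), 'wb') as f:
            f.write(data)
        # Swap the raw bytes for the name of the file that holds them.
        to_save['data'] = data_name

    with open(os.path.join(directory, json_name), 'w', encoding='utf-8') as f:
        json.dump(to_save, f, indent=2)
    return json_name
```

Since the binary data is written straight to its own file, it never has to pass through base64 or the json writer.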

But now that the data was saved separately, it annoyed me that I could not see what kind of file it was. If I wanted to view the file "file.1.json.1.bin" I had to look in "file.1.json", see that it was a jpeg file and then rename it to "file.1.json.1.jpg". And then I still could not guess the content from the name.

Sometimes unskilled labor is a better solution than an overpriced programmer. If you only have a few files that you care about in your old CMS, it can be practical to just export them and then do a manual import.

If you can see all the files from your site in one directory it can also give a good overview of the work you need to do.

For these reasons I found that I had to rewrite once more and give the data files some sort of meaningful names.

This turned out to be surprisingly difficult. As usual the devil is in the details.

More in the next article
