Sunday 27 February 2011

A website's content journey

I had an email from an old friend a few weeks ago. He had a site with a bunch of work-related research pages and documents collaboratively edited by a handful of people, and had let its domain lapse. It was a six year old Plone site that he and his fellow editors suddenly wanted back up, for another five years, but they had no cash to do so. It also needed to move to a new hosting provider. The first thing to do was re-register the domain, which cost around 40 dollars for a cut-price option - now I needed the site back up to point it at. Initially I assumed a static dump was bound to be the quickest and cheapest option.

I had a look at some static site Python tools such as hyde, a Django-based static site generator inspired by Ruby's Jekyll, with which simple database apps such as blogs can be dumped to static files for performance whilst still being editable. However my friend was not a techie, so he was unlikely to cope with editing text 'content' files on the file system. So for the time being I just ran httrack over the site to dump it to the file system. Next I copied it over to a 'free' Amazon micro instance. Since this was now a static site, using Apache seemed overkill, and I thought it was long overdue that I tried out nginx.
However there proved to be almost nothing to try out, since the default Amazon AMI comes with an nginx package. All you need to do is add a new micro instance, start it up, run
>sudo yum install nginx
and copy the contents of an httrack dump of the site to /usr/share/nginx/html.
That's it. It was very fast and the config was very simple. A big thumbs up for nginx then, and I also quite like its Russian constructivist styled website, especially now the original Russian-only documentation has many translations ;-) The final stage was to assign an Amazon elastic IP to the instance and point the domain registration at that IP.
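Incidentally, if you would rather script the elastic IP step than click around the AWS console, the boto library can do it from Python. This is just a rough sketch with a placeholder instance id, rather than anything the migration actually needed:

    import boto

    # connect with the AWS keys from your environment or boto config
    conn = boto.connect_ec2()
    # reserve a new elastic IP and attach it to the micro instance
    address = conn.allocate_address()
    conn.associate_address('i-0123abcd', address.public_ip)
    print('Point the domain A record at ' + address.public_ip)

The allocated address stays yours until you explicitly release it, so the domain's A record only needs pointing at it once.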

Great, the site was back and seemed pretty nippy. However, there were two problems: it was no longer a CMS, and Amazon's free micro instances are actually only available as a free trial of a month's worth of uptime hours. After that the hosting was a lot more expensive than a standard minimal shared host, and neither option was free. So if hosting was to be paid for, I might as well do a proper job and upgrade the Plone site to the current Plone 4, making it a CMS again.

Fortunately I had released a tool that does just that, called ilrt.contentmigrator, a year or so ago.
It takes content from old Plone (e.g. 2.0) and exports it to a simple email-style format of content and metadata, which can be reimported into a new Plone site. The only problem was I hadn't yet updated all the tests and bits and pieces to release a Plone 4 version. But since 4 had been out for some months, it was high time that I did, and this was the excuse I needed. I got the upgrade done and exported the site, and it ran happily in Plone 4.
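To give a flavour of that export format - this is a hand-waved illustration, not the tool's actual code - each object's metadata goes into email-style headers, with the body text as the payload, which Python's email module handles nicely:

    from email.message import Message

    def export_item(obj):
        # obj is just a dict standing in for a Plone content object here
        msg = Message()
        msg['Id'] = obj['id']
        msg['Title'] = obj['title']
        msg['Type'] = obj['portal_type']        # e.g. Document or Folder
        msg['Modification-Date'] = obj['modified']
        msg.set_payload(obj.get('text', ''))    # the body text, if any
        return msg.as_string()

    print(export_item({'id': 'front-page', 'title': 'Welcome',
                       'portal_type': 'Document', 'modified': '2011-02-27',
                       'text': '<p>Hello</p>'}))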
So now I had a working CMS back on the old domain, running on an Amazon micro instance as a fully featured CMS again. So I emailed my friend: it's back, you can edit it - the only problem is it's going to cost about 20 dollars a month.

Ahhh, now of course I should have recalled that one of my friend's defining characteristics is being a tightwad - the idea of paying hundreds of dollars over the next five years meant the site was effectively down again! So back to the drawing board. OK, so with all these free services and cloud technologies out there these days, there must be a cheaper solution. A quick hunt around and the answer was obvious: the CMS had no sophisticated features, so a free Google site would easily cover my friend's requirements, without dropping the site-like structure and collaborative document nature in the way a simpler blog solution, such as WordPress, would.

So I set up a Google site; now to put the content in. Well of course I could just tell my friend to cut and paste it all, but Google has a pretty extensive data and provisioning API, and I had already written a content migrator for Plone. Why not make it work between Plone and the Google Sites API as well? So using the Python wrapper for the RESTful, Atom-feed-based Google Data APIs, I added an export tool that writes the basic content types and folders from Plone to a Google site.
The two share the storage paradigm of a NoSQL database and a folder-like interface to content creation. Plone has an inherent folder-like storage paradigm at an internal level, implemented via its acquisition mechanism within the ZODB, whilst Google sites have a much thinner skin of folder-like behaviour, added by parent-child node properties on the objects stored in its BigTable hash table cloud (the shared storage behind Sites, Apps, App Engine etc.).
As it turned out this meant that writing a migration tool to push the more metadata-rich content from Plone to Google was quite straightforward. I rewrote the import-to-Plone script as an import-to-Google one, using the gdata library, and the site was up as a Google site. Change the domain's IP again, and my friend had his site back, for free - hurray, job done.
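The heart of that export is really just the gdata Sites client logging in and creating pages under the right parent entry. Something along these lines - simplified, from memory and with made-up names, so treat it as a sketch of the idea rather than the released code:

    import gdata.sites.client

    client = gdata.sites.client.SitesClient(source='contentmigrator-sketch',
                                            site='my-migrated-site')
    client.ClientLogin('someone@gmail.com', 'password', client.source)

    def push_folder(folder, parent=None):
        # a Plone folder becomes a Google Sites filecabinet...
        cabinet = client.CreatePage('filecabinet', folder['title'], parent=parent)
        for item in folder['children']:
            if item['portal_type'] == 'Folder':
                push_folder(item, parent=cabinet)
            else:
                # ...and a Plone document becomes a webpage inside it
                client.CreatePage('webpage', item['title'],
                                  html=item.get('text', ''), parent=cabinet)

File attachments go up in much the same way via the client's attachment upload call, but for the basic page and folder types a loop like that is pretty much all there is to it.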

However I couldn't quite leave it there. I had written a tool to move simple Plone sites to Google for free hosting. But there was probably at least as big a use case for moving Google sites, with their more limited design, content types and workflow, to Plone, once a site's customisation demands have outgrown its Google Sites origins. On top of that I should really write some tests and package things up properly to add these features to my new ilrt.contentmigrator release.
As it turned out, migrating from a Google site to Plone was a little harder, partly because the Google Sites RESTful Atom API doesn't expose the folder tree layer of content by default. All content is available from the root feed, but empty folders are missed out. There also seemed to be a bug with getting a folder's (or filecabinet's, as Google Sites calls them) summary text. I guess the API is still in Labs status, so this may be fixed in time.
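Going the other way, the client hands you one flat content feed, so the folder tree has to be rebuilt from each entry's parent link. Roughly like this - again the names are illustrative, and the exact helper methods may vary between gdata releases:

    import gdata.sites.client

    client = gdata.sites.client.SitesClient(source='contentmigrator-sketch',
                                            site='my-migrated-site')
    client.ClientLogin('someone@gmail.com', 'password', client.source)

    feed = client.GetContentFeed()
    for entry in feed.entry:
        kind = entry.Kind()               # webpage, filecabinet, attachment...
        parent = entry.FindParentLink()   # None for entries at the site root
        print('%s: %s (parent: %s)' % (kind, entry.title.text, parent))

You can see from a dump like that why empty folders go missing - they never appear in the feed in the first place.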

Anyhow I have released it as a first version for the standard folder, page and file attachment types. So I hope somebody has reason to use the tool's new features, and can give me some feedback when they do.