Monday, January 13, 2014

Postgres character set conversion woes

I have spent the last day or so struggling to sort out some badly encoded data in PostgreSQL.
This proved considerably more hassle than I expected, partly due to my ignorance of the correct syntax to use to convert textual data.

So on that basis I thought I would share my pain!

There are a number of issues with character sets in relational databases.

For a Postgres database the common answers often relate to fixing the encoding of the whole database. If that is the problem, the fix is often just a matter of setting your client encoding to match that of the database, or of dumping the database, creating a new one with the correct encoding set, and reloading the dump.

However there are cases where the encoding is only problematic for certain fields in the database, or where you are creating views via database links between two live databases of different encodings - and so need to fix the encoding on the fly via these views.

Ideally you have two databases that are both correctly encoded, but just use different encodings.
If this is the case you can just use convert(data, 'encoding1', 'encoding2') for the relevant fields in the view.

Then you come to the sort of case I was dealing with, where the encoding is too mashed for this to work: strings have been pushed in as raw bytes that either don't correspond to any proper encoding, or use different encodings within the same field.

In these cases any attempt to run an encoding conversion function will fail, because there is no consistent 'encoding1'.

The symptom of such data is that it fails to display, so it is sometimes difficult to notice until
the system / programming language that is accessing the data throws encoding errors.
In my case the pgAdmin client failed to display the whole field, so although the field appeared blank, matches with like '%ok characs%' or length(field) still worked OK. Meanwhile the normal psql command displayed all the characters except for the problem ones, which were simply missing from the string.

This problem has two solutions:

1. Repeat the dump and rebuild approach with the correct encoding, but write a custom script in Perl, Python or the like to fix the mashed encoding - assuming that the mashing is not so entirely random as to be unfixable via an automated script*. If it is, then you either have to detect and throw away the bad data - or fix things manually!

2. Fix the problem fields via pl/pgsql, pl/python or pl/perl functions that process them to replace known problem characters in the data.

I chose to use pl/pgsql since I had a limited set of these problem characters, so didn't need the full functionality of Python or Perl. However, in order for pl/pgsql to be able to handle the characters for fixing, I did need to turn the problem fields into a raw byte format.

I found that the conversion back and forth to bytea was not well documented, although the built-in functions to do so were relatively straightforward...

Text to Byte conversion => text_field::bytea

Byte to Text conversion => encode(text_field::bytea, 'escape')

So employing these for fixing the freaky characters that were used in place of escaping quotes in my source data ...

CREATE OR REPLACE FUNCTION encode_utf8(text)
  RETURNS text AS
$BODY$
BEGIN
    -- single quote stored as the superscript-a-underline / Yen character
    IF position('\xaa'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x27'::bytea from position('\xaa'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;

    -- double quote stored as the capital angstrom character
    IF position('\xa5'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x22'::bytea from position('\xa5'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;
    RETURN $1;
END;
$BODY$
  LANGUAGE plpgsql;

Unfortunately the Postgres byte string functions don't include an equivalent of string replace, and the above function assumes just one problem character per field (my use case), but it could be adapted to loop through each character and fix it via overlay.
So the function above allows for dynamic data fixing of improperly encoded text in views from a legacy database that is still in use - via a database link to a current UTF8 database.

* For example in Python you could employ chardet to autodetect possible encoding and apply conversions per field (or even per character)
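
As a rough illustration of that footnoted approach, the sketch below guesses each raw value's encoding with chardet and decodes it to unicode. The fallback encoding and error handling are assumptions that would need tuning against real data, not part of the fix described above.

import chardet

def fix_field(raw_bytes, fallback='latin-1'):
    """Guess the encoding of a raw byte string and return a unicode value."""
    guess = chardet.detect(raw_bytes)
    encoding = guess['encoding'] or fallback
    try:
        return raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong - fall back and replace anything undecodable
        return raw_bytes.decode(fallback, 'replace')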

Monday, January 6, 2014

WSGI functional benchmark for a Django Survey Application

I am currently involved in the redevelopment of a survey creation tool that is used by most of the UK University sector. The application is being redeveloped in Django, creating surveys in PostgreSQL and writing the completed survey data to Cassandra.
The core performance bottleneck is likely to be the number of concurrent users who can simultaneously complete surveys. As part of the test tool suite we have created a custom Django command that uses a browser robot to complete any survey with dummy data.
I realised when commencing this WSGI performance investigation that this functional testing tool could be adapted to act as a load testing tool.
So rather than just getting general request statistics - I could get much more relevant survey completion load data.

There are a number of more thorough benchmark posts covering raw pages and a wider range of WSGI servers - e.g. http://nichol.as/benchmark-of-python-web-servers - however they do not focus so much on the most common servers used for Django applications, or address the configuration details of those servers. So though less thorough, I hope this post is also of use.

The standard configuration for running Django in production is the dual web server set up. In fact Django is pretty much designed to be run that way, with contrib apps such as staticfiles provided to collect images, javascript, etc. for serving separately from the code. This recognizes that in production a web server optimized for serving static files is going to be configured very differently from one optimized for a language runtime environment, even if they are the same web server, e.g. Apache. So ideally it would be delivered via two separately configured Apaches: a fast and light static-serving Apache on high I/O hardware, and a mod_wsgi configured Apache on large memory hardware. In practice Nginx may be easier to configure for static serving, or for a larger globally used app, perhaps a CDN.
This is no different from optimising any web application runtime, such as Java Tomcat. Separate static file serving always offers superior performance.

However these survey completion tests are not testing static serving - simpler load tests suffice for that purpose. They are testing the WSGI runtime performance of a particular Django application.

Conclusions

Well, you can draw your own, for the load you require on a given set of hardware resources! You could of course just upgrade your hardware :-)

However, uWSGI is clearly best for consistent performance at high loads, although
Apache MPM worker outperforms it when the load is not so high. This is likely to be due to the slightly higher memory per thread that Apache uses compared to uWSGI.

Using the default Apache MPM prefork may be OK, but it can leave you much more open to DOS attacks, via a nasty performance brick wall, whilst daemon mode may result in more timeout failures as overloading occurs.

Gunicorn is all Python so easier to set up for multiple django projects on the same hardware, and performs consistently across different loads, if not quite as fast overall.

I also tried a couple of other Python web servers, e.g. Tornado, but the best I could get was over twice as slow as these three servers. They may well have been configured incorrectly, or be less suited to Django; either way I did not pursue them.

Oh and what will we use?

Well probably Apache MPM worker will do the trick for us, with a separate proxy front-end Apache configured for static file serving.
At least that way it's all the same server that we need to support, and one that we are already well experienced in. Also our static file demands are unlikely to be sufficient to warrant the use of Nginx or a CDN.

I hope that these tests may help you - if not to make a decision, then at least to try out a few WSGI servers and configs for yourself. Let me know if your results differ widely from mine, especially if there are some vital performance related configuration options I missed!

Running the functional load test

To run the survey completion tool with a number of concurrent users and collect stats on the runs, I wrapped it up in test scripts for locust.

So each user completes one each of seven test surveys.
The locust server can then be handed the number of concurrent users to test with and the test run fired off for 5 minutes, over which time around 3-4000 surveys are completed.
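
For anyone curious what that wrapping looks like, here is a minimal sketch of such a locustfile using the classic HttpLocust / TaskSet API. The survey URL pattern and wait times are made up, and the real test scripts drive the browser-robot Django command rather than a plain HTTP request.

from locust import HttpLocust, TaskSet, task

def complete_survey(client, survey_id):
    # Placeholder for the real dummy-data survey completion logic;
    # here it just requests a hypothetical survey URL
    client.get("/survey/%d/" % survey_id)

class SurveyCompletion(TaskSet):
    @task
    def run_surveys(self):
        # Each simulated user completes one each of the seven test surveys
        for survey_id in range(1, 8):
            complete_survey(self.client, survey_id)

class SurveyUser(HttpLocust):
    task_set = SurveyCompletion
    min_wait = 1000  # milliseconds between task runs
    max_wait = 3000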

The numbers of concurrent users tested with were 10, 50 and 100.
With our current traffic, peak loads will probably be around the 20 user mark, with averages of 5 to 10 users. However there are occasional peaks higher than that. Ideally with the new system we will start to see higher traffic, where the 100 user benchmark may be of more relevance.

Fails

A number of bad configs for the servers produced a lot of fails, but with a good config these seem to be very low. The 3 x 5 minute test runs for each setup created around 10,000 surveys in total, and these are the actual numbers of fails in those 10,000 -
so insignificant perhaps ...

Apache MPM process = 1
Apache MPM worker = 0
Apache Daemon = 4
uWSGI = 0
Gunicorn = 1

(so the fastest two configs both had no fails, because neither ever timed out)

Configurations

The test servers were run on the same virtual machine, the spec of which was
4 x Intel 2.4 GHz CPUs with 4GB RAM.
So the optimum number of workers / processes = 2 * CPUs + 1 = 9.
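
That rule of thumb is easy to compute on the box itself; a minimal check (just the arithmetic, nothing project specific):

import multiprocessing

# workers = 2 * CPUs + 1, so a 4 core VM gives 9 workers
workers = multiprocessing.cpu_count() * 2 + 1
print(workers)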

The following configurations were arrived at by tinkering with the settings for each server until optimal speed was achieved for 10 concurrent users.
Clearly this empirical approach may result in very different settings for your hardware, but at least it gives some idea of the appropriate settings - for a certain CPU / memory spec. server.

For Apache I found that things such as whether WSGIApplicationGroup was set were important, hence its inclusion, with a 20% improvement when on for MPM prefork or daemon mode, and when off for MPM worker mode.

Apache mod_wsgi prefork

WSGIScriptAlias / /virtualenv/bin/django.wsgi
WSGIApplicationGroup %{GLOBAL}

Apache mod_wsgi worker

WSGIScriptAlias / /virtualenv/bin/django.wsgi

<IfModule mpm_worker_module>
#  ThreadLimit    1000
    StartServers         10
    ServerLimit          16
    MaxClients          400
    MinSpareThreads      25
    MaxSpareThreads     375
    ThreadsPerChild      25
    MaxRequestsPerChild   0
</IfModule>

Apache mod_wsgi daemon

WSGIScriptAlias / /virtualenv/bin/django.wsgi
WSGIApplicationGroup %{GLOBAL}

WSGIDaemonProcess testwsgi \
    python-path=/virtualenv/lib/python2.7/site-packages \
    user=testwsgi group=testwsgi \
    processes=9 threads=25 umask=0002 \
    home=/usr/local/projects/testwsgi/WWW \
    maximum-requests=0

WSGIProcessGroup testwsgi

uWSGI

uwsgi --http :8000  --wsgi-file wsgi.py --chdir /virtualenv/bin \
                               --workers=9 --buffer-size=16384 --disable-logging


Gunicorn

django-admin.py run_gunicorn -b :8000 --workers=9 --keep-alive=5


Thursday, November 21, 2013

Django Cardiff User Group

Last night I went to the second meeting of the Django Cardiff User Group.

This is a sister group to the DBBUG Bristol based one that I have been attending for the last 5 years. It was organised by Daniele Procida, who started attending DBBUG events a few years ago and has now decided to spread the word over the Severn, in Wales.

He is also organising the first UK Django conference in a couple of months, https://djangoweekend.org/, so it's good to see one open source / Python group be the inspiration for spawning another, and one that is perhaps more organisationally active than its progenitor.

The evening was fun, and it was good to meet and chat with Djangonauts over the border.

Andrew Godwin, Django core developer / release manager, gave us an update on all the new goodies to be added in Django 1.7.
This release is largely about really sorting out the niggling issues with relational database features, and the low level ORM handling of them.
It sees a rationalisation of transaction handling with the use of nestable atomic statements, the addition of generic connection pooling, and the handling of composite keys.

Daniele demonstrated how to fly a helicopter (a toy one) via the Python command line, although Andrew seemed rather more adept at landing it safely. I gave a little reprise of a talk introducing DBBUG and how a developer can follow the road to their own open source contributions.

Thanks to everyone involved, I hope to get to the Django weekend too.

The ten commandments of software procurement

For a medium to large scale organisation with its own IT department, I have found that in today's market the following truths about software procurement apply. Yet they are usually poorly understood by staff in organisations outside the software sector, who often view the world through antique pre-1990 glasses, from before the significant impact of web based providers and the mixed economy of revenue models of modern software companies ...
  1. Software is like any other creative output, it differs radically in quality, modernity and appropriateness - and this is entirely unrelated to its cost. Partly because the majority of today's leading software development companies are internet companies who do not use software charging for revenue. 
  2. So whether or not software is charged for directly via a licensing model is unrelated to whether it is mostly open source or closed source / commercial. Some software is no longer purchasable or the paid for solutions are too poor quality to be viable, compared to the free ones. In such cases other non-financial trading decisions must be part of the procurement arsenal. So policies on data release, etc.
  3. Whether something is open or closed source is entirely irrelevant to its quality, scalability or any other attribute you care to name. These days any software stack is likely to be a mix of both.
    However given source, tests, community and commit rate can all be checked for the former, it is far easier not to pick a lemon, with open source (not that a non-technical organisation tends to use any of these core indicators for procurement assessment).
  4. Software is basically like literature - there are your Barbara Cartlands and your Shakespeares - unfortunately fewer people are able to read it to work out what quality it is, so it's a book which is generally just judged by its cover - hence the common misconception that software is all roughly the same, or that its quality relates to its cost.
  5. However, the more generic a software application is, the more likely it is that you get better quality for a lower cost - standard economy of scale.
    Hence Google GMail / Microsoft Office / open source Apache - are good quality - because they are large scale generic applications.  
    The more specific an application is, the more likely the software (whether open source or commercial) will have been put together by a core group of at most 3 or 4 developers, hence have less quality control methods applied, be more buggy and risk being generally of a lower standard.
  6. If the IT Services department of your organisation is not sufficiently powerful to tell the users what they are going to get, despite what they want, it is common that many of the systems it deploys will require significant customisation - the more specific they are, the more the customisation.
    Customisation of outsourced, closed source products is likely to incur significantly greater time and development cost than open source ones, whether customised in house or outsourced. If customised in house, then unless the software has a well designed API, docs etc. - i.e. it is a widely used generic system from a major company - you usually find that you can only do black box integration and wrapper coding, or resort to breaking license agreements by decompiling. All of which is difficult to maintain.
    If outsourced, then the code may be open, test suited and documented within the supplying company, but you are likely to be paying the company around 3 times your in-house cost for a junior developer's customisation / bug fixing time.
  7. For historical reasons, some types of software have far superior products that are all in one of these camps rather than the other ... so open source finance software is poor, closed source web CMS software and repository software is poor, etc.
  8. Non-technical companies will go through a 5-10 year cycle of outsourcing as much software as possible, then auditing consultancy costs, then ballooning internal development to cut costs, then deciding too much development is in house and going back to outsourcing again. This cycle wastes a lot of money due to its lack of understanding of the benefits of a stable, highly selective mixed economy of outsourced, open source, commercial and in-house software as the ideal balance of functionality vs. cost.
  9. Buying mix and match products from integrated product suites is a recipe for high cost, eg. MS Exchange Email and Google Docs, rather than all from one or other supplier.
  10. Lastly and most importantly a non-technical organisation always makes its software procurement decisions based on political reasons*. Never on technical ones. This invariably means that it makes decisions that are significantly more costly, difficult to maintain and less well featured than it could achieve using a purely technical assessment process.
    Usually they will also fail to have processes to properly trial alternative products in a realistic manner, or to audit selections once the initial purchase is made. This may partly be because, although auditing may save significant costs in the long run, it does introduce a means by which a wrong choice can be flagged up. Unfortunately it is often less embarrassing to make do with a bad choice until its end of life than to admit a failure. Even though failing, and acceptance of it as part of the process, is essential to the delivery of quality (rather than make-do) systems.

Thank you ... rant over :-)

* political reasons - the salesman managed to persuade someone suitably senior, and technically clueless enough, to believe them. This usually goes in tandem with the company software team's response ... the salesman promised them it did what?? ... make damn sure that isn't in the contract / licensing agreement.




Monday, June 3, 2013

IT Megameet

Yes, MegaMeet may have a slightly cheesy ring to it, but the Bristol IT MegaMeet was a lot of fun, and a great idea for a regional software community event. Unlike most conferences, this one is not for a particular company, language, platform or area of software expertise. Instead it brings together all the voluntary community software and technology groups within the region of Bristol, UK.

There are quite a number as it turns out, and so squeezing the conference into a single day resulted in 5 tracks. For a conference organised for the first time last year by a student to save his course - thanks Lyle Hopkins, it rather put our local University's official efforts in software community engagement to shame - however perhaps it might encourage them to rise to the challenge. (Lyle is a student at one, and I work at the other.)

So of the perhaps 30-40 software groups that are based in and around Bristol, over 20 were represented, a good turnout partly due to the efforts of one of Lyle's fellow volunteers, Indu Kaila, to do the leg work of attending all the local events and getting various members (like myself) to volunteer to represent their group at the event. So I am one of the hundred odd members of Bristol and Bath's Django User Group (DBBUG), started by Dan Fairs, and did a presentation about Python, Django, our group, and the process of contributing to open source - so rather a lot to pack into 40 minutes, but it seemed to go down OK.

There was a full range of enthusiast groups present, so I started the day finding out how the four colour theorem from map making applies to optimisation algorithms used in compilers, courtesy of the ACCU, who have been around for a very long time, having started out as a C programming community group. Then near the finish I saw a good talk from Bristol Web folk reminding me of the core issues to remember concerning front end web development - as more of a back end developer it can be easy to label this stuff as somebody else's job, but with an ever increasing slice of the web development stack being client side these days, that is clearly a bad attitude.

There was more than a smattering of javascript related talks going on, from big data CouchDB and node.js back end use, through to more client side, and a very popular session, flying helicopters via javascript code.

The talks were rounded off with some about the charity cause that the day was helping to raise funds for, a cross-Atlantic row in aid of a cervical cancer charity (plus an appeal for graphic design work for another member of the volunteer team, from Ukraine, who is in need of health care).

I then found myself in the rather comical position of receiving two awards from the extensive award ceremony, for community involvement, etc. Both were really on behalf of other people, but it was fun and led on to the free bar and barbecue, always a popular way to round off a conference.

So thanks to the Megameet team, if nobody else comes forward, I can always represent DBBUG, South West Big Data or perhaps another new local group, again next year!


Wednesday, November 14, 2012

Cookie law, Cookieless and django tips.

django-cookieless

Last week I released a new add-on for django, django-cookieless. It was a relatively small feature that was required for a current project, and since it was such a generic package it seemed ideal for open sourcing as a separate egg. It made me realise that I hadn't released a new open source package for well over a year, so this one is certainly long overdue in that sense.

Cookie Law

It is also overdue in another sense: EU cookie law has been in force since May 2011, so legally any sites that are used in Europe and set cookies which are not strictly necessary for the functioning of the site must now request the user's permission before doing so. Of course it remains to be seen whether any practical enforcement measures will happen, although they were due this summer in the UK, for example. Hence many of the first rush of JavaScript pop-up style solutions have come and gone, as a result of user confusion. But for public sector clients particularly, it is certainly simpler to just not use cookies if they are not technically required. Also it may at least make developers rather less blasé about setting cookies.

Certainly most people would prefer not to see their browsers filled with deliberate user tracking and privacy invasive cookies that are entirely unrelated to the site's functionality - in the same way most of us don't like being tracked by CCTV everywhere we go. Unfortunately, the current law doesn't have a good technical solution behind it, hence it may well founder over time. This is because cookie control is too esoteric for ordinary users, and even with easy browser based privacy level configuration, any technical solution is problematic, because a single cookie can be used both to protect privacy (in terms of security - e.g. a CSRF token) and to invade it. It is entirely down to the specific application's usage of it where these distinctions lie. Invasive methods can also be implemented via other session maintenance tools, such as URL rewriting, yet because no data is written to the user's browser this falls outside the remit of the law - so the law makes little sense currently, and may well be unenforceable.

Perhaps it would have been better to aim to set laws that encourage adherence to set standards of user tracking, starting with compliance with the browser 'Do Not Track' header, perhaps adding some more subtle gradations over time. The targets of the law should be companies whose core business is user tracking for advertising sales etc., starting with Google and working down, rather than pushing the least transgressive public service sector, as the most likely to comply, to add a bunch of annoying 'Will you accept our cookies?' pop-ups.

However, even if this law dries up and blows away, for our particular purposes we needed django to cater for any number of sessions per browser (as well as not using cookies for anonymous users).
Django's default session machinery requires cookies, so it ties a browser to a single session - request.session is set against a cookie. But because django-cookieless provides sessions maintainable by form posts, it automatically delivers multiple sessions per browser.

There are a number of security implications with not using cookies, which revolve around the difficulty of preventing session stealing without them. Given this is the case, django-cookieless has a range of settings to reduce that risk, but even so I wouldn't recommend using it for sessions that are tied to authenticated users, and hence could lead to privilege escalation, if the session were stolen.

Django Tips

I thought the egg would be done in a day, but in reality it took a few days, due to a number of iterations that were necessary as I discovered a series of features around the lesser known (well, to me) parts of django. So I thought I would share these below, in case any of the tips I gained are useful ...

  1. The request object life cycle goes through three main states in django:
    unpopulated - the request that is around at the time of process_request type middleware hooks - before it gets passed by the URL handler to decorators and then views.
    partly populated - the request that has session, user and other data added to it (mainly by decorators) and gets passed to a view
    fully populated - the request that has been passed through the view to add its data, and is used to generate a response - this is the one that process_response sees.
  2. I needed to identify requests that were decorated with my no_cookies decorator at the time of process_request, but the flag it sets has not been set yet. However there is a useful utility to work around this, django.core.urlresolvers.resolve, which when passed a path gives a match object containing the view function to be used, and hence its decorators, if any (see the sketch after this list).
  3. Template tags that use a request get the unpopulated one by default. I needed the request to have the session populated, for the option of adding manual session tags - see the tags code. To have the partly populated request available in tags, django.core.context_processors.request must be added to TEMPLATE_CONTEXT_PROCESSORS in settings.

  4. The django test framework's test browser is in effect a complex mocking tool that mocks up the action of a real browser; however, like any mock object, it may not exactly replicate the behaviour one desires. In my case it only turns on session mocking if it finds the standard django session middleware in settings. In the case of cookieless it isn't there, because cookieless acts as a replacement for it, and a wrapper to use it for views undecorated with no_cookies. Hence I needed to use a trick to set a TESTING flag in settings - to allow for flipping cookieless on and off.
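
To illustrate tip 2, here is a minimal sketch of the resolve trick. The marker attribute and middleware are made up for illustration - the real django-cookieless decorator and middleware do rather more than this.

from django.core.urlresolvers import resolve

def no_cookies(view_func):
    # Illustrative marker only - not the real django-cookieless decorator
    view_func._no_cookies = True
    return view_func

class CookielessSketchMiddleware(object):
    def process_request(self, request):
        # The decorator's effects are not visible on the request yet at this
        # point, but resolving the path gives us the view function itself
        match = resolve(request.path_info)
        if getattr(match.func, '_no_cookies', False):
            pass  # switch to cookie-free session handling here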

Tuesday, October 23, 2012

My struggles with closed source CMS

I produce a content migration tool for the Plone CMS, called ilrt.contentmigrator. It wraps up Zope's GenericSetup as an easily configurable tool to export content from Zope's bespoke object database, the ZODB, to a standard folder hierarchy of content with binary files and associated metadata in XML.

Some time ago I added a converter to push content to Google Sites, and I have recently been tasked with pushing it to a commercial CMS. Unfortunately, rather than a few days' work as before, this has turned into a marathon task, and I am still unsure whether it is achievable, due to political and commercial constraints.

So I thought I should at least document my progress, or lack of, as a lesson for other naive open source habituated developers, to consider their estimates carefully when dealing with a small closed source code base, of which they have no experience.

Plan A - Use the API


So the first approach I assumed would be the simplest was to directly code a solution using "the API".

API is in quotes here since, in common with many small commercial software suppliers, the name API in fact referred to an automated JavaDoc dump of all their code; there was no API abstraction layer, or external RESTful / SOAP API, to call. It's basically the equivalent of 'read the source' for open source projects - but with the huge disadvantage of only legally having access to read the bare, largely uncommented, class and method names, not being able to look at the source to see how they worked - or why they didn't.

Also no other customers had previously attempted to write code against the content creation part of the code base.

Anyhow back to the project, content import code was written and run, but nothing changed via the web front end.

It turns out that without a cache refresh the Admin interface does not display anything done via the API, hence it is essential to be able to determine if changes have occurred.

Similarly if content is not cleared from the waste-basket then it cannot be recreated in the same location, along the lines of a test import scenario.

Having written the code to set up the cache and other API managers and clear it, I discovered that
cache refresh doesn't work via the API, and neither does clearing the waste-basket.

The only suggested solution was turn the CMS off and on again.

Plan B - Use the API and a Robot


Rather than resort to such a primitive approach, I decided to develop a Selenium WebDriver based robot client. This could log into the CMS and run all the sequences of screen clicks that it takes to clear the waste-basket and cache after an API delete has been called.

Eventually all this was in place, now content could be created via the API, and media loaded via the robot (since again anything that may use local file system caches or file storage, is inoperable via the API).

The next stage was to create the folder hierarchy and populate it with content.

Unfortunately at this point a difficult to trace API bug reared its head. If a subfolder is created in a folder via the API, then it gets created in a corrupted manner, and blocks subsequent attempts to access content in that folder, because the subsection incorrectly registers itself as content - but is then found to be missing. After much time spent tracing this bug, the realisation dawned that it would therefore not be viable to create anything but a subset of content objects via the API, and everything else would need the robot mixed in to work.

This seemed like a much less maintainable solution, especially since most pages of the CMS had 50 or more javascript files linked to them, so only a current browser WebDriver client robot would function with it at all. Even then, often the only way to get the robot clicks and submits to work was to grab the javascript calls out of the page source and call the jQuery functions directly with the WebDriver javascript engine.
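
As an example of that last trick, the robot ends up doing things roughly like the sketch below - the URL, element id and jQuery call are placeholders, not the real CMS ones.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://cms.example.org/admin")  # placeholder admin URL

# Where a plain element click() fails on these script-heavy pages, call the
# page's own jQuery handler directly via the WebDriver javascript engine
driver.execute_script("jQuery('#empty-wastebasket').click();")
driver.quit()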

Plan C - Use the import tool and a Robot


So, having wasted 3 days tracing a bug that was in the (closed source) API, it was time to take a step back and think about whether there was realistically a means by which an import tool could be created by a developer outside of the company supplying the CMS, i.e. me.

Fortunately the CMS already had an XML export / import tool. So all we need to do is convert our XML format to the one used by the company, and the rest of the code was their responsibility to maintain.

At first their salesman seemed fine with this solution, so I went away and started on that route, having left the existing code at the point where the sub-folder creation API bug blocks it from working.

However, on trying out the CMS tool, it also failed to work in a number of ways. The problems that it currently has are listed below, and my focus at present is writing a Selenium based test suite that will perform a simple folder, content and media export and import with it.

Once the tool passes, we have confirmation that the API works (at least within the limited confines of its use within the tool). We can then write a converter for the XML format and a driver for the tool - or even revisit the API + robot route, if it's fixed.

Below are the issues, that need to work, and that the test suite is designed to confirm are functional ...

Content Exporter Issues (status in brackets)

  1. The folder hierarchy has to be exported separately from the content. If you select both at once - it throws an error (minor - run separately)
  2. The hierarchy export appends its data when exported to the same folder, so creating an invalid hierarchy.xml after the first run (minor - could delete on the file system in between) 
  3. Hierarchy export doesn't work. It just creates XML with the top level folder title wrapped in tags containing the default configuration, attributes - but no hierarchy of folders. (blocker - need the hierarchy especially to work, since the sub-folder creation was the blocking bug issue with using the API directly)
  4. Content export only does one part of one page at a time, ie. a single content item (minor - this means that it is not a very useful export tool for humans - however via a robot - it could be run hundreds of times to get a folder done)
  5. The embedded media export doesn't work, no media is created (blocker - we need to be able to do images and files)
  6. Content import - a single content item works, and if the media already exists with the right id, that works too. Can't judge media import, since media export fails, so we have no format to follow. (blocker - we need media to work as well, as a minimum. Ideally we could import all the parts of a page in one go - or even more than one page at once!)
  7. Hierarchy import - Creating a single section works. Cannot judge for subsections - since the export doesn't work. (pass?)
  8. Configuration changes can break the tool (blocker - the whole point of the project is to provide a working tool for a phased transition of content, it should work for a period of at least two years)
  9. Not sure if the tool can cope with anything but default T4 metadata (minor - a pain, but the metadata changes to existing content are part of the API that should function OK directly, so could be done separately to the tool's import of content.)

Once we have a consistently passing CMS tool, we can assess the best next steps.

The testing tool has proved quite complex to create too, because of the javascript issues mentioned above, but it now successfully tests the run of an export of a section and a piece of content, checking the exported XML file, and also runs the import for these to confirm that the functionality is currently at the level listed above.

Having been burnt by my experience so far, my intention is to convert the Plone export XML and files to the new CMS native XML format - push it to the live server and run the robot web browser to trigger its import, so that eventually we will have a viable migration route - as long as the suppliers ensure that their tool (and underlying API) are in working order.
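
A rough sketch of the shape that converter is likely to take is below - every element name in it is a placeholder, since the real target schema is whatever the CMS export tool emits.

import xml.etree.ElementTree as ET

def convert_item(source_path, target_path):
    # Read one exported Plone content item and rewrite it in the (assumed)
    # single-item XML layout that the CMS import tool expects
    src = ET.parse(source_path).getroot()
    out = ET.Element('content')
    ET.SubElement(out, 'title').text = src.findtext('title', default='')
    ET.SubElement(out, 'body').text = src.findtext('text', default='')
    ET.ElementTree(out).write(target_path, encoding='utf-8')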














