Monday 13 January 2014

Postgres character set conversion woes

I had to struggle with sorting out some badly encoded data in Postgresql over the last day or so.
This proved considerably more hassle than I expected, partly due to my ignorance of the correct syntax to use to convert textual data.

So on that basis I thought I would share my pain!

There are a number of issues with character sets in relational databases.

For a Postgres database the common answers often relate to fixing the encoding of the whole database. So if this is the problem the fixes are often just a matter of setting your client encoding to match that of the database. Or to dump the database then create a new one with the correct encoding set, and reload the dump.

However there are cases where the encoding is only problematic for certain fields in the database, or where you are creating views via database links between two live databases of different encodings - and so need to fix the encoding on the fly via these views.

Ideally you have two databases that are both correctly encoded, but just use different encodings.
If this is the case you can just use convert(data, 'encoding1', 'encoding2') for the relevant fields in the view.

Then you come to the sort of case I was dealing with. Where the encoding is too mashed for this to work. So where strings have been pushed in as raw byte formats that either don't relate to any proper encoding, or use different encodings in the same field.

In these cases any attempt to run a convert encoding function will fail, because there is no consistent 'encoding1'

The symptoms of such data is that it fails to display. So is sometimes its difficult to notice until
the system / programming language that is accessing the data throws encoding errors.
In my case the pgAdmin client failed to display the whole field so although the field appears blank, matches with like '%ok characs%' or length(field) still work OK. Whilst the normal psql command displayed all the characters except for the problem ones, which were just missing from the string.

This problem has two solutions:

1. Repeat the dump and rebuild approach with the correct encoding, but to write a custom script in Perl, Python or the like to fix the mashed encoding - assuming that the mashing is not so entirely random as to be fixable via an automated script*. If it isn't - then you either have to detect and chuck away bad data - or manually fix things!

2. Fix the problem fields via pl/sql, pl/python or pl/perl functions that process these to replace known problem characters in the data.

I chose to use pl/sql since I had a limited set of these problem characters, so didn't need the full functionality of Python or Perl. However in order for pl/sql to be able to handle the characters for fixing, I did need to turn the problem fields into raw byte format.

I found that the conversion back and forth to bytea was not well documented, although the built in functions to do so were relatively straight forward...

Text to Byte conversion => text_field::bytea

Byte to Text conversion => encode(text_field::bytea, 'escape')

So employing these for fixing the freaky characters that were used in place of escaping quotes in my source data ...

CREATE OR REPLACE FUNCTION encode_utf8(text)
  RETURNS text AS
$BODY$
DECLARE
    encoding TEXT;
BEGIN
    -- single quote as superscript a underline and Yen characters              
                                            
    IF position('\xaa'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x27'::bytea from position('\xaa'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;

    -- double quote as capital angstroms character                                                                                                                              
    IF position('\xa5'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x22'::bytea from position('\xa5'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;
    RETURN $1;
END;
$BODY$

Unfortunately the Postgres byte string functions don't include an equivalent to a string replace and the above function assumes just one  problem character per field (my use case), but it could be adapted to loop through each character and fix it via use of overlay.
So the function above allows for dynamic data fixing of improperly encoded text in views from a legacy database that is still in use - via a database link to a current UTF8 database.

* For example in Python you could employ chardet to autodetect possible encoding and apply conversions per field (or even per character)

Monday 6 January 2014

WSGI functional benchmark for a Django Survey Application

I am currently involved in the redevelopment of a survey creation tool, that is used by most of the UK University sector. The application is being redeveloped in Django, creating surveys in Postgresql and writing the completed survey data to Cassandra.
The core performance bottleneck is likely to be the number of concurrent users who can simultaneously complete surveys. As part of the test tool suite we have created a custom Django command that uses a browser robot to complete any survey with dummy data.
I realised when commencing this WSGI performance investigation that this functional testing tool could be adapted to act as a load testing tool.
So rather than just getting general request statistics - I could get much more relevant survey completion load data.

There are a number of more thorough benchmark posts of raw pages using a wider range of WSGI servers - eg. http://nichol.as/benchmark-of-python-web-servers , however they do not focus so much on the most common ones that  serve Django applications, or address the configuration details of those servers. So though less thorough, I hope this post is also of use.

The standard configuration to run Django in production is the dual web server set up. In fact Django is pretty much designed to be run that way, with contrib apps such as static files provided to collect images, javascript, etc. for serving separately to the code. Recognizing that in production a web server optimized for serving static files is going to be very different from one optimized for a language runtime environment, even if they are the same web server, eg. Apache. So ideally it would be delivered via two differently configured, separate server Apaches. A fast and light static configured Apache on high I/O hardware, and a mod_wsgi configured Apache on large memory hardware. In practise Nginx may be easier to configure for static serving, or for a larger globally used app, perhaps a CDN.
This is no different from optimising any web application runtime, such as Java Tomcat. Separate static file serving always offers superior performance.

However these survey completion tests, are not testing static serving, simpler load tests suffice for that purpose. They are testing the WSGI runtime performance for a particular Django application.

Conclusions

Well you can draw your own, for what load you require, of a given set hardware resource! You could of course just upgrade your hardware :-)

However clearly uWSGI is best for consistent performance at high loads, but
Apache MPM worker outperforms it when the load is not so high. This is likely to be due to the slightly higher memory per thread that Apache uses compared to uWSGI

Using the default Apache MPM process may be OK, but can make you much more open to DOS attacks, via a nasty performance brick wall. Whilst daemon mode may result in more timeout fails as overloading occurs.

Gunicorn is all Python so easier to set up for multiple django projects on the same hardware, and performs consistently across different loads, if not quite as fast overall.

I also tried a couple of other python web servers, eg. tornado, but the best I could get was over twice as slow as these three servers, they may well have been configured  incorrectly, or be less suited to Django, either way I did not pursue them.

Oh and what will we use?

Well probably Apache MPM worker will do the trick for us, with a separate proxy front-end Apache configured for static file serving.
At least that way, its all the same server that we need to support, and one that we are already well experienced in. Also our static file demands are unlikely to be sufficient to warrant use of Nginx or a CDN.

I hope that these tests may help you, if not make a decision, maybe at least decide to try out testing a few WSGI servers and configs, for yourself. Let me know if your results differ widely from mine. Especially if there are some vital performance related configuration options I missed!

Running the functional load test

To run the survey completion tool via number of concurrent users and collect stat.s on this, I wrapped it up in test scripts for locust.

So each user completes one each of seven test surveys.
The locust server can then be handed the number of concurrent users to test with and the test run fired off for 5 minutes, over which time around 3-4000 surveys are completed.

The number of concurrent users tested with was 10, 50 and 100
With our current traffic peak loads will probably be around the 20 users mark with averages of 5 to 10 users. However there are occasional peaks higher than that. Ideally with the new system we will start to see higher traffic, where the 100 benchmark may be of more relevance.

Fails

A number of bad configs for the servers produced a lot of fails, but with a good config these seem to be very low. So all 3 x 5 minute test runs for each setup created around 10,000 surveys, these are the actual number of fails in 10,000
so insignificant perhaps ...

Apache MPM process = 1
Apache MPM worker = 0
Apache Daemon = 4
uWSGI = 0
Gunicorn = 1

(so the fastest two configs both had no fails, because neither ever timed out)

Configurations

The test servers were run on the same virtual machine, the spec of which was
a 4 x Intel 2.4 GHz CPU machine with  4Gb RAM
So optimum workers / processes = 2 * CPUs + 1= 9

The following configurations were arrived at by tinkering with the settings for each server until optimal speed was achieved for 10 concurrent users.
Clearly this empirical approach may result in very different settings for your hardware, but at least it gives some idea of the appropriate settings - for a certain CPU / memory spec. server.

For Apache I found things such as WSGIApplicationGroup being set or not was important, hence its inclusion, with a 20% improvement when on for MPM prefork or daemon mode, or off for MPM worker mode.

Apache mod_wsgi prefork

WSGIScriptAlias / /virtualenv/bin/django.wsgi
WSGIApplicationGroup %{GLOBAL}

Apache mod_wsgi worker

WSGIScriptAlias / /virtualenv/bin/django.wsgi

<IfModule mpm_worker_module>
#  ThreadLimit    1000
    StartServers         10
    ServerLimit          16
    MaxClients          400
    MinSpareThreads      25
    MaxSpareThreads     375
    ThreadsPerChild      25
    MaxRequestsPerChild   0
</IfModule>

Apache mod_wsgi daemon

WSGIScriptAlias / /virtualenv/bin/django.wsgi
WSGIApplicationGroup %{GLOBAL}

WSGIDaemonProcess testwsgi \
    python-path=/virtualenv/lib/python2.7/site-packages \
    user=testwsgi group=testwsgi \
    processes=9 threads=25 umask=0002 \
    home=/usr/local/projects/testwsgi/WWW \
    maximum-requests=0

WSGIProcessGroup testwsgi

uWSGI

uwsgi --http :8000  --wsgi-file wsgi.py --chdir /virtualenv/bin \
                               --workers=9 --buffer-size=16384 --disable-logging


Gunicorn

django-admin.py run_gunicorn -b :8000 --workers=9 --keep-alive=5