Monday 13 January 2014

Postgres character set conversion woes

I had to struggle with sorting out some badly encoded data in Postgres over the last day or so.
This proved considerably more hassle than I expected, partly due to my ignorance of the correct syntax for converting textual data.

So on that basis I thought I would share my pain!

There are a number of issues with character sets in relational databases.

For a Postgres database, the common answers usually relate to fixing the encoding of the whole database. If that is the problem, the fix is often just a matter of setting your client encoding to match that of the database, or of dumping the database, creating a new one with the correct encoding set, and reloading the dump.
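For the whole-database case, the session-level and rebuild fixes look roughly like this. This is just a sketch - the database name, encoding and locale values are placeholders, and the dump and reload itself is done outside SQL with pg_dump and psql:

-- Per-session fix: declare the encoding your client actually uses,
-- so Postgres converts on the way in and out ('LATIN1' is just an example)
SET client_encoding TO 'LATIN1';

-- Rebuild fix: create a fresh database with the encoding you really want,
-- then reload a pg_dump of the old database into it
CREATE DATABASE mydb_utf8
    ENCODING   'UTF8'
    TEMPLATE   template0
    LC_COLLATE 'en_GB.UTF-8'
    LC_CTYPE   'en_GB.UTF-8';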

However there are cases where the encoding is only problematic for certain fields in the database, or where you are creating views via database links between two live databases of different encodings - and so need to fix the encoding on the fly via these views.

Ideally you have two databases that are both correctly encoded, but just use different encodings.
If this is the case you can just use convert(data, 'encoding1', 'encoding2') for the relevant fields in the view.
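Depending on whether you want text or bytea back, convert_from() or convert() does the work. A minimal sketch, assuming the legacy data arrives as LATIN1 bytes (the table and column names here are invented for illustration):

-- Expose LATIN1 bytes as readable text in a UTF8 database
CREATE OR REPLACE VIEW notes_utf8 AS
SELECT id,
       convert_from(raw_notes, 'LATIN1') AS notes
FROM legacy_notes;

-- convert() itself maps bytes between two named encodings, e.g.
-- convert(raw_notes, 'LATIN1', 'UTF8') returns the same text re-encoded as UTF8 bytes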

Then you come to the sort of case I was dealing with, where the encoding is too mashed for this to work: strings have been pushed in as raw bytes that either don't correspond to any proper encoding, or mix different encodings within the same field.

In these cases any attempt to run an encoding conversion function will fail, because there is no consistent 'encoding1' to convert from.

The symptom of such data is that it fails to display, so it is sometimes difficult to notice until the system or programming language accessing the data throws encoding errors.
In my case the pgAdmin client failed to display the whole field, so although the field appeared blank, matches with like '%ok characs%' or length(field) still worked OK. Meanwhile the normal psql client displayed all the characters except the problem ones, which were simply missing from the string.

This problem has two solutions:

1. Repeat the dump and rebuild approach with the correct encoding, but write a custom script in Perl, Python or the like to fix the mashed encoding on the way through - assuming the mashing is not so random that an automated script cannot fix it*. If it can't be fixed automatically, then you either have to detect and chuck away the bad data, or manually fix things!

2. Fix the problem fields on the fly via PL/pgSQL, PL/Python or PL/Perl functions that replace the known problem characters in the data.

I chose PL/pgSQL since I had a limited set of these problem characters, so didn't need the full functionality of Python or Perl. However, in order for PL/pgSQL to be able to handle the characters for fixing, I did need to turn the problem fields into raw byte format.

I found that the conversion back and forth to bytea was not well documented, although the built-in functions to do so were relatively straightforward...

Text to Byte conversion => text_field::bytea

Byte to Text conversion => encode(text_field::bytea, 'escape')
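A quick sanity check of that round trip, using a throwaway literal containing the 0xAA byte (octal \252) that the function below goes looking for:

SELECT 'don\252t panic'::bytea;                    -- escape-format input -> bytea
SELECT encode('don\252t panic'::bytea, 'escape');  -- bytea -> text again

If the plain text-to-bytea cast is not available on your Postgres version, convert_to(text_field, 'UTF8') and convert_from(byte_field, 'UTF8') are the documented equivalents.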

So, employing these conversions to fix the freaky characters that were used in place of escaped quotes in my source data ...

CREATE OR REPLACE FUNCTION encode_utf8(text)
  RETURNS text AS
$BODY$
BEGIN
    -- single quote stored as the 0xAA byte (superscript a underline / Yen character)
    IF position('\xaa'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x27'::bytea from position('\xaa'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;

    -- double quote stored as the 0xA5 byte (capital angstrom character)
    IF position('\xa5'::bytea in $1::TEXT::BYTEA) > 0 THEN
        RETURN encode(overlay($1::TEXT::BYTEA placing E'\x22'::bytea from position('\xa5'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
    END IF;
    RETURN $1;
END;
$BODY$
  LANGUAGE plpgsql;

Unfortunately the Postgres byte string functions don't include an equivalent of a string replace, and the above function assumes just one problem character per field (my use case), but it could be adapted to loop through and fix each occurrence via overlay (see the sketch below).
So the function above allows for dynamic data fixing of improperly encoded text in views onto a legacy database that is still in use, via a database link to a current UTF8 database.
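A sketch of what that looping adaptation might look like - the function name and the idea of passing the problem byte in as a parameter are just illustrative:

-- Replace every occurrence of a single problem byte in a bytea value
-- (assumes good is a single byte and differs from bad, else this would loop forever)
CREATE OR REPLACE FUNCTION replace_byte(src bytea, bad bytea, good bytea)
  RETURNS bytea AS
$BODY$
DECLARE
    result bytea := src;
    pos    integer;
BEGIN
    pos := position(bad in result);
    WHILE pos > 0 LOOP
        result := overlay(result placing good from pos for 1);
        pos := position(bad in result);
    END LOOP;
    RETURN result;
END;
$BODY$
  LANGUAGE plpgsql;

Either encode_utf8() above or a looping variant like this can then be called per field in the view definitions over the database link.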

* For example, in Python you could employ chardet to autodetect the likely encoding and apply conversions per field (or even per character).
