Sunday, August 2, 2015

Using Java Spring & MyBatis for dynamic schema integration

Using Spring MyBatis for dynamic schema integration may perhaps be subtitled, "Eating soup with a fork" ... since that is how it has felt at times. However that may partially be due to my lack of familiarity with these tools. So this blog posting is about why I got that feeling, how I bent my fork to the purpose ... and if I missed some tricks in doing that. Please let me know by commenting below :-)

Object-relational mapping - a personal history

I have been developing relational database applications for many years so I have been through the various stages of database persistence approaches. Starting with fairly raw SQL development or at most just fixed DAO wrapping. Through SQL templating languages and mapping approaches to largely adopting the full object-relational mapping, ORM, approach for the past few years. I even tended to use the same pseudo ORM top level for NoSQL sources, although of course a lot of ORM features then become inapplicable, such as use of an ORM model's relational navigation (eg. person.department.head.surname)

Having said that the object-relational impedance mismatch can still bite. However good an ORM is there is some need for understanding of what its API simplifications and abstractions are doing. To avoid excessive  / badly performant SQL. But given that caveat, the amount of time saved in development and maintenance by a well implemented full ORM inevitably saves money on all but the simplest persistence requirements. Similarly the automation involved in schema migration, constraint, indexing and other data modelling needs is invaluable. As long as one knows what it is doing under the hood and hence when that 10-20% of database development / customised data design is required. For example custom functions and triggers, data writable views that present as a table for the ORM, etc.

MyBatis - is it an ORM I see before me?

Recently I have started working with a team who use an older Java technology set of Spring and MyBatis. My current project uses the Java MyBatis SQL framework to integrate various relational database sources.
MyBatis badges itself as a persistence framework. So it is not really claiming to be an ORM. It started as an SQL templating engine and that is still where its core skills lie. But it now has many other features bolted on. Including meta programming to generate Java relational mappers from database sources.

So it does have many ORMish features. However using MyBatis has been rather a painful journey for me, since I have found that it fails to implement many ORM design pattern principles that my ORM habituation had come to expect. Also using MyBatis with Spring MVC means we havent got proper MVC since the model is not a model as it would be with Hibernate ... its a MyBatis Mapper.
This is not just a minor niggle, on first use a core ORM feature is gone. A Mapper is not an object model of a table's row of data, with transaction and session management under the hood. Instead it is just a convenient object to hook up standard data queries and updates to with the management of saves and synchronisation of actions on Mappers and results returned by them, entirely manual. So in that sense you have less than what some more basic DAO wrappers give you, ie. when you update a record then fetch the values from that record they will return the data written into and queried back from the database, hence an inserted saved model would automatically hold the new primary key(s).
With Mappers the data readable from the Mapper will just be the data set to be updated not the actual data held in the database as a result of that update. To find out the result of your query you need to ensure you get a fresh query result, by requerying or flushing session.

There is a work around for this common requirement, you can annotate an insert with a custom select query to tell MyBatis to re-query just for the key manually before or after doing the update. However I found this didn't work well for Oracle and was a problem anyway with what I required. Largely because I was using Generator based classes, so adding custom annotations to these would not have worked for me.

@SelectKey(statement="SELECT MY_SEQUENCE.NEXTVAL() FROM DUAL", keyProperty="nameId", before=true, resultType=int.class) 
int insertMyTable(Name name); 

Similarly there is no lazy update concept, updates are only and always done when you specifically run them. In effect its like using an ORM but disabling all the relational modelling, data synchronisation machinery and transactional management. Mappers are instead just convenience wrappers for running SQL - MyBatis Generator may do the ORM process of creating Java classes that (via XML) map to database tables or views. But mapper instances are not ORM synchronised table row objects.

The normal object centric approach is also not available. Whilst you can generate code from your database, you cannot generate your database from code. So full persistence data cycle management is not available out of the box. By that I mean the use of test harness and deployment tools that destroy and recreate the full database, populate it from fixtures, run tests and drop it again. A migration tool that introspects databases and can migrate back and fortbr /br /br /h between schema versions to sync them with code versions or vice versa, ie code -> data, schema management, in addition to doing the MyBatis Generator data -> code direction. So the sort of thing that a full stack web framework like Django has with its ORM.

This was a blow for my particular project since full data cycle management is exactly what I required for developing a central data aggregation and integration database. I also required proper fixture management for testing and the same serialized data input / output components of this being used for the main work of data aggregation from data sources in serialized format (XML, CSV etc.)
Finally the system was dealing with pulling in and delivering back out complex, evolving schemas from source to consumer databases. The one thing that would be essential is that all the schema handling and data typing be dynamic so that the code can handle this change rate without constantly breaking and having to be rewritten. So hard coded schema details and static typing must be avoided for any chance of maintainability. But it was to be written in a statically typed language. Again the tool was not suited to the job. So how did I bend this seemingly unsuitable platform to the task in hand.

Bending the fork

The first step was to tackle the dynamic schema vs. static typing issue. Java reflection is your friend here. Then reinvent some of the missing ORM wheels. Fixture loading, schema generation.
Along with working around the lack of object data synchronisation and automatic session management. Finally we wanted to make our data population speedy by only updating data that needed updates. So a hashing mechanism that could work with the serialized data was required.
Note that our in house Java standards decree that all configuration should be done via annotations rather than XML where available ... so that is pretty much everything in Spring ... aside from the core MyBatis Mappers XML files. The aggregation database was Oracle and various database's including MS SQL Server were the data sources.

Dynamic schema handling

Use MyBatis Generator, adding a config file for each database source and the target aggregation database.
Add the name of each table or view required for them to these config files ... eg.
<table tableName="PERSON" modelType="flat"></table>

ideally that would be all the schema customisation required as long as data source and target naming of tables and columns matched. However in a handful of cases mapping data (eg. tableA.colB = tableC.colD) was needed in the properties files.

I ran this as part of the Maven build at first, but it proved more manageable to be able to run code generation for the sources and target databases at different times, so  I switched to wrapping this up in a command class for triggering independently via a batch job scheduler. In the same way as the data population commands were to be run. Schema -> code generation tended to be much faster than either data population or code -> schema generation, so freqprepreuent execution was not an issue.

The core class I wrote for enabling dynamic schema handling was called TableMeta and each table has an instance of this class listing all the metadata about its columns and primary keys etc.

This class was injected into a ModelFactoryService whose job is to return results or add, edit and delete models from any Mapper.
The Factory Service has a reflection based method for invoking the MyBatis generated Mapper methods, e.g. "SelectByExample"  ...

  protected Object invokeMapperMethod(String methodName, Object parameter) {
    Class klass = exampleClass; 
    if (parameter != null) {
      klass = parameter.getClass();
    } else {
      parameter = newExample();
    }
    try {
      Method method = mapperClass.getDeclaredMethod(methodName, klass);
      try {
        // Run the method  
        return method.invoke(mapper, parameter);
      } catch (IllegalAccessException | IllegalArgumentException exRun) {
        Logger.getLogger("ModelFactoryService").log(
         Level.SEVERE, modelClass.getSimpleName() + "." + methodName, exRun);
      }
    } catch (NoSuchMethodException | SecurityException exCall) {
      Logger.getLogger("ModelFactoryService").log(
         Level.SEVERE, "Mapper class has no " + methodName + " method", exCall);
    }
    return 0;
 }


... similarly there were reflection based set and get column methods surfaced by whole Mapper data modification methods such as doUpdate(Object model)
The TableMeta caters for initial setup of the classes generated for that particular table by MyBatis Generator. Straight use of Class.forName(className) works with manipulation of the table name string copying that of the Generator's standard naming convention. The table's Mapper, Model and Example classes are then added as the template classes for the Factory. From that a Setters and Getters hash of the table's Mapper methods for each column can be built automatically.

Fixture loading

A sax parser based loader class reads in data fixtures in XML format. Each fixture row is parsed to a hash of tag name to value which can be passed via a SaverService to the ModelFactory.
An ObjectConvertor static class caters for de-serialising fixture data to the correct Java types looked up via the column name from TableMeta, and the Setter hash can then be used to update the matching named columns in the model.

Fixture dumping

For fixture dumping I cheated a little. Since I was only ever dumping from the target Oracle database then rather than reinventing a full serialisation of models (e.g. ORM with built in serialisation from any database to xml, json, yaml, csv etc.) I just provided a tool to do the minimum required for my testing and development needs - to just be able to serialise any table or view to XML from Oracle.
So in this case, to avoid the maintenance madness of a separate statically typed fixed schema solution for each table - I did the XML part in Oracle.
That way I could use a GenericMapper which generated two string fields based on a dynamic SQL query built from the TableMeta. The first being the row XML and the other the concatentated list of the serialised columns to be contained within it. MyBatis @SelectProvider annotation allows the gluing on of a method taken from another class to generate SQL ...

 @SelectProvider(type = TableDump.class, method = "getXMLProvider")
 List<GenericModel> getXML(@Param("meta") TableMeta meta, @Param("rows") final int rows);

The method glued on is this one which uses Oracle's native XML methods and the TableMeta's list of columns to generate a query that directly returns the XML..

 /**
   * Dynamic SQL generator method generates XML output for fixture as row and cols 
   */
  public String getXMLProvider(Map params) {
    final TableMeta meta = (TableMeta) params.get("meta");
    final int rows = (int) params.get("rows");

    String sql = new SQL() {
      {
        String rowXML = " '' || XMLElement(\"row\", XMLAttributes(";
        for (String col : meta.getPKeys()) {
          rowXML += col + " AS \"" + col.toLowerCase() + "\", ";
        }
        rowXML = rowXML.substring(0, rowXML.length() - 2) + ")) \"ROWKEYS\"";
        SELECT(rowXML);
        String queryXML = "''";
        col_loop:
        for (String col : meta.getCols()) {
          queryXML += " || XMLElement(\"" + col.toLowerCase() + "\", " + col + ")";
        }
        SELECT(queryXML + " \"ROWCOLS\"");
        FROM(meta.getTableName());
        for (String col : meta.getPKeys()) {
          ORDER_BY(col);
        }
        if (rows > 0) {
          WHERE("rownum <= " + rows);
        }
      }
    }.toString();
    return sql;
  }

This above snippet shows an example of MyBatis SQL templating - the core of MyBatis.
The result is a single Generic Mapper that by calling with the appropriate tables TableMeta can return an XML dump of any table in the target database. For test fixture generation this can be passed a result length ... since usually ten or twenty will be sufficient for integration testing.

Save methods

I won't go into the save methods in detail - suffice to say the data was hashed on the way into Oracle and then a query of primary keys to last modified hashed data returned of all rows to allow incremental update by hashing each of the XML data sources rows and checking it first to determine if an update or insert was required. Whilst this could not take advantage of databases or Java's hashing - since it had to be serialised data compatible - it did have a significant impact. Since the rate of incremental updates meant that we are getting a maximum of 1% data churn per table, so even with all the hashing and comparision overhead - an incremental update is still around 30 times faster.

For versioned data a core data table and a versioning table was required and this then needed both to get access to the new primary keys. As mentioned this is not straight forward in MyBatis - so the easiest solution was to have a bespoke version Mapper that just wrapped a single versioning sequence and could be used to update the related data and version tables by calling custom VersionMappers nextVal or currVal methods.

Code Schema Cycle

As mentioned MyBatis doesnt cover the code to schema half of the persistence cycle, so the option was to employ a separate full schema life cycle framework such as Liquibase. Or roll my own. In this case my requirements were not for extensive migration features. Since as an aggregation database the schema could be snapshotted dropped and rebuilt. So to avoid complexity and further dependencies I just added a tool to build Oracle schema from standard dumps of the schema from an Oracle client tool. So you just dump each object to a separate DDL SQL file, then it checks all the files works out which object type they relate to and loads them in a sequence which should prevent referential integrity clashes, ie.

DATABASE LINK, SEQUENCE, TABLE, MATERIALISED VIEW, FUNCTION, VIEW,  INDEX, PROCEDURE, TRIGGER, CONSTRAINT, FOREIGN KEY

This is wrapped up as a DatabaseReset command that either drops all data or the full database and rebuilds the schema. So a run of  reset followed by the GenerateMappers command gives a freshly built persistence layer and Model code to talk to it. It would be nice to have the schema built from the Mappers code and a more standard coherent code schema cycle, but given Mappers are in use, that would not be available even if Liquibase were in the mix.

Spring issues

So as a newbie to MyBatis I guess I had some problems with it. But as long as this slightly irascible post is. Guess what, I also had issues with Spring too ... so I will make it even longer by getting those off my chest too :-)

It seems to require huge amounts of configuration to do what you want. Again it can be made to do most things, but ... and maybe to some extent ... because of that ... it takes an awful lot of configuration to do some of the things a more opinionated full stack web framework would do out of the box. I guess I need to come across the sort of edge cases which are really difficult in a full stack framework, to appreciate Spring, since currently it feels like it needs a great deal of maintenance heavy tinkering to do some of the basics.

So the first surprise was that all my Spring @Service classes default to singletons. So in order to have for example a ModelFactory for the two different tables a versioned update required I needed to build a BeanFactory and make these beans and annotate them as prototypes - ie normal classes not singletons. I guess this is because all of this related to command classes not Spring MVC web classes ... which would have had web session scope.

  @Bean
  @Scope(value = ConfigurableBeanFactory.SCOPE_PROTOTYPE)
  public ModelFactoryService modelService() {
    return new ModelFactoryService(prop, hashService());
  }

So all my indvidually injected service classes tended to have to be uninjected and instead the bean factory injected in their place.

Along with that the database session to mapper class connection seemed rather frail.
The only way to ensure this worked was never to use a generic session handling method but always use separate prototype beans for separately named session classes annotated to find the correct mappers for that particular database ...

@Configuration
@MapperScan(basePackages = "uk.ac.bris.adt.erp.dataint.sources.mm", sqlSessionFactoryRef = "MMSessionFactory")
public class MMDBConfig extends DBConfigABC { ... }

... not only that the directory hierarchy matters due to the niceties of MyBatis-Spring , since the MapperScan will find anything at or below a directory ... so you cannot put mappers for one connection below the point you need to scan for another ... or they will be sucked up as Mappers for the wrong session.

Test configuration

Also the test configuration for integration tests required a lot of manual annotations to pick up appropriate configuration environment in order to inject things into the context in a working manner to match the running code configuration. Again I had been spoilt by expecting a framework specific automatically working test harness with default test customisations of the runtime code environment thrown in. Instead each integration test needed to use a wrapper that called a RunnerBeanConfig

@PropertySource("classpath:test.properties")
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(loader = AnnotationConfigContextLoader.class, classes = ITRunnerBeanConfig.class)
public class ModelFactoryServiceIT extends ServiceABC { ... }

The RunnerBeanConfig then needs to have all the database and resources configs annotated as Imports or it cannot find the different database sessions. Only after all that is it possible to inject the runtime command classes for testing.

@ComponentScan("uk.ac.bris.adt.erp.dataint")
@PropertySource("classpath:test.properties")
@Configuration
@Import({ DataIntDBConfig.class, MMDBConfig.class ... ResourceConfig.class })
public class ITRunnerBeanConfig implements EnvironmentAware {

  private Environment env;

  @Override
  public void setEnvironment(Environment environment) {
    this.env = environment;
  }
  ...
}

Conclusion

OK so I got it working in the end. However it all felt rather like I was doing something rather bespoke and complicated. This added a great deal of development time (or more accurately configuration wrangling time) to deliver components of persistence layer management and testing - that I had previously expected would all be available already in a mature high level framework. So in summary I probably have to conclude that if you needed to do this task in a more maintainable and standardised manner with a minimum of custom code. Don't use Spring and MyBatis. However if you already use one of those tools ... and you need to add this sort of functionality with them. Then it is certainly doable, and perhaps if you are a newbie to them, like me, this post may be of some use in speeding the job up for you :-}

Thursday, July 30, 2015

Picking a language for shell scripting based on its framework ecosystem


  • Do you work somewhere that has a lot of old shell scripts? 
  • Were they all dashed off quickly according to the whims of their original creator - just deploy scripts so maybe they missed some of the care (and documentation) the application code had?
  • Whenever a major revision of said shell scripts is needed, do all of them end up in the bin - sorry trash in case you thought I meant the /usr/bin. Because its easier to write such platform and author specific scripts from scratch than maintain them?

Well the team I recently joined does. So as somebody who personally moved from shell scripts to a shell compatible object orientated scripting language some time ago. Then on to a shell framework, I was asked to assess what to move our menagerie of bash, zshell, cshell, powershell etc. scripts to.

My remit was to recommend a language instead of all those developer specific procedural shell scripts. After all the point of a mainstream language is to introduce cross platform code with a huge set of useful libraries, along with hopefully some standard approaches. Hence get above the sea of shells.

However my preference is to move the team to using a shell framework.

Because what a framework does is give you opinions. It chooses how to do stuff, so you ... and all the developers that come after you, don't have to. In my opinion most of the productivity comes from everyone becoming familiar with the same way of doing things ... whether or not its the best way
(and if it isn't the best way the framework developers spend a lot of time fixing that way with hopefully the minimum of disruption to its API for the end users).

I felt it would be useful to see how active the shell framework ecosystem of a language was, in order to assess how suited to shell scripting it is. Since the goal of a shell framework is essentially the same goal, standardisation of shell scripts. So one indicator for the most suited language is to list the mainstream shell frameworks for the contenders and see which comes out top.

It was decided we will choose from Javascript, Python or Perl. So this is for a team that almost exclusively use Java, which due to its pre-compiled statically typed nature is inherently poorly suited to shell scripting.
Unfortunately Ruby was not to be a candidate. However I have added it anyway, because if its good enough for Amazon, and leading configuration management frameworks such as Puppet ... it should be included.

Strangely it is difficult to find definitive lists of shell frameworks ... perhaps whilst I thought the term had been around for 10 years or more, it has not been coined very effectively ... its not even on Wikipedia ... maybe I made it up! Anyway they exist whether or not there is a proper name for them ... they also often cross over the grey area into being full blown configuration management tools.
So to avoid that argument ... I will add those in as well. I have marked entries with a * where I think it is more a config management tool than a shell framework.

For configuration management systems wikipedia does have an entry ... so using it as a similar benchmark it gives a relative language score of:

Python 11, C 5, Perl 5, Java 4, Ruby 3,
and 1 each for C++, Scala, Erlang, Go and PHP  (none for Javascript)

So on the same sort of  basis I have ranked the languages with the most healthy shell framework ecosystem ... in the belief that it is a good indication that they may be the most suited to being used for shell scripting ...

Python 8

Python has the largest number of well established shell frameworks, as it does in terms of configuration management systems. So on that basis I could award it first prize as the most used language for standardising shell scripting. However numbers are not everything. It doesn't necessarily have the leaders of those two categories of software (see Ruby below)
  1. fabric
  2. cliff
  3. python-deploy-framework
  4. clustershell
  5. builtoncement
  6. ansible *
  7. salt *
  8. bcfg *

Ruby 6

Ruby has capistrano which is possibly the most popular of all shell frameworks, it also has two of the most popular configuration management systems, Puppet and Chef, so on that basis it should perhaps be first or at least joint first with Python. As the most appropriate language for modern shell scripting.

Perl 3

Perl has things that perhaps the other languages don't in related areas. So a number of shell implementations in Perl, and frameworks for writing custom shells. But in terms of what I mean by a shell framework - ie. a tool to run shell commands in a standard way across different servers, shells and platforms it is surprisingly lacking.

Javascript 2

Ummm well Javascript may have broken free of the browser with node.js and other server side runtime environments. But are there any shell frameworks or configuration management systems built with it? ie. tools designed for running shell commands and doing anything other than Javascript installation. Not much TBH and certainly no configuration management engine.
  1. commander.js   (clone of Ruby commander)
  2. shelljs is the leading bunch of utilities for node.js that can act as a shell framework


Conclusion

Its a tie between Python and Ruby, so the decision as to what to go for depends on the individual tools that best suit your shell framework requirements, what they may need to integrate with wrt. configuration management and deployment tools ... plus the existing skills of your developers.
Oh and whether diversity is more important than being the market leader ... or vice versa.

For my own case, Ruby isn't in the running, so that makes it Python based on this premise.
I may follow up this blog post with one comparing the languages wrt. a sample shell script - and see which looks the most maintainable. As a second means of deciding between them.

Note: One other factor is related virtual frameworks ( ... yep made that term up too!)
Basically these wrap up configuration and deployment of virtualised software deployments - either hypervisors or containers - to give a full platform application build - usually for development purposes.
So here Ruby is well ahead with Vagrant along with box grinder

Sunday, November 16, 2014

The 10 commandments of maintainable web services

Here is a list of the ten core elements needed for a development to deployment phase infrastructure to provide a stable service for your web applications. Along with minimizing time wasted on bugs and issues that are unrelated to functional development, and slashing the maintenance time and cost - compared to systems without them. I guess it could also be called automation, automation, automation ...

It should be noted that just because an application is a legacy one. It does not mean that all of this infrastructure cannot be retro fitted to it. *
  1. Standard environment
    A set of consistently built and upgraded deployment phase environments - dev, demo, train, prod for the full application stack - e.g. app server, cache, web server and storage. All development and deployment is done on these entirely standard (ideally config management / virtualised) cloned environments. If random desktop / laptop computers must be used then ideally a virtual box build version should be provided for dev, to match the deployment ones.
    For web applications the server side will be a single environment, but if client side software is involved it may require multiple standard environments for build and test.
  2. Automated build
    Run one command or press one button to create a full application stack instance on any of the deployment environments. Including production. So this should be everything above the standard environment and ideally include storage too (see data automation). Each developer can build numbers of deployment instances in the same automated fashion. Builds should be remotely runnable for plugging into Continuous Integration, C.I., servers etc.
  3. Automated release management
    Particularly important is that no manual tasks are needed to deploy to production. A push button C.I. driven deploy should be used where each deployment is retained in a full log accompanied by summary deployment note and related software packages release history and source tag. This full logging of changes ties into software service change management concepts. If unforeseen dependency system issues develop a lot later, they can then potentially be tied to the highly detailed timestamped change logging that this provides.
    Automating the roll-out means that you should also automate the roll back. You hopefully will test well enough not to need a safety net, but not bothering to use one is reckless.
    Another common loop hole is that release only covers the application layer. The standard environment, storage etc. are all part of the stack and changes in them are also releases, and need the same release management controls in place.
  4. Revision control of the entire application stack
    Everything in the application should be versioned. So all the source code of course. But also all the deployment and automation code. The third party components should all have their own versions (if not download, version and deploy from your own local repository). That includes the application specific environment configuration, eg. Apache virtual host configuration.
    Build automation should allow specification of tags (or to a date or to any previous release - logged via C.I.)
    The code dependency stack should also be versioned - so the versions of every component for a system release. Language specific build tools such as Maven, Pip, Ant, Phing, Bundler, Buildout etc. provide this. The standard environment(s) should also be versioned via their config management tool. 
  5. Integrated documentation
    Core documentation should be written and versioned with the source code, each package should at least have README and a release HISTORY tied to each production releases version number. These need to be kept up to date with the rest of the source. Separate wikis for fuller / less technical docs are fine - but documentation of changes in functional specification need to use the same version control as the code - unless all your code has rigorous processes around a versioning integrated issue tracker - that is most reliably done by putting documentation in the code.
    Ideally the language's packaging tools should have a system to extract embedded documentation and comments into HTML on a software repository server - for easy reference.
    Automation to keep the web documentation up to date should be implemented. 
  6. Software upgrade process
    Major version platform upgrades should always be performed within a year of release date. Not just for security patch reasons (these must be carried out within a month at most). Ideally the former within a few months and the latter within a few days. Any longer and code divergence can make the upgrade hill too big a cost to scale, or compromise systems data. Major language / framework (as well as releases) - should not require significant system outages. These may not be automated to set up, but they should be automated to roll over between upgrades - so if you are without a multi-server load balanced layer in part of your application stack - then downtime should still be under a minute at most, e.g. an Apache or Database restart.
  7. Automated testing
    They may not provide great coverage but a minimal test suite is a necessity to allow confirmation of success for the automation infrastructure.
    Good test coverage means that complex functional errors or regressions can be written as tests and added to periodic builds - so ensuring that future releases are free of them - but a set of minimal functional or black box tests are sufficient to cover basic confirmation that automated environment upgrades, or minor application fix releases do not caused critical failures. These tests can also be tied to monitoring / timed load testing - to check upcoming releases for performance regressions.
  8. Data automationThis involves data fixtures, automated schema generation and synchronisation.
    With the advent of an object relational mapper (ORM) as standard in today's web applications. Then your system should have a full data abstraction layer, in even the most micro web application framework. In turn that means today's application code should contain within it the means to generate all of the data layer. Ideally ORM's should provide the means to abstract fully the database implementation, to generate that implementation within a range of RDBMS and to generate data fixtures for it, for building populated new development instances or for testing.
    As standard the test harness will setup and teardown the data layers.
    More mature ORMs will also have schema migration tools. These are essential for full automated release management, since invariably a significant release will involve a change to the data schema, or at least a new entry in the database. A synchronisation tool will tend to use meta-programming to automatically generate the migration code that synchronises the schema - that migration is then released (or rolled back) as part of the code release - keeping the data storage in the release management loop. Any data modification (DML) - that the application requires can be added to the DDL of the schema migration. These tools will also have introspection code to detect that data migration is required if connected to a previous version of the database. Bespoke applications may not have the tool, but at worst they should have data creation and migration code written and packaged with newly released versions - manual database tinkering around the same time as the code release, is not acceptable.
  9. Package management
    Application layer package management will always be language specific, but any language should offer it. Ideally a package repository should be maintained for each language your services use. These may be core to the language like PyPi and RubyGems or for languages without them in the core there are commercial offerings like Nexus for Java.
    This caters for version dependency management and reliable upgrade. Of course to use a package manager fully, you should package all your application source code. Ad hoc scripts or  framework app archives, raw class and resource bundles etc. - Just say no. If you are going to release your code rather than chuck it over the wall ... package it and version it. So all your code should be in jars, eggs, gems - or whatever your language likes to call them.
    Not only that you should apply the same rules to splitting up packages as you apply to splitting up code into classes. Some packages may be dependent on others - but each separate component of the application should be a different package - to allow it to be separately version controlled and released. To encourage encapsulation and hence allow for packages to be reused, retired or replaced without replacing the whole application's code base.
    (NB: Environment package management will be operating system specific and that should be implemented as part of the standard environment config management layer - no building from source here!)
  10. Monitoring
    One of the most important issues with logging and error notification is the cry wolf factor. You need to ensure that you draw the line in the right place for what are critical errors - ie. those that generate notifications to people. You can have over reporting initially if it makes you hammer down on all those bugs to get a reasonable level. But the one thing that makes monitoring ineffective, is over reporting, if a system is emailing you a hundred stack traces a day, and have been for the last month - or the critical log is equally verbose - you filter the emails and ignore the log. You need critical bug notifications to be rare enough that you jump straight on fixing them when they are sent. Of course don't over do it, ideally you shouldn't ever be in the position when the only reason you know that a service is down is because an end user has phoned up to tell you. If your monitoring is good enough it will always do that, for all but the most involved functional errors.
    You also need standard uptime monitoring such as Nagios or the like to notify if services have failed completely (unable to send application layer errors) for each of the layers - web, storage, cache, environment.
    Plus load logging for each, response time logging, etc. Most importantly you need to retain the logging over time and hence be able to look back at problems vs. change management data (see automated release management) to be able to diagnose many service issues and ideally predict and forestall them.

Walk the walk

So do I have the ten commandments in place for all our production systems in my current work place? In part, we have for all our Python Django web applications (although some are a bit sparse in places - eg. monitoring, release management below the application layer). But our Java architecture only has packaged components, although work is being done for new Java Spring systems to provide automated build, ideally some tests and the need for monitoring is recognized. Hopefully we will  tick all ten boxes for it too, eventually. So we will have as solidly maintainable a Java Spring platform as we have with our Python Django infrastructure.

However the concern is perhaps as much with all our legacy or outsourced systems integration code. These have none of these components and no realistic likelihood of getting them. Hence there is a  huge support burden that results, diverting time away from providing them and leading to unreliable services. Add to that the problem of how platforms can be frozen, whilst still in use, as with our legacy Python (zope) architecture and then rot and lose the maintenance infrastructure that they had, (Our old CMS went live with half of the above features - now it has none) and the picture becomes a little bleak. Here the answer is perhaps to start to implement much more hard nosed rules wrt. to retiring systems, if they have replacements, whether or not those replacements fully cover the same functional space. Essentially this is a management, not a technical issue.

With a much reduced set of critical legacy systems and appropriate resourcing it would be possible to retrograde add the commandments to them, and bring all services up to a similar quality control.

However  the problem is greatly exacerbated by 'new' legacy bought in systems. So by this I mean third party supplier systems that we run and have to maintain (eg. regular upgrades, performance monitor etc.) that do not have most of the above features. Unfortunately something that appears true of all the smaller supplier's systems procured recently - ie. companies with under 10 core developers. Perhaps because most of them are providing products that actually are legacy ie. have not been written, or fully rewritten, in the last 6 years (for the full rant on this topic see the ten commandments of software procurement!)

* Fixing the legacy and external systems

There are plenty of configuration management and shell framework tools that can be applied to automate even the messiest old legacy systems. The key rule here is you don't need to write any of the infrastructure in the legacy code base. So use your standard CI server, shell framework and config management tools - don't add more procedural platform specific code (e.g. raw shell scripts).
Modern automation tools should all be pretty platform independent - although if running Windows and Unix you may be better using a different shell framework for each, eg. Fabric and PowerShell, possibly the same for config management tools.

If the code contains closed source compiled components with no versioning. Then the binaries can still be put into version control and release numbers assigned. At worst decompilation tools can be used - if there no other reasonable way to fix or replace the components.

Similarly black box testing tools can be applied to any software, and if none of the technical team know what that code is doing - end users can provide a basic functional spec of what its meant to do, and these few basic stories used to create some minimal BDD tests.
Data in / data out dumps and comparisons can also be used as a basis for manually maintained fixtures. Legacy components can be split up and packaging added to them ... but then much more work along this line of legacy code re-factoring and we start to raise the question of respecify / rewrite / replace being more cost effective.

Wednesday, October 29, 2014

Fixing third party Django packages for Python 3

With the release of Django 1.7 it could be argued that the balance has finally tipped towards Python 3 being its preferred platform. Well given Python 2.7 is the last 2.* then its probably time we all thought about moving to Python 3 for our Django deployments.

Problem is those pesky third party package developers, because unless you are determined wheel reinventor (unlikely if you use Django!) - you are bound to have a range of third party eggs in your Django sites. As one of those pesky third party developers myself, it is about time I added Python 3 compatibility to my Django open source packages.

There are a number of resources related to porting Python from 2 to 3, including specifically for Django, but hopefully this post may still prove useful as a summarised approach for doing it for your Django projects or third party packages. Hopefully it isn't too much work and if you have been writing Python as long as me, it may also get you out of any legacy syntax  habits you have.

So lets get started, first thing is to set up Django 1.7 with Python 3
For repeatable builds we want pip and virtualenv - if they are not there.
For a linux platform such as Ubuntu you will have python3 installed as standard (although not yet the default python) so if you just add pip3 that lets you add the rest ...

Install Python 3 and Django for testing


sudo apt-get install python3-pip
(OR sudo easy_install3 pip)
sudo pip3 install virtualenv



So now you can run virtualenv with python3 in addition to the default python (2.*)

virtualenv --python=python3 myenv3
cd myenv3
bin/pip install django


Then add a src directory for putting the egg in you want to make compatible with Python 3 so an example from git (of course you can do this as one pip line if the source is in git)


mkdir src
git clone https://github.com/django-pesky src/django-pesky
bin/pip install -e src/django-pesky


Then run the django-pesky tests (assuming nobody uses an egg without any tests!)
so the command to run pesky's test may be something like the following ...

bin/django-admin.py test pesky.tests --settings=pesky.settings
One rather disconcerting thing that you will notice with tests is that the default assertEqual message is truncated in Python 3 where it wasn't in Python 2 with a count of the missing characters in square brackets, eg.

AssertionError: Lists differ: ['Failed to open file /home/jango/myenv/sr[85 chars]tem'] != []


Common Python 2 to Python 3 errors


And wait for those errors. The most common ones are:

  1. print statement without brackets
  2. except Error as err (NOT except Error, err)
  3. File open and file methods differ.
    Text files require better quality encoding - so more files default to bytes because strings in Python 3 are all stored in unicode
    (On the down side this may need more work for initial encoding clean up *,
    but on the plus side functional errors due to bad encoding are less likely to occur)
  4. There is no unicode() method in Python 3 since all strings are now unicode - ie. its become str() and hence strings no longer need the u'string' marker 
  5. Since unicode is not available as a method, it is not used for Django models default representation. Hence just using
    def __str__(self):
            return self.name
    is the future proofed method. I actually found that models with __unicode__ and __str__ methods may not return any representation, rather than the __str__ one being used, as one might assume, in Django 1.7 and Python 3
  6. dictionary has_key has gone, must use in (if key in dict)

* I found more raw strings were treated as bytes by Python 3 and these then required raw_string.decode(charset) to avoid them going into the database string (eg. varchar) fields as pseudo-bytes, ie. strings that held 'élément' as '\xc3\xa9l\xc3\xa9ment' rather than bytes, ie. b'\xc3\xa9l\xc3\xa9ment'

Ideally you will want to maintain one version but keep it compatible with Python 2 and 3,
since this is both less work and gets you into the habit of writing transitional Python :-)

Test the same code against Python 2 and 3


So to do that you want to be running your tests with builds in both Pythons.
So repeat the above but with virtualenv --python=python2 myenv2
and just symlink the src/django-pesky to the Python 2 src folder.

Now you can run the tests for both versions against the same egg code -
and make sure when you fix for 3 you don't break for 2.

For current Django 1.7 you would just need to support the latest Python 2.7
and so the above changes are all compatible except for use of unicode() and how you call open().

Version specific code


However in some cases you may need to write code that is specific to 2 or 3.
If that occurs you can either use the approach of latest or anything else (cross fingers)

try:
    latest version compatible code (e.g. Python 3 - Django 1.7)
except:
    older version compatible code (e.g. Python 2 - Django < 1.7)

Or you can use specific version targetting ...

import sys, django
django_version = django.get_version().split('.')

if sys.version_info['major'] == 3 and django_version[1] == 7:
    latest version
elif sys.version_info['major'] == 2 and django_version[1] == 6:
    older django version
else:
    older version


where ...

django.get_version() -> '1.6' or '1.7.1'
sys.version_info() -> {'major':3, 'minor':4, 'micro':0, 'releaselevel':'final', 'serial':0}

Summary

So how did I get on with my first egg, django-csvimport ? ... it actually proved quite time consuming since the csv.reader library was far more sensitive to bad character encoding in Python 3 and so a more thorough manual alternative had to be implemented for those important edge cases - which the tests are aimed to cover. After all if a CSV file is really well encoded and you already have a model for it - it hardly needs a pesky third party egg for CSV imports - just a few django shell lines using the csv library will do the job.







Thursday, July 3, 2014

Spring MVC setup on Ubuntu

Recently setting up Spring MVC on Ubuntu 14 with Netbeans wasn't entirely obvious for a newbie, so I thought I would document it in case it saved somebody 10 minutes!


First install Apache and Tomcat, if you haven't got them already...

sudo apt-get install apache2


sudo apt-get install tomcat7 tomcat7-docs tomcat7-admin tomcat7-examples

You should also have the default openjdk for tomcat and ant build tool and git

sudo apt-get install default-jdk ant git

Edit the tomcat-users.xml netbeans requires a user with the manager-script role
(NOTE: you shouldn't give the same user all these roles in a production Tomcat!
Also note that these manager roles have changed from Tomcat 6)

sudo emacs /etc/tomcat7/tomcat-user.xml


<tomcat-users>
  <role rolename="manager-gui"/>
  <role rolename="manager-script"/>
  <role rolename="manager-jmx"/>
  <role rolename="manager-status"/>
  <role rolename="admin-gui"/>
  <role rolename="admin-script"/>
  <user username="admin" password="admin" roles="manager-gui,manager-script,manager-jmx,manager-status,admin-gui,admin-script"/>
</tomcat-users>


Should restart tomcat after editing this ...

sudo service tomcat7 restart

Now you should be able to go to http://localhost:8080 and see

It works !

If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations! ...

Click on the link to the manager and get the management screen

If the login fails - reinstall apache and tomcat - it worked for me!

For Netbeans to find Apache OK you have to put the config directory where it expects it ...

sudo ln -s /etc/tomcat7/ /usr/share/tomcat7/conf

Note that the tomcat location, ie for the deploy directory is in

/usr/var/lib/tomcat7

Now install Netbeans, latest version is 8, either by download and install or

sudo apt-get install netbeans

Start up netbeans and go to  Tools > Plugins

Pick the Available plugins tab

Search for web and tick Spring MVC - plus any others you fancy!

Restart Netbeans

Add a new project

  1. Choose New Project (Ctrl-Shift-N; ⌘-Shift-N on Mac) from the IDE's File menu. Select the Java Web category, then under Projects select Web Application. Click Next.
  2. In Project Name, type in HelloSpring. Click Next.
  3. Click the Add... button next to the server drop down
  4. Select the Apache Tomcat or TomEE server in the Server  list, click Next
    Enter  Server Location: /usr/share/tomcat7
    Enter the username and password from your tomcat-users.xml above and untick the create user box, if everything is working then it will accept this and add Tomcat to your server drop down list 
    (it shouldn't need to try to add the user unless that user isn't already properly set up with the manager-script role in Tomcat)
  5. In Step 4, the Frameworks panel, select Spring Web MVC.
  6. Select Spring Framework 3.x in the Spring Library drop-down list. 
    Spring Web MVC displayed in the Frameworks panel

 Click Finish and you should have a skeleton Spring MVC project, pressing the Play button should build it and run it up, then launch your chosen browser with the home page of that project via the Apache Tomcat you have setup.
Any changes should get auto-deployed and popped up in the browser again by pressing play.


Friday, May 2, 2014

Lessons learned from setting up a website on Amazon EC2

I recently got involved with helping someone sort out their website on an Amazon EC2 instance, it had been a few years since I had the need to do anything with EC2, I realised that I was a novice in this world - and it raised a number of issues related to deploying to EC2 and performance.

So I thought it may be useful to run through them for any other EC2 novices who are asked to do something similar, and want to learn from my rather blundering progress through this :-)

Apologies for those of you are already well familiar with EC2 for covering some of the basics.

The system moodpin.co.uk was based on a commercial PHP application, Pintastic.
So this allows you to set up a site like pinterest.com or wanelo.com
These sort of sites are for creating subject specific photo sharing social media systems, so like Instagram, Picassa etc. but focussed around communities of shared (usually commerical) interest. For example buying shoes, interior decor etc.
The common UI that they tend to present are big scrolling pages of submitted images related to topics for sharing, comment and discussion.

So this system sends out a lot of notification emails, involves displaying hundreds of images per page - the visual pin board - and to help with performance has custom caching built in - triggered by cron jobs.

Hence we have a number of cron jobs with the caching ones running every couple of minutes. To me this appeared a pretty crude caching mechanism - but my job was not to rewrite the application, but just tweak the code and get it all running OK.
The code mainly uses a standard MVC approach like everything else these days!

So demonstrating how outdated my knowledge of EC2 or this application were. I thought OK - first of all what platform is it. It was Amazon's own Linux - this uses yum rather than apt for package installs so as distros go its perhaps more Redhat-like than Debian.

For those unfamiliar with the basics - go to Amazon web services and sign up!
You can then choose to add some of the 40 odd different services that are available under the EC2 umbrella.

Once you have signed up to a few of these, you get a management console that links to a control dashboard for each service. The first step usually being - the one with the computer instances on, EC2. From there you can pick an AMI (ie. operating system image), a zone - eg. US West (Oregon) and use it to create a new instance. Add an SSH key pair for shell access and then fire it up and download the pem file so you can ssh into your new Amazon box.

So the client wanted the usual little tweaks to PHP code,  CSS tweaking - so easy stuff its just web development ... done in a jiffy (well after digging through the MVC layers, templating language, cache issues and CSS inheritance etc. for a fairly complex PHP app you have never come across before, when PHP is not exactly your favourite language ... jiffyish maybe)
Then we got to the more SysAdmin related requests ... lets just say I probably shouldn't rush out and buy a DevOps tee-shirt just yet ...

'Get email working'

  1. Try to send an email from the web application - write a plain PHP script that just sends a test email - just run mail from the linux command line ... Got it there is no MTA installed! 
  2. Install an MTA - sendmail. Go back up that stack of actions and they are all working ... hurray that was easy.
  3. A week or so later ... 'emails stopped working'
  4. Go back to step 1. and yep - emails stopped working
  5. Look at the mail logs and see what the problem is.
  6. Realise that there are masses of emails being sent out ... but all of it is bouncing back as unverified.
  7. Think ... wow that pintastic site's notifier is busy - must be getting lots of traffic *
  8. So why has Amazon started bouncing all the email?
  9. Search Amazon's docs. Amazon has a very minimal test quota allowed for email. Once that quota is filled, unverified email will be blocked.
  10. Amazon has historically been one of the main sources of SPAM machines, that history means that it has to set up a much more elaborate mechanism for validating email that most hosting companies, and it no longer allows direct emailing from EC2 boxes (apart from minimal test quotas)
  11. So what we need to do is set up our mail to be sent via the Amazon SES service - add SES service and enable it
  12. So now we need to send authorised emails to the Amazon SES gateway that will then forward them on to the outside world
  13. Try to get sendmail to send authenticated emails, follow guide but it continues to bounce with authentication failure, give up and install postfix, follow the 20 steps of setting up the SASL password etc., and eventually it doesn't bounce with authentication errors - hurray!
  14. But the email still bounces. So we need to verify all our sending email addresses - managed by the SES console - or use DKIM to get the whole domain verified and signed from which we are sending.
  15. Modify the emails used by the sending software to ones which we can receive and validate - send and validate them. Our emails are working again.
  16. Leave it a few days, we are not sending email anymore, boooo!
  17. Check all the SES documentation, surprise, surprise SES also has quota limits for test level only, and you have to formally apply to get those limits lifted.
  18. Contact the client and get him to make a formal request for quota lifting on his account.
  19. *As part of the investigation check that email log a little more closely, it seems rather large, and we seem to be using up our quotas really quickly ... ah the default setup for unix cron sends an email for every job that returns text. The pintastic cache job returns text, so we are sending a pointless email every two minutes ... or trying to ... whoops. Make sure no cron or other unix system command is acting as a SPAM bot.
  20. A few days later - Amazon say our quota has been lifted
  21. Our emails have started sending again ... and they are still sending today !!!
Clients response, OK thanks, by the way since we added all the start up data / ie. uploaded images, the site takes at least two minutes to render the home page - or times out altogether.
Hmmm I did kinda notice that ... but hey he hadn't asked me to make the site actually usable speed wise ... until now!

'Why is the site, really, really slow?'


Hmmm wow it really is slow, lots of the time it just dies, that PHP cache thingy can't be doing much, so whats the problem.

  1. Lets look at the web site, wow it takes 5 minutes for the page to come back ... so this isnt exactly Apache bench territory ... run up a few tabs looking at the home page ... and it starts just returning server timeouts.
  2. So whats happening on the server ... whats killing the box ... top tells us that its Apache killing us here - with 50 odd processes spawning and sucking up all memory and CPU.
  3. So we check out our Apache config and its the usual PHP orientated config of MPM prefork. But what are the values set ... they are for a great big multiprocessor cadillac of a machine, whilst ours is more of a smart car in its scale. 
  4. Lesson is that Amazon AMI's are certainly not smart enough to have different image configs for different hardware specs of instances they provide. So it appears they default their configs to suiting the top of the range instances (since I guess they cost the most). If you have a minimal hardware spec box ... you should reconfigure hardware related parameters for the software you run on it ... or potentially it will fail.
  5. Slash all those servers, clients etc. values to the number of servers and processes the server can actually deliver. Slightly trial and error here ... but eventually we got MaxClients 30 instead of 500 etc. and give it a huge timeout.

    <IfModule prefork.c>
    StartServers       4
    MinSpareServers    2
    MaxSpareServers  10
    ServerLimit      30
    MaxClients       30
    MaxRequestsPerChild  4000
    </IfModule>
  6. Now lets hammer our site again ... hurray it doesn't completely fall over ... one day it may return a page, but its horribly horribly slow still ie. 3 minutes absolute top speed - further home page requests the slower they get.
  7. So lets get some stat.s, access the page with browser web dev network tools. Whats taking the time here. Hmmm web page a second, not great but acceptable, JS and CSS 0.25 sec, OK. Images hmmm images hmmm for the home page particularly ... 3-6 minutes ... so basically unusable.
  8. So time to bite the bullet we know Apache can be slower at serving static pages if its not optimised for it - especially if resources are limited (its processes have a bigger memory overhead), thats why the Apache foundation has another web server, Apache Trafficserver , for that job
  9. But whats the standard static server (thats grabbed half of Apache's share of the web in the last few years), yep nginx
  10. So lets set up the front end of our site as nginx acting as a reverse proxy to Apache just doing the PHP work, with nginx serving all images. So modify Apache to just serve on 8080 on localhost and flip the site over to an nginx front end, with the following nginx conf ...

    server {        listen       80;
            server_name  moodpin.co.uk;
                                                                                                                                                                                                       
            location ^~ /(cache|cms|uploads) {
                     root   /var/www/html/;
                     expires 7d;
                    access_log  /var/log/nginx/d-a.direct.log ;
            }
                                                                                                                                                                                                      
            location ~* \.(css|rdf|xml|ico|txt|gif|jpg|png|jpeg)$ {
                     expires 365d;
                     root  /var/www/html/;
                    access_log  /var/log/nginx/d-a.direct.log ;
            }

          location / {
                proxy_pass         http://127.0.0.1:8080/;
     
    Wow, wow, so take that 3-6 minutes and replace it with 1-2 seconds.
  11. So how many images on the home page - about 150 plus more with scrolling ... so that means we have a site that is on average under 0.5% dynamic code driven content and 99.5% static content/requests per page.
    That is a very very static site - hence the 100 x faster speed!
  12. So there you go client take that souped up smart car and go 
  13. Client replies ... ummm sites down - server proxy timeout error
  14. Go to Google and check, so we have to make sure that nginx has timeout settings greater than Apache's - and nginx default timeout is 60 seconds
  15. Make nginx _timeout settings into 10 minutes ... sounds bad, try the site, and it consistently delivers pages in 3 seconds or so assume that the scrolling request update page nature of the app, makes the timeout required much longer than the apparent time Apache is delivering PHP within?
  16. Show the client again, hes happy.
  17. Few days later ... this bit of the sites not working now
  18. Check the code, discover that there is a handful of javascript files used by the system that are not really static - they are PHP templates generating javascript that appear static. Remove js file types from the list of files above in the nginx config. Hurray generated javascript served from Apache PHP now. Bit of site works again
  19. OK we are done ... don't run Apache bench against the site ... if the client actually gets any users and it cant' cope - tell him to upgrade his instance.

    I hope my tails of devops debuggery are useful to you, Bye!
       

    Monday, January 13, 2014

    Postgres character set conversion woes

    I had to struggle with sorting out some badly encoded data in Postgresql over the last day or so.
    This proved considerably more hassle than I expected, partly due to my ignorance of the correct syntax to use to convert textual data.

    So on that basis I thought I would share my pain!

    There are a number of issues with character sets in relational databases.

    For a Postgres database the common answers often relate to fixing the encoding of the whole database. So if this is the problem the fixes are often just a matter of setting your client encoding to match that of the database. Or to dump the database then create a new one with the correct encoding set, and reload the dump.

    However there are cases where the encoding is only problematic for certain fields in the database, or where you are creating views via database links between two live databases of different encodings - and so need to fix the encoding on the fly via these views.

    Ideally you have two databases that are both correctly encoded, but just use different encodings.
    If this is the case you can just use convert(data, 'encoding1', 'encoding2') for the relevant fields in the view.

    Then you come to the sort of case I was dealing with. Where the encoding is too mashed for this to work. So where strings have been pushed in as raw byte formats that either don't relate to any proper encoding, or use different encodings in the same field.

    In these cases any attempt to run a convert encoding function will fail, because there is no consistent 'encoding1'

    The symptoms of such data is that it fails to display. So is sometimes its difficult to notice until
    the system / programming language that is accessing the data throws encoding errors.
    In my case the pgAdmin client failed to display the whole field so although the field appears blank, matches with like '%ok characs%' or length(field) still work OK. Whilst the normal psql command displayed all the characters except for the problem ones, which were just missing from the string.

    This problem has two solutions:

    1. Repeat the dump and rebuild approach with the correct encoding, but to write a custom script in Perl, Python or the like to fix the mashed encoding - assuming that the mashing is not so entirely random as to be fixable via an automated script*. If it isn't - then you either have to detect and chuck away bad data - or manually fix things!

    2. Fix the problem fields via pl/sql, pl/python or pl/perl functions that process these to replace known problem characters in the data.

    I chose to use pl/sql since I had a limited set of these problem characters, so didn't need the full functionality of Python or Perl. However in order for pl/sql to be able to handle the characters for fixing, I did need to turn the problem fields into raw byte format.

    I found that the conversion back and forth to bytea was not well documented, although the built in functions to do so were relatively straight forward...

    Text to Byte conversion => text_field::bytea

    Byte to Text conversion => encode(text_field::bytea, 'escape')

    So employing these for fixing the freaky characters that were used in place of escaping quotes in my source data ...

    CREATE OR REPLACE FUNCTION encode_utf8(text)
      RETURNS text AS
    $BODY$
    DECLARE
        encoding TEXT;
    BEGIN
        -- single quote as superscript a underline and Yen characters              
                                                
        IF position('\xaa'::bytea in $1::TEXT::BYTEA) > 0 THEN
            RETURN encode(overlay($1::TEXT::BYTEA placing E'\x27'::bytea from position('\xaa'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
        END IF;

        -- double quote as capital angstroms character                                                                                                                              
        IF position('\xa5'::bytea in $1::TEXT::BYTEA) > 0 THEN
            RETURN encode(overlay($1::TEXT::BYTEA placing E'\x22'::bytea from position('\xa5'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
        END IF;
        RETURN $1;
    END;
    $BODY$

    Unfortunately the Postgres byte string functions don't include an equivalent to a string replace and the above function assumes just one  problem character per field (my use case), but it could be adapted to loop through each character and fix it via use of overlay.
    So the function above allows for dynamic data fixing of improperly encoded text in views from a legacy database that is still in use - via a database link to a current UTF8 database.

    * For example in Python you could employ chardet to autodetect possible encoding and apply conversions per field (or even per character)

    Site code, Google Apps integration and design - Ed Crewe 2011