Wednesday, 1 December 2010

The maths of coding for performance

While preparing for an interview recently, I became aware that
although I thought I knew a bit about algorithms and data structures in Python, I was actually pretty ignorant of the field.
A gap in my programming skills resulting from being self-taught rather than classically computer-science trained. In the practical world of programming I knew how to write fast Python, which data structures to use for which task and how to use them. But I didn't really understand why.

Welcome to the world of big-Oh and the internals of data structures, and the maths of how they should be carefully chosen and used to avoid bottlenecks in your code.

Having only a few days to prepare, I rapidly digested a borrowed book,
Data Structures & Algorithms in Java. Although Java is not like Python in terms of its clean resemblance to pseudocode, I do at least find it easy enough to read compared to C++, and currently most books on this topic will be in one or the other. There are a few Python ones too, the latest being Python Algorithms.

As it turned out the book was an ideal introduction to the topic, and I would recommend it to anyone interested. I hesitate to try and summarise it here, but essentially there are some standard maths approaches that allow generic analysis of the speed and memory usage of algorithms, and a fairly small set of standard algorithms and data structures that come up over and over in computing.
It pays to know what these are, when you are using them, and when and how you should use one over another. The Python standard library has an even more restricted set than Java's, so there is no excuse for not knowing all of them - and their foibles.

Python data structures

Quick summary with some resulting basic python performance issues:

  • tuple / string - immutable sequences, hence building up a string by repeated concatenation creates and discards many intermediate strings; instead build the new string in one go with format % or ''.join

  • list - a dynamic array (vector), so it has fast indexed access, but insert speed is slower the further to the left the insertion is, so ideally always append and extend. Use for stacks (FILO).

    • Lists are sorted with Timsort, a hybrid merge / insertion sort that is optimal for real-world data patterns (used as the default Android sort too)

    • NB: List comprehensions provide the optimised map and filter style list manipulation that is the bread and butter of Python coding

  • array - from array import array; if you want a memory-compact single-type list then use an array (access performance is much the same as a list)

  • set - used more for functional reasons; it is quicker to convert list-like types to sets and back for set-type operations (membership tests, unions, intersections) than to hand-roll them

  • deque - for speedy insertion, deletion and left popping (FIFO) at the expense of slower indexed reads, use the doubly linked list based deque

  • dictionary - this is an open-addressed (single key/value per bucket) hash table, O(1) for most methods aside from iterating keys and values, which are O(n). Hence use if key in dict (or the older has_key), never if key in dict.keys(), which builds and scans a list.

    • Remember that hashing instances, methods, classes etc. into a dictionary for lookup and dispatch is a standard Python coding pattern.

    • Another hacky performance tip sometimes suggested is d.update(d) (note that update returns None, so don't assign the result back), the idea being a more sparse table and hence better read/write performance at the expense of memory - though since updating a dict with itself adds no new keys, measure before relying on it.

  • Ordered dictionary (2.7 or later) - adds ordered keys with the same amortized performance as an ordinary dict

  • heapq - a binary heap, a self-sorting structure kept in a flat list and suited to priority queue usage; import heapq then use heapq.heapify(list)

  • OOBTree - not standard Python but the cornerstone of Zope and the ZODB; BTrees are multi-way search trees whose block-sized nodes suit memory / file system storage. This structure was also described in the final pages of the aforementioned Java book.
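To make a few of the points above concrete, here is a minimal sketch using only the standard library (the variable names and sample data are mine, purely for illustration):

```python
from collections import deque
import heapq

# Build a string in one go rather than by repeated concatenation,
# which creates and discards many intermediate strings.
words = ['big', 'Oh', 'notation']
sentence = ' '.join(words)       # one allocation for the final string

# Lists are dynamic arrays: append/pop at the right end are O(1),
# so they make good stacks; inserting or popping at the left is O(n).
stack = []
stack.append(1)
stack.append(2)
top = stack.pop()                # 2 - last in, first out

# deques are optimised for both ends: popleft() is O(1), ideal for FIFO.
queue = deque([1, 2, 3])
first = queue.popleft()          # 1 - first in, first out

# Dict membership hashes the key directly - O(1); never scan .keys().
d = {'a': 1, 'b': 2}
present = 'a' in d               # constant-time hash lookup

# heapq turns a plain list into a binary heap in place.
jobs = [5, 1, 4, 2]
heapq.heapify(jobs)
next_job = heapq.heappop(jobs)   # always the smallest item, here 1
```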

Oh and the interview ... well it was for a certain company that uses Python for all its sysadmin and automation, but to a lesser extent for internal software development, where C++ is king and Java second. Python, Java and Javascript share the hosted platforms / SDKs role - have you guessed who it is yet? So all my algorithm prep was in vain, because although they had contacted me due to my open source Python packages, they were actually looking for a sysadmin and required relocation. Unfortunately even the chance of working for Google can't turn me into a sysadmin ;-)
I flunked some basic unix questions without getting a single programming question, let alone needing to give a big-Oh breakdown of the algorithms I might choose to solve some TopCoder-style programming test. Ho hum ... at least I can definitely say I learned something from the experience.

Monday, 15 November 2010

Smoothing a mashup of moodle, mahara and drupal

Over the last few months I have been involved in developing a system for an international school-level qualification organisation, which has recently gone live in the USA. The work was undertaken in association with a UK learning resources company that was hosting the system.

The work was funded by a US charitable foundation, and the system provides community support and training to teachers at schools in deprived areas, aiming to bolster the low numbers of their students moving on to Higher Education.

The software was a mashup of PHP applications: Moodle, the leading open source virtual learning environment; Mahara, the e-portfolio app; and Drupal, the popular CMS.

This multiplied by three the usual challenges of customising software written by third-party developers to fit the client's requirements. It also involved a high level of integration work. Then, to add to the challenge, the software was to be developed on our servers but deployed to Triple A's. Finally, PHP LAMP applications are not known for their clean separation of code, configuration and content.

All these factors meant that developing a means of well-controlled release management, and of content and configuration transfer between deployments, was essential to the success of the project.

Having already used Fabric for simpler deployment tasks with Python frameworks, it was the obvious choice here too. Although PHP also has shell frameworks built on the PHP CLI, they are not as mature as Fabric, and PHP is not a natural object-oriented shell scripting language.

Eventually we developed a set of Fabric scripts that automated a number of the functions that allowed us to easily develop, deploy and debug the system. A single install server command would download the three applications onto the target machine, install them, extend Drupal (with the help of drush), then add all the customization and integration code.

Similarly, a deploy to tag name command could be used to release a version of the code set to the demo server for test and approval, prior to using the same command with just a server name change to deploy to the external host's servers.

Further commands ran the involved single sign-on / RPC user integration procedure for Moodle and Mahara (the instructions for doing this manually run to a 37 page PDF!).

Overall, the complexity of all these operations and the different server environments meant that maintaining the scripts was rather time consuming. However, this time was easily recovered through automated deployment and set up. Perhaps more importantly, the ability to quickly roll database content, SSO integration etc. across different code versions and servers gave us an invaluable tool for both development and debugging of the critical issues that arose.

Ideally, with more time / budget for tighter release management and pre-release testing, we could have had an even smoother and more robust ride.
But without a good shell framework we would have struggled to make a system at all, given such a mashup has hundreds of MBs of third-party code, and hundreds of tables across the three applications' schemas.

Finally, we now have a set of deployment tools that we can reuse for a big productivity boost when creating bespoke VLE, e-portfolio and CMS systems in future.

Wednesday, 27 October 2010

Django barcamping

I hosted a Django barcamp this week, in the midst of a hectic schedule of attending a UKOLN DevSci mobile day and the Plone Conference.

I had offered to host one of the events for the local Bristol and Bath Django users group, DBUG, after attending the last one. So when prodded I suggested lightning talks and set about putting together a couple. Having got the ball rolling I thought I should offer to host too. The beer and pizza came for free via Dan Fairs' contact at Austin Fraser IT recruitment agency.

It was good fun and quite successful, with 40-odd attendees, so aside from ending up tidying all the empty beer bottles ;-) I would recommend it.
I did a talk on using Google App Engine, which I will have to post on in more detail - once I have finally written the accompanying Android app that goes with the GAE site.

The main positive was just how in demand Python, and particularly Django, skills are at the moment - certainly no difficulty in getting a job locally that pays the equivalent of a University developer post. The positive buzz in commercial Python web development also made a nice contrast to the current cold winds blowing through the public sector.

Finally it was a good social occasion to meet, and get a bit drunk with, a bunch of fellow programmers who were interested in the same coding tools as myself.

Sunday, 25 July 2010

EuroPython 2010

Just got back from the EuroPython 2010 conference in Birmingham this week. See my day-by-day write up.

It was a change for me, my first language conference as opposed to a web platform or more academic one, enlivened by some very entertaining talks from Martijn Faassen and the Guardian's Michael Brunton-Spall.

Other highlights included the clear educational style of Raymond Hettinger, who opened up to me the internals of Twisted, generators and many other things. Along with particularly relevant talks such as
Laurence Rowe's one on XDV, which I could go back to work with the next day and immediately investigate as an ideal tool for use with a Moodle, Mahara, Drupal PHP mashup, via Apache mod_transform.

The spectrum of python users was wide including HPC (parallel processing), maths, games, robotics, cloud computing and the more familiar web uses and platforms.

Dominating was probably the more commercial Google App Engine / Django related stuff, followed by the more academic parallel processing / Twisted contingent. One thing I felt was surprisingly absent was the software packaging and management area, which Python dominates in the linux world. However maybe that's because sysadmins don't attend language conferences.

I gave a very bad lightning talk in a small effort to remedy this. Next time I will have to remember to decide what I am going to say ... rather than just make sure the slides work!

There was a distinct feeling that Zope was on the wane, and that zope3 development was now just Grok, with enthusiasts battling on despite their dwindling numbers.

I guess that raised a concern: although Google's wholesale adoption of Python as its language of choice, and Django as its web framework, is great for boosting its profile, there is the danger that it could become so Googlized that other Python frameworks competing in the same arena are drowned out, or that Python becomes Google's language just as Java is Oracle's.

Saturday, 26 June 2010

fabric and deployment tools

For some years I have made it a rule that if I ever find myself putting more than a couple of lines into a shell script, I should make it a Python script instead. That way, if I end up reusing it and it inevitably grows in complexity, I don't drift into the land of unmaintainable conditionals and loops in bash scripts of hundreds of lines, but stay with familiar and modular Python modules, classes and functions.

This use of raw Python for shell scripts has evolved over the last month or so into using the Python library Fabric. Now all my shell scripts are metamorphosing into its aptly named fabfiles. There are a number of what might be called shell frameworks around, with Fabric being the leading Python one, and Capistrano an equivalent Ruby offering.

The core benefit these tools offer is a set of handy functions and standard approaches to running ssh-based shell commands across one or more servers. Since everything is in effect controlled locally via ssh, the servers themselves need nothing extra installed on them; indeed, if they are missing some standard utilities, commands can also be run locally instead and the resulting preprocessed files punted over to the remote box.

The hidden benefit is that by taking a few minutes to always commit any repeated command line to a fabfile, you soon build up a modular library of common shell processes that would otherwise either exist as very project / platform specific shell scripts or, even more inefficiently, still be done manually.

You are now on the road to recording (and hopefully committing to a versioning system) all the processes that surround a project, and hence taking a major step towards its future maintenance, documentation and, hopefully, fully automated build, release, data population, testing etc.

How does it work?

The answer is that it's just Python, and it's just standard command line actions. So, for example, here is perhaps the most basic of functions, passed no args, to restart Apache on a remote production server ...

from fabric.api import hosts, sudo

@hosts('', '')
def restart_prod():
    """ Restart the production apaches """
    out = sudo('apache2ctl configtest')
    if out.find('Syntax OK') > -1:
        sudo('apache2ctl graceful')
    else:
        print('Not restarting apache due to config errors - see above')

This example restarts Apache on the two production servers that deliver the site: it checks that the config is OK first, then uses Fabric's sudo function to run the graceful restart only if it is.

The next step follows from the fact that you normally want to restart Apache for a reason.
Perhaps you are rolling out a code change, rebuilding data, or clearing server session caches. The first of these is perhaps the most common. The temptation is to roll out code only by updating it from your versioning system. But with a simple Fabric function you can add a local check-in to this. If it is a production server being rolled out to, then a function that tags the current HEAD or TIP as prod_stable, and with an auto-incremented release number, is also handy.
Of course, if you have written a test suite, then automatically running it on a demo build may be a step you want to add as an acceptance flag for allowing the prod_stable tagging.
You may also need to do some standard actions after checking out code, such as relinking a local var data directory into the checkout location.
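The auto-incremented release numbering can be kept as a pure helper function; here is a sketch of how mine might look (the tag prefix and the surrounding Fabric wiring are my assumptions, not the actual scripts described above):

```python
def next_release_tag(existing_tags, prefix='release-'):
    """Return the next auto-incremented release tag, given the tags so far."""
    numbers = [int(tag[len(prefix):]) for tag in existing_tags
               if tag.startswith(prefix) and tag[len(prefix):].isdigit()]
    return '%s%d' % (prefix, max(numbers) + 1 if numbers else 1)

# A fab task would fetch the existing tags from the VCS, compute the next
# one with this helper, tag HEAD/TIP with it (and prod_stable), then deploy.
```

So for example next_release_tag(['release-1', 'release-2']) gives 'release-3', and an empty tag list starts the sequence at 'release-1'.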

All these actions are simple in themselves, but they take a number of commands, and across different servers are quite time consuming.
Now you can just run one or two lines to rebuild your data or deploy your code, e.g.
fab deploy:prod

Almost as vital for future maintenance / handover, you also have a versioned script containing all the configuration and deployment details, headed with constants that describe all the dependent servers of your dev, demo, test, train or production deployments, which can live with the project code in a build directory.
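The constants heading such a fabfile might look something like this (all the server names below are hypothetical placeholders, not any real project's hosts):

```python
# Hypothetical per-deployment settings, keyed by the name passed on the
# command line, so that e.g. `fab deploy:prod` can look up its servers.
DEPLOYMENTS = {
    'dev':  {'hosts': ['dev.example.org'],  'branch': 'default'},
    'demo': {'hosts': ['demo.example.org'], 'branch': 'prod_stable'},
    'prod': {'hosts': ['www1.example.org', 'www2.example.org'],
             'branch': 'prod_stable'},
}

def get_deployment(name):
    """Look up a named deployment, failing loudly on a typo."""
    try:
        return DEPLOYMENTS[name]
    except KeyError:
        raise ValueError('Unknown deployment: %s' % name)
```

Keeping this lookup table at the top of the fabfile means the deployment topology is documented and versioned alongside the tasks that use it.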

Where shell frameworks, build tools and configuration management meet

Having said that, there are a number of areas where a shell framework can be used to do things that are really in the realm of a build tool or a configuration management system. The line to draw here is that if a platform has mature build tools that work with it, e.g. buildout or pip in the case of Python, then don't recreate them in a shell framework. Similarly for configuration management. This is a difficult line to draw between developer and sysadmin. A practical split is that anything specific to the project code, data fixtures etc. suits developer shell-framework deployment scripting, whilst the VM, OS, core language installation, project layout and data backup are more in the realm of configuration management (Puppet, Bcfg2, et al.). But this may just be because we tend to have release management as a developer role, whilst system failover and rebuild is one for the sysadmin.

Of course this is a natural point of conflict: language build tools, shell frameworks etc. aim to allow completely replicable build or configuration processes across any platform, whilst config management tends to revolve around building the whole platform, and hence naturally favours the setup tools of whichever platform it is building (e.g. Linux distros' package managers) as the most stable and tested for it.

So as a rule of thumb, if you have to build consistently across different hosters where you don't have access to config management, then you may be tempted to step more on its toes with a shell framework; similarly, if build tools are somehow restricted for you, then their territory is up for grabs. But on the whole all of these tools can have a place in a configuration and deployment stack, as illustrated in this posting on a deployment blog.

Wednesday, 7 April 2010

Time to get with distributed versioning

I decided to bite the bullet and get on board with a distributed versioning system over Easter.

There were a number of reasons for this. Firstly, other developers at ILRT were using it and extolling its virtues, the primary one for me being that DVCS is by its nature more in keeping with the open source ethos: its architecture promotes communal distribution of the source. There is no overall control via a central repository and HEAD; instead there can be many distributed repositories with their own tips, and an original set of source code can more easily evolve for different use cases. So its ownership is more open.

Another reason is that it seems to have gained a lot of development effort in the last few years, with the two leaders, Mercurial and Git, gaining features more quickly than the leading versioning system in the open source world, SVN (which in turn supplanted CVS). The other advantages of DVCS are the more mundane ones of performance and convenience. Since there is no need for server access when checking in changes etc., this can be significant. And instead of directories stuck all through your code, you just get one or two at the top level holding the whole repo for that code locally.

So which DVCS have I gone for? I basically used the same reasoning as Google Code, which plumped for adding Mercurial alongside SVN: it is easier to learn when coming from SVN since the basic commands are the same, and it has wider platform compatibility / tools (e.g. TortoiseHg). Along with the fact that it's all Python, which I know well, and that with Google Code choosing it and BitBucket already available, two of the main open source hosters are Mercurial to GitHub's one. Having said that, I found using it on Google Code problematic and its interface less intuitive than BitBucket's, so I have plumped for BitBucket as my open source hoster. It's also got a cooler name - and offers Basecamp integration, which we use.

Currently I am just moving in my public open source projects that are not already available in a public SVN repo (i.e. the Plone collective). I guess time will tell whether it makes sense to pay a few dollars a month to use this hoster for private repos too, or to set up Mercurial locally with an integrated wiki and issue tracker.

Wednesday, 17 February 2010

Database performance

Over the last couple of days I have been attending an Oracle 10g/11g performance tuning course given by iTrain, which had been booked by another team of programmers in the University where I work.

I have always been reluctant to be labeled as a relational database programmer, or even worse as a part-time DBA, since it has never appeared to be the most thrilling element of web development - along with the need to keep it in perspective against other storage options such as RDF or object databases like the ZODB.

As a result I have probably under-prioritized acquiring database theory and skills, so it was a real eye opener to go on this course and realise just how little I knew about coding for database performance. My query optimization was pretty much limited to trying rewrites or the odd view, maybe adding the default type of index, and very occasionally running explain plan - but not really knowing how to fully interpret it!

Although the course focused on Oracle with regard to the details, the general principles are entirely applicable to any other relational database such as Postgres or MySQL.
For example, the fundamental RDBMS architecture of the optimizer (rule- or cost-based) that generates the query execution plans governing query performance, and how indexing, analyze statistics and other factors determine this, are all common to Postgres and MySQL. The process of writing and tuning your schema and queries with this in mind is one that should really be applied to all but the smallest-scale data projects - i.e. a few hundred rows or less per table.
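This is not Oracle, but the sqlite3 module in the Python standard library is enough to sketch the basic idea of reading a query plan and watching an index change it (the table, column and index names are invented for the demo; the exact plan wording varies between database versions):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)')
conn.executemany('INSERT INTO users (email) VALUES (?)',
                 [('user%d@example.org' % i,) for i in range(1000)])

query = 'SELECT id FROM users WHERE email = ?'

# Without an index the optimizer can only do a full table scan ...
before = conn.execute('EXPLAIN QUERY PLAN ' + query,
                      ('user500@example.org',)).fetchall()

# ... add an index on the filtered column and it switches to a search.
conn.execute('CREATE INDEX idx_users_email ON users (email)')
after = conn.execute('EXPLAIN QUERY PLAN ' + query,
                     ('user500@example.org',)).fetchall()

plan_before = ' '.join(str(row) for row in before)
plan_after = ' '.join(str(row) for row in after)
```

The same habit - run the plan, check whether the predicate actually uses an index, re-run after changing the schema - carries straight over to explain plan in Oracle or EXPLAIN in Postgres and MySQL.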

Some parts had wider applicability than that; for example, the majority of the indexing information could apply just as well to the indexes used within the ZCatalog that comes as part of Zope's object database.

So I aim to finish writing up a quick-tips summary of the course: database performance tweaks that are fairly database-platform agnostic.

Thursday, 4 February 2010

django freaks ... or not?

I went to the local Bristol and Bath Django users group meetup at the Watershed bar in Bristol last night.

Although I had feared they would be rather hardcore platform geeks ... actually they were surprisingly platform unspecific, with pretty much everyone having done or doing stuff in something else, e.g. Java, .NET or PHP.
No Perl though - I guess that indicates the under-40s profile that I was attempting to subvert.

There was a slight hint of Plone / Zope refugee status. Although nobody was using Django as a CMS solution, some were using Drupal, like we do in the ILRT where I work at the University of Bristol.

Aside from the pure techiness it was good to chat with freelancers outside of academia, and as a group of practical developers there was none of that platform evangelism that can be rather wearing at highly platform-specific events.

Thursday, 28 January 2010

First django egg release

I have just released my first Django egg – see django-oraclepool. It packages some code available on the Django site for pooling Oracle connections, which gave truly radical performance benefits within our production setup.
Hence it seemed of value to package it and add performance tests etc. to make it easy for others to use.

(Further details on my django experiences).