Wednesday 1 December 2010

The maths of coding for performance

I became aware recently due to preparing for an interview, that
although I thought I knew a bit about algorithms and data structures in python, I was actually pretty ignorant of this field.
A gap in my programming skills resulting from being self taught rather than classically computer science trained. In the practical world of programming I knew how to write fast python, what data structures to use for which task and how to use them. But I didn't really understand why.

Welcome to the world of big-Oh and the internals of data structures, and the maths of how they should be carefully chosen and used to avoid bottlenecks in your code.

Having only a few days to prepare I rapidly digested a borrowed book
Data Structures & Algorithms in JAVA. Although java is not like python in terms of its clean resemblance to pseudo code, I do at least find it easy enough to read - compared to C++, and currently most books on this topic will be in one or the other. Although there are a few python ones too, the latest being Python algorithms.

As it turned out the book was an ideal introduction to the topic, and I would recommend it to anyone interested. I hesitate here to try and summarise it, but essentially there are some standard maths approaches which allow generic analysis of the speed and memory usage of algorithms, and a fairly small set of standard algorithms and data structures that come up over and over in computing.
It pays to know what these are when you are using them and when and how you should use one over another. For python the standard library has an even more restricted set compared to Java so there is no excuse for not knowing all of them - and their foibles.

Python data structures


Quick summary with some resulting basic python performance issues:

  • tuple / string - immutable list hence changing a string should always use format % to build a new string in one go, since adding creates and discards many strings

  • list - dynamic array (vector) so it has fast access and insert speed is slower the further to the left it is, so ideally always append and extend. Use for FILO.

    • Lists use Timsort a hybrid merge and insertion sort that is optimal for real world data patterns (Used as the default Android sort too)

    • NB: List comprehensions provide the optimised map and filter list manipulation functions that are the bread and butter of python coding



  • array - from array import array if you want a memory compact single type list then use an array (performance the same as a list)

  • set - more of a functional reason for usage so its quicker to convert list like types to sets and back for set type operations

  • deque - for speedy insertion, deletion, left popping (FIFO) at the expense of slower read times, use the double linked list deque

  • dictionary - this is an open addressed (single key value per bucket) hash table O(1) for most methods aside from iterating keys and values, O(n). Hence use has_key or if key in dict, never use if key in dict.keys().

    • Remember for python hashing instances of methods, classes etc. into a dictionary is a standard coding pattern.

    • Another hacky performance tip is d = d.update(d) since this doubles dict storage without adding the duplicates, making it more sparse and hence better performance for read/writes at the expense of more memory.



  • Ordered dictionary (2.7 or later) - adds ordered keys with the same amortized performance as an ordinary dict

  • Heapq - A self sorting binary tree suited to queue usage, import heapq then use heapify(list)

  • OOBTree - Not standard python but the corner stone of zope and the ZODB, BTrees are multi-way search trees with block sized numbers of leaves suited to memory / file system storage. This structure was also described in the final pages of the afor mentioned Java book.



Oh and the interview ... well it was for a certain company that uses Python for all its sysadmin and automation, but to a lesser extent for internal software development where C++ is king and java second. Python, Java and Javascript share the hosted platforms / SDKs role - have you guessed who it is yet? So all my algorithm prep was in vain because although they had contacted me due to my open source python packages, they were actually looking for a sysadmin and required relocation. Unfortunately even the chance of working for Google, can't turn me into a sysadmin ;-)
I flunked some basic unix questions without getting a single programming question let alone needing to give a big-Oh breakdown of the algorithms I might choose to solve some top coder like programming test. Ho hum ... at least I can definitely say I learned something from the experience.