Sunday 16 November 2014

The 10 commandments of maintainable web services

Here is a list of the ten core elements that a development to deployment infrastructure needs in order to provide a stable service for your web applications. They also minimize the time wasted on bugs and issues that are unrelated to functional development, and slash maintenance time and cost compared to systems without them. I guess it could also be called automation, automation, automation ...

It should be noted that just because an application is a legacy one, it does not mean that all of this infrastructure cannot be retrofitted to it. *
  1. Standard environment
    A set of consistently built and upgraded deployment phase environments - dev, demo, train, prod - for the full application stack, e.g. app server, cache, web server and storage. All development and deployment is done on these entirely standard (ideally config managed / virtualised) cloned environments. If random desktop / laptop computers must be used, then ideally a VirtualBox build should be provided for dev, to match the deployment ones.
    For web applications the server side will be a single environment, but if client side software is involved it may require multiple standard environments for build and test.
  2. Automated build
    Run one command or press one button to create a full application stack instance on any of the deployment environments, including production. This should cover everything above the standard environment and ideally include storage too (see data automation). Each developer can build any number of deployment instances in the same automated fashion. Builds should be remotely runnable, for plugging into Continuous Integration (C.I.) servers etc.
  3. Automated release management
    It is particularly important that no manual tasks are needed to deploy to production. A push button, C.I. driven deploy should be used, where each deployment is retained in a full log accompanied by a summary deployment note, the release history of the related software packages and the source tag. This full logging of changes ties into software service change management concepts. If unforeseen issues in dependent systems develop much later, they can then potentially be tied to the highly detailed, timestamped change log that this provides.
    Automating the roll-out means that you should also automate the roll back. You hopefully will test well enough not to need a safety net, but not bothering to use one is reckless.
    Another common loophole is that release management only covers the application layer. The standard environment, storage etc. are all part of the stack and changes in them are also releases, and need the same release management controls in place.
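    As a rough illustration of the logging half of this, here is a minimal sketch of an append-only deployment log where a roll back is just another logged deployment of the previous tag. The class and field names (ReleaseLog, Deployment, source_tag and so on) are made up for the example and not taken from any particular C.I. server, and the actual deploy step is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Deployment:
    """One entry in the deployment log (all field names are illustrative)."""
    environment: str
    source_tag: str
    note: str
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


class ReleaseLog:
    """Append-only log of deployments; history is never rewritten."""

    def __init__(self):
        self.entries = []

    def deploy(self, environment, source_tag, note):
        # A real system would run the automated build here; this sketch
        # only records what was released, where and when.
        entry = Deployment(environment, source_tag, note)
        self.entries.append(entry)
        return entry

    def current_tag(self, environment):
        # The most recent entry for an environment is what is live there.
        for entry in reversed(self.entries):
            if entry.environment == environment:
                return entry.source_tag
        return None

    def roll_back(self, environment):
        # Roll back is logged as a new deployment of the tag recorded
        # immediately before the current one.
        tags = [e.source_tag for e in self.entries
                if e.environment == environment]
        if len(tags) < 2:
            raise RuntimeError("no earlier release to roll back to")
        return self.deploy(environment, tags[-2], "roll back from " + tags[-1])
```

    Because every roll back is itself an entry, the log stays a complete timestamped record to compare against monitoring data later on.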
  4. Revision control of the entire application stack
    Everything in the application should be versioned. That means all the source code, of course, but also all the deployment and automation code. The third party components should all have their own versions (if not, download, version and deploy them from your own local repository). That includes the application specific environment configuration, e.g. Apache virtual host configuration.
    Build automation should allow a build to be specified by tag (or by date, or as any previous release logged via C.I.).
    The code dependency stack should also be versioned - that is, the versions of every component used for a system release. Language specific build tools such as Maven, Pip, Ant, Phing, Bundler, Buildout etc. provide this. The standard environment(s) should also be versioned via their config management tool.
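    One cheap check for a versioned dependency stack is to fail the build if any dependency is not pinned to an exact version. The sketch below assumes a pip style requirements list where an exact pin uses "=="; the function name is invented for the example, and it makes no attempt to handle URL or editable requirements.

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r base.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

    Run against the lines of a requirements file in C.I., an empty result means the release can be rebuilt later with the same component versions.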
  5. Integrated documentation
    Core documentation should be written and versioned with the source code. Each package should at least have a README and a release HISTORY tied to each production release's version number. These need to be kept up to date with the rest of the source. Separate wikis for fuller / less technical docs are fine, but documentation of changes in functional specification needs to use the same version control as the code. Unless all your code has rigorous processes around an issue tracker integrated with versioning, that is most reliably done by putting the documentation in with the code.
    Ideally the language's packaging tools should have a system to extract embedded documentation and comments into HTML on a software repository server - for easy reference.
    Automation to keep the web documentation up to date should be implemented. 
  6. Software upgrade process
    Major version platform upgrades should always be performed within a year of their release date. This is not just for security patch reasons (security patches themselves must be carried out within a month at most). Ideally the former happens within a few months and the latter within a few days. Any longer and code divergence can make the upgrade hill too costly to climb, or compromise system data. Major language / framework upgrades (as well as releases) should not require significant system outages. They may not be automated to set up, but they should be automated to roll over between versions - so even if part of your application stack lacks a multi-server load balanced layer, downtime should still be under a minute at most, e.g. an Apache or database restart.
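    The two deadlines above are easy to turn into an automated nag. This is only a sketch: the limits encode "within a year" and "within a month" as 365 and 31 days, which is my reading rather than a precise rule, and the function and constant names are hypothetical.

```python
from datetime import date, timedelta

MAJOR_UPGRADE_LIMIT = timedelta(days=365)   # "within a year of release date"
SECURITY_PATCH_LIMIT = timedelta(days=31)   # "within a month at most"


def overdue(release_date, today, is_security_patch):
    """True if an upstream release has been left unapplied for too long."""
    limit = SECURITY_PATCH_LIMIT if is_security_patch else MAJOR_UPGRADE_LIMIT
    return today - release_date > limit
```

    For example, overdue(date(2013, 11, 1), date(2014, 11, 16), False) flags a major version that came out just over a year earlier and has still not been adopted.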
  7. Automated testing
    It may not provide great coverage, but a minimal test suite is a necessity to allow confirmation of success for the automation infrastructure.
    Good test coverage means that complex functional errors or regressions can be written as tests and added to periodic builds, so ensuring that future releases are free of them. But a set of minimal functional or black box tests is sufficient to cover basic confirmation that automated environment upgrades, or minor application fix releases, do not cause critical failures. These tests can also be tied to monitoring / timed load testing, to check upcoming releases for performance regressions.
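    A minimal black box check can be as small as "do the key pages still return 200 after the upgrade". The sketch below uses only the Python standard library; BASE_URL and the example paths are placeholders for whatever instance and pages matter to you, and the status fetcher is passed in so the check itself can be tried without a running server.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"   # hypothetical instance under test


def http_status(path):
    """Fetch a page and return its HTTP status, or None if unreachable."""
    try:
        with urlopen(BASE_URL + path, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code          # server answered, but with an error status
    except OSError:
        return None              # connection refused, timeout, DNS failure


def failed_paths(paths, fetch_status=http_status):
    """Return the paths that did not come back with a 200."""
    return [path for path in paths if fetch_status(path) != 200]
```

    A build step could then fail the release whenever failed_paths(["/", "/login/"]) is not empty.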
  8. Data automation
    This involves data fixtures, automated schema generation and synchronisation.
    With an object relational mapper (ORM) now standard in today's web applications, your system should have a full data abstraction layer, even in the most micro web application framework. In turn that means today's application code should contain within it the means to generate all of the data layer. Ideally ORMs should provide the means to abstract fully the database implementation, to generate that implementation within a range of RDBMS, and to generate data fixtures for it, for building populated new development instances or for testing.
    As standard the test harness will set up and tear down the data layers.
    More mature ORMs will also have schema migration tools. These are essential for fully automated release management, since invariably a significant release will involve a change to the data schema, or at least a new entry in the database. A synchronisation tool will tend to use meta-programming to automatically generate the migration code that synchronises the schema. That migration is then released (or rolled back) as part of the code release, keeping the data storage in the release management loop. Any data modification (DML) that the application requires can be added to the DDL of the schema migration. These tools will also have introspection code to detect that data migration is required if connected to a previous version of the database. Bespoke applications may not have the tool, but at worst they should have data creation and migration code written and packaged with newly released versions. Manual database tinkering around the same time as the code release is not acceptable.
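    For a bespoke application without a migration tool, the "at worst" option can still be quite small. The following is a hand rolled sketch using sqlite3 from the standard library, not how any particular ORM does it: the person and role tables and the schema_version table are invented for the example, and each migration carries its own roll back statement.

```python
import sqlite3

# Ordered migrations: (version, forward SQL, backward SQL).
# The tables are hypothetical, purely for illustration.
MIGRATIONS = [
    (1, "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE person"),
    (2, "CREATE TABLE role (id INTEGER PRIMARY KEY, title TEXT)",
        "DROP TABLE role"),
]


def schema_version(conn):
    """Introspect the database to find which migration it is at."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, target):
    """Move the schema forwards or backwards to the target version."""
    current = schema_version(conn)
    if target > current:
        for version, forward, _ in MIGRATIONS:
            if current < version <= target:
                conn.execute(forward)
                conn.execute("INSERT INTO schema_version VALUES (?)",
                             (version,))
    else:
        for version, _, backward in reversed(MIGRATIONS):
            if target < version <= current:
                conn.execute(backward)
                conn.execute("DELETE FROM schema_version WHERE version = ?",
                             (version,))
    conn.commit()
    return schema_version(conn)
```

    The release script calls migrate(conn, 2) on roll-out and migrate(conn, 1) on roll back, so the schema change travels with the code release rather than being applied by hand.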
  9. Package management
    Application layer package management will always be language specific, but any language should offer it. Ideally a package repository should be maintained for each language your services use. These may be core to the language, like PyPI and RubyGems, or for languages without them in the core there are commercial offerings like Nexus for Java.
    This caters for version dependency management and reliable upgrade. Of course to use a package manager fully, you should package all your application source code. Ad hoc scripts or framework app archives, raw class and resource bundles etc. - just say no. If you are going to release your code rather than chuck it over the wall ... package it and version it. So all your code should be in jars, eggs, gems - or whatever your language likes to call them.
    Not only that, you should apply the same rules to splitting up packages as you apply to splitting up code into classes. Some packages may be dependent on others, but each separate component of the application should be a different package, to allow it to be separately version controlled and released. This encourages encapsulation and hence allows packages to be reused, retired or replaced without replacing the whole application's code base.
    (NB: Environment package management will be operating system specific and that should be implemented as part of the standard environment config management layer - no building from source here!)
  10. Monitoring
    One of the most important issues with logging and error notification is the cry wolf factor. You need to ensure that you draw the line in the right place for what are critical errors - i.e. those that generate notifications to people. You can have over reporting initially if it makes you hammer down on all those bugs to get to a reasonable level. But the one thing that makes monitoring ineffective is over reporting. If a system is emailing you a hundred stack traces a day, and has been for the last month - or the critical log is equally verbose - you filter the emails and ignore the log. You need critical bug notifications to be rare enough that you jump straight on fixing them when they are sent. Of course don't overdo it in the other direction: ideally you should never be in the position where the only reason you know that a service is down is because an end user has phoned up to tell you. If your monitoring is good enough it will always tell you first, for all but the most involved functional errors.
    You also need standard uptime monitoring such as Nagios or the like to notify if services have failed completely (unable to send application layer errors) for each of the layers - web, storage, cache, environment.
    Plus load logging for each, response time logging, etc. Most importantly you need to retain the logging over time and hence be able to look back at problems vs. change management data (see automated release management) to be able to diagnose many service issues and ideally predict and forestall them.
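One way to keep notifications rare without losing information is to notify on the first occurrence of an error and then only count repeats for a while. This is a sketch under my own assumptions: the "signature" is whatever string you use to group identical errors, times are plain seconds, and the class name is invented. It is no substitute for fixing the bugs.

```python
class AlertThrottle:
    """Notify on the first occurrence of an error, then stay quiet."""

    def __init__(self, quiet_seconds=3600):
        self.quiet_seconds = quiet_seconds
        self.last_sent = {}    # signature -> time of last notification
        self.suppressed = {}   # signature -> repeats since that notification

    def should_notify(self, signature, now):
        last = self.last_sent.get(signature)
        if last is not None and now - last < self.quiet_seconds:
            # Still inside the quiet period: count it, but do not email.
            self.suppressed[signature] = self.suppressed.get(signature, 0) + 1
            return False
        # First sighting, or the quiet period has passed: notify again.
        self.last_sent[signature] = now
        self.suppressed[signature] = 0
        return True
```

Every error still goes to the log, so the retained history is complete; only the emails to people are rationed.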

Walk the walk

So do I have the ten commandments in place for all our production systems in my current workplace? In part. We have them for all our Python Django web applications (although some are a bit sparse in places, e.g. monitoring, and release management below the application layer). But our Java architecture only has packaged components, although work is being done for new Java Spring systems to provide automated build and ideally some tests, and the need for monitoring is recognized. Hopefully we will tick all ten boxes for it too, eventually, so that we have as solidly maintainable a Java Spring platform as we have with our Python Django infrastructure.

However the concern is perhaps as much with all our legacy or outsourced systems integration code. These have none of these components and no realistic likelihood of getting them. Hence a huge support burden results, diverting time away from providing those components and leading to unreliable services. Add to that the problem of how platforms can be frozen whilst still in use, as with our legacy Python (Zope) architecture, and then rot and lose the maintenance infrastructure that they had (our old CMS went live with half of the above features - now it has none), and the picture becomes a little bleak. Here the answer is perhaps to start to implement much more hard nosed rules with respect to retiring systems if they have replacements, whether or not those replacements fully cover the same functional space. Essentially this is a management, not a technical issue.

With a much reduced set of critical legacy systems and appropriate resourcing it would be possible to retrofit the commandments to them, and bring all services up to a similar level of quality control.

However the problem is greatly exacerbated by 'new' legacy bought in systems. By this I mean third party supplier systems that we run and have to maintain (e.g. regular upgrades, performance monitoring etc.) that do not have most of the above features. Unfortunately that appears to be true of all the smaller suppliers' systems procured recently - i.e. those from companies with under 10 core developers. Perhaps this is because most of them are providing products that actually are legacy, i.e. have not been written, or fully rewritten, in the last 6 years (for the full rant on this topic see the ten commandments of software procurement!)

* Fixing the legacy and external systems

There are plenty of configuration management and shell framework tools that can be applied to automate even the messiest old legacy systems. The key rule here is you don't need to write any of the infrastructure in the legacy code base. So use your standard CI server, shell framework and config management tools - don't add more procedural platform specific code (e.g. raw shell scripts).
Modern automation tools should all be pretty platform independent - although if running Windows and Unix you may be better off using a different shell framework for each, e.g. Fabric and PowerShell, and possibly the same goes for config management tools.

If the code contains closed source compiled components with no versioning, then the binaries can still be put into version control and release numbers assigned. At worst decompilation tools can be used, if there is no other reasonable way to fix or replace the components.

Similarly black box testing tools can be applied to any software, and if none of the technical team know what that code is doing, end users can provide a basic functional spec of what it's meant to do, and these few basic stories can be used to create some minimal BDD tests.
Data in / data out dumps and comparisons can also be used as a basis for manually maintained fixtures. Legacy components can be split up and packaging added to them ... but with much more work along this line of legacy code refactoring, we start to raise the question of respecify / rewrite / replace being more cost effective.
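
The data in / data out comparison can be sketched in a few lines: record what the legacy code returns for a set of sample inputs, keep that record as a fixture, then check any patched or replaced version against it. The function names are made up for the example, and it assumes the inputs are JSON serialisable and the legacy code can be called as a plain function.

```python
import json


def record_outputs(legacy_function, inputs):
    """Run the legacy code over sample inputs and capture what it returns."""
    return {json.dumps(i, sort_keys=True): legacy_function(i) for i in inputs}


def differences(recorded, candidate_function):
    """Return the inputs where the candidate no longer matches the record."""
    diffs = {}
    for key, expected in recorded.items():
        actual = candidate_function(json.loads(key))
        if actual != expected:
            diffs[key] = (expected, actual)
    return diffs
```

An empty result from differences() says nothing about whether the legacy behaviour was right, only that it has not changed, which is usually the first thing you need to know.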