Monday 11 December 2023

Software development with Generative AI

The Current State of AI Software Generation

The user tries to describe, in standard English, the snippet of high level programming language code they want generated, and submits it to the AI tool. So what are they asking the AI to generate, and how does it do it?

The high level language

High level programming languages are human languages, composed of English words and mathematical symbols, designed for the comprehension and composition of precise computer instructions. To a computer the language makes no more sense than English does. It has to be compiled or interpreted into computer language for it to run. So it may compile to an intermediate bytecode language, and then maybe to human readable assembly language, before final translation into the unreadable machine code that the computer runs.

A programmer learns the high level language and becomes fluent in it. They can read and understand the functionality of that code, with the complexity of the machine specific implementation stripped away, leaving just the precise functional maths and English symbology that describes the computer's behaviour. They think in that code in order to write it.

Unlike English, it can succinctly describe computer functionality in a few lines.
Even then, the majority of a programmer's time is spent debugging the high level language - fixing what they have written until it is bug free - because it is difficult to think clearly in code and pre-determine all the edge cases.

The AI

A detailed English language description of the required functionality, plus the name of a high level programming language, are submitted to the AI tool.

It does a search of the web, e.g. Stack Overflow etc., for results for that code language. For chatbot use (e.g. ChatGPT) it applies an English language Large Language Model, LLM (a numeric encoding of learning of the English language), to generate a well phrased aggregation of the most popular results that match the English prompt.

For software use (e.g. Copilot) it works just the same, but the LLM learns English to high level software language aggregate translation from code example data, e.g. GitHub, to generate what the code syntax might be that matches the English description of it.

Finally it returns an untested snippet of generated high level code.

The Non-Developer

The non-developer pastes it in place and tries to run the program with it included.

They may be able to puzzle out the high level language, but they don't naturally think in it, just as people without mathematical skills can only think as far as basic arithmetic and are lost when it comes to complex equations.

It seems to work around 50% of the time. When it fails, they go back to square one and try to rephrase their English prompt.

They patch together block after block of prompt generated code. A crazy paving of a program that likely has a number of bugs and inappropriate features in it. But it kind of works, and for the non-developer that is good enough.

The code gets pushed out there with all its imperfections, and starts to populate the web of code data that is used to generate the next AI code snippet.

Or The Developer

The developer reads the code and understands it, and determines whether it should do what they want, or whether they just want to use some of it as an example.

They cut, paste and rewrite it, using it as a hint tool, or as an extension to their IDE's existing auto-code generation tools that work using templated code and language / import library searches.

Hopefully their IDE is set up to clearly distinguish between real code completions and possible generative code completions. Otherwise the percentage of nonsense code created by the generative AI pollutes the 100% reliability of IDE code completion, and harms productivity.

Then they run their code and debug as usual.

At least 75% of programming time is spent not on writing code, but on making sure that the high level instructions are exactly correct for generating bug free machine code - iteratively refining the lines of code. With code, a single comma out of place can break the whole program. When language has to be so carefully groomed, succinct, minimal language is essential.

For many developers, adding an imprecise, non-mathematical language that is entirely unsuited to defining machine code instructions, such as English, in order to generate such code is problematic. It introduces a whole layer of imprecision, complexity and bugs to the process, slowing it right down. It also requires developers to write a lot more sentences (in English) rather than just quickly typing out the succinct lines of Python (or similar) programming language they have in their head.

Generative AI can help students and others who are not yet fluent in a computer language, but can it actually improve productivity for real, full time developers who are fluent in that language?

I think that question is currently debatable, because I believe adding yet another language to the stack of languages that need to be interpreted for humans authoring computer code, especially one as unsuited as English, is only useful for people who are far from fluent in the software language.

Once we move beyond error prone early releases of LLMs like GPT-4, then tools such as Copilot may start to become much more effective at authoring software, and actually produce code that is as likely to work first time, and have the same number of bugs, as your average software developer's first cut of the code. We may reach that point within a few years. At which point professional software developers will need to be adept at using it as part of their toolset.

Even so, I believe the whole conception of applying AI to writing software could benefit from more work on a computer centric alternative to the current approach focussed on generating plausible human language responses. That approach only dominates because of all the effort that has gone into NLP and human interaction. Taking that and bolting it onto writing human software languages is more about creating a revenue stream than attempting to have AI do the main work of software development.

Until then, AI will not be able to replace me as a software developer. It will only be another IDE tool I need to learn ... once it improves sufficiently to increase productivity.

Another Way

Copilot and the like currently use the ChatGPT approach of a chatbot front end tied to an English language LLM that generates aggregate search engine results in a human language. But there is no domain specific machine learning knowledge about the semantics of the content. So it doesn't understand, and certainly doesn't pre-check, the code - just as ChatGPT doesn't understand the search engine content, since currently there are no domain specific trained models for the content in the loop. If asked a question about pharmacy, it doesn't plug in one of the AI models that has learnt pharmacy and is used by that industry to aid in the development of medicines. It understands nothing; it is a chatbot, just a constructor of plausible answers based on search popularity.
Similarly Copilot has learnt how to predict what code somebody might be trying to write, but it hasn't learnt how to code.

This approach cannot lead to AI generating innovative new coding approaches, full self-coding computers, or remove the need for human readable high level programming languages.

There have been experiments with applying test driven development to AI generated code, but I have not heard of serious attempts to address the bigger picture...

  • Move all functional code writing to be AI only.
  • Remove the need for any high level computer language for humans to gain fluency in.
  • Have AI develop software by hundreds of thousands of iterative composition TDD cycles.
  • Parallel refactoring thousands of solutions to arrive at the optimum one.
  • Use AI that understands the machine code it is generating by training it on the results of running that code. 
  • The ML training cycle must be running code not matching outputs to pre-ranked static result training sets.
  • In addition to the static LLM that encodes the learning of machine code authoring, dynamic training cycles should be run as part of the code composition. Task based ephemeral training models.
  • Get rid of the wasted effort training AI to understand English, Python, Java, Go or any other existing human language evolved for other tasks.
  • Finally we are left with the job of telling the computer what its software should do.
    We do not want to use English for that, it's way too verbose and inaccurate; similarly we don't want a full high level programming language to do it. We need a new half way house: a domain specific language (DSL) for defining functionality only, designed for giving software specifications to an AI that it can use to generate automated test suites.

Self-Programming Computers

Exploring the last point in more detail...

Create a higher level pseudo-code language for describing the required functionality that is more English readable than even current high level languages such as Python.

Make that functional DSL focus on defining inputs and outputs - not creating the functionality, but creating the black box functional tests that describe what the working code should do.

Maybe add tools for a more no-code approach, with visual generators for the language, e.g. graphical pipeline builder tools, for people who find thinking visually easier than thinking symbolically.

The software creator uses the DSL to create an extensive set of functional definitions for a project.

The DSL language design and evolution is optimised for LLM interpretation. So it has very tight grammatical and syntactical usage that promotes accurate generative outputs.

A new non-developer friendly high level pseudo code language / rigorous AI prompt writing lingo.

Some basic characteristics of the DSL:

  1. auto-formatting (like Go) minimizing syntactical variation
  2. To quote Python's creator - 'There should be one-- and preferably only one --obvious way to do it.'
    But strictly applied, rather than as the vague principle it is in Python
  3. unlike any other high level language, the design needs to be optimized only for specifying functionality; a high level templating language from which test suites are generated (see the sketch after this list)
  4. the language will never be used to implement functionality
  5. uses simple English vocabulary and ideally minimal mathematical symbology
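
Purely as a hypothetical illustration of that idea (neither the DSL nor the spec-to-test generator exists; the spec syntax, the function name clamp and the Go test below are all invented here, with Go used only because it is familiar), a tiny functional spec and the sort of black box test it might expand into could look like this:

    // Hypothetical functional spec in the imagined DSL (shown as a comment):
    //
    //   function: clamp
    //   inputs:   value, low, high (whole numbers), requires low <= high
    //   output:   value limited to the range low..high
    //   examples: clamp(5, 0, 10)  -> 5
    //             clamp(-3, 0, 10) -> 0
    //             clamp(42, 0, 10) -> 10
    package specsketch

    import "testing"

    // clamp stands in for the implementation the AI would iteratively
    // generate, run and rewrite until every generated test passes.
    func clamp(value, low, high int) int {
        if value < low {
            return low
        }
        if value > high {
            return high
        }
        return value
    }

    // TestClampFromSpec is the kind of test the spec's examples would expand into.
    func TestClampFromSpec(t *testing.T) {
        cases := []struct{ value, low, high, want int }{
            {5, 0, 10, 5},
            {-3, 0, 10, 0},
            {42, 0, 10, 10},
        }
        for _, c := range cases {
            if got := clamp(c.value, c.low, c.high); got != c.want {
                t.Errorf("clamp(%d, %d, %d) = %d, want %d", c.value, c.low, c.high, got, c.want)
            }
        }
    }

The point is that the human only ever writes and reads the spec; the tests, and eventually the implementation beneath them, are machine generated artifacts.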

These DSL prompts are written with the help of an LLM trained on the DSL itself - so the tool helps create its own prompts - and the code creator uses it to refine all the DSL definitions that specify the full functionality.

The specification DSL auto-generates all the required tests in a low level language.

The system also has a generative AI LLM trained on C or assembly language.
This is what creates the actual functional code, by iteratively running and rewriting it against the specification encoded into the tests.

The AI tool then generates the tests for that implementation and uses TDD to generate the actual functional code. Eventually the system should improve to a level better than most software developers. The code it writes no longer needs to be read by a human - because a human will be unable to debug it at anything like the speed the AI tool can.

So we use generative AI to do the part of the job that actually takes all the time. Debugging, refactoring and maintaining the code, making sure it really does what is required functionally. Rather than the quick job of writing a first cut of it that might run without crashing.

Most importantly we don't introduce the use of the full English language, the language of Shakespeare, the language of puns, double meanings, multiple interpretations, shades of grey, implied feeling and emotions, into a binary world to which it is entirely unsuited.

Also we don't need English or high level computer languages in the stack of mistranslation at all.
Because we are not training the AI to understand human languages. We are training it to write its own machine code language based on defining what behaviour it should implement.
BDD / TDD generative AI if you like.

Humans no longer learn complex mathematical, process based languages that can be translated into machine code. They learn a more generic language for specifying functional behaviour.

This gives more freedom to widen the DSL to mature into a general precise AI prompt language.

Whilst allowing computers to evolve more machine learning driven software architectures that are self maintaining, and not so constrained by the models imposed by current human intelligence and coding-practice based programming languages.

Could AI take my job?

Perhaps if all of the above were in place, then finally we would arrive at a point where AI could replace traditional software development and high level software languages.
With concerted effort that could be within 10 years, if some big companies put serious investment into trying to replace traditional software development.
Code monkeys would all be automated. Only software architects would be required, and they would use a new functional specification AI prompt language, not a programming language.

Of course, if politicians are scared that dumb ChatGPT can already write as good a speech as they can - while replicating all the prejudices and errors of its training data and trainers - then setting AI free to fully write software, and itself, will be way more scary in its long term implications.

Meanwhile we are currently at a place where it arguably doesn't even improve productivity for an experienced software developer; it only allows non-developers, students and other language newbies to have a go at writing one of the many dialects of human language known as computer languages.

Their mix of maths, English, symbols, logic and process may appear more like English than musical notation or pure maths, but sadly they are no more suited to creation by an English language chatbot approach.

Wednesday 5 July 2023

Sustainable Coding, and how do I apply it to myself as a Cloud engineer?

I work as a developer of a Cloud service, Big Animal - EDB's Cloud Postgres product. So I went along to a meetup the other day, a panel discussion on Leveraging Cloud Computing for Increased Sustainability.

It got me thinking about this whole issue, and how in a practical sense I could do anything that might reduce the carbon footprint of the service I work on. 

The conclusion I came to was that I don't really know ... and to some extent neither did the panel. Cloud computing may give you some fancy tools to help assess these things, such as Microsoft Sustainability Manager, but there are no black and white answers as to what makes something more sustainable. Even the basic question - run it in the cloud or on prem - very much depends on what you are running and how, for one or the other to work out as the more sustainable.

So on a global scale just how significant is computing as a percentage of global energy consumption and emissions?

The Cloud Climate Issue

Comparing today with 30 years ago is useful in terms of seeing where we are going...


1990s vs 2020s IT as a proportion of global energy and emissions

  • 1990s: energy 5% (most from office desktop computers and CRTs) - 2% of emissions
  • Today: energy 8% (most from personal devices, laptops and mobiles; includes 2% for data centres) - 3% of emissions

  • Compute power / storage is around 30,000 times greater (by Moore's Law)
  • Data has grown from 16 exabytes (EB) to 10,000 EB, around 600 times, with the majority of that in the last 3 years

Today data centres (hence the Cloud) cause 2% of emissions - as much as the whole of IT in 1990, and as much as today's aviation industry.

So working as a cloud engineer looks like a poor choice for someone concerned about climate change!

But on the face of it we have been pretty efficient: our compute and storage has massively increased, yet consumption and emissions have only grown by around 50%. The issue is the acceleration in usage, which means we could double energy and emissions in 20 years if nothing is done to improve sustainability.

The increase in compute power has remained fairly consistent since the advent of the transistor, making Moore's Law more a law of physics than of human behaviour - although of course that technology is now at its limits of miniaturisation. So the energy and emissions consumed per gigaflop of compute have drastically dropped - but now everyone has the compute power of a supercomputer in their pockets.
The first supercomputer to reach 1 GFlop was the Cray in the 80s; by the 90s an IBM 300 GFlops supercomputer beat Garry Kasparov at chess; today a Google Pixel 7 phone is 1200 GFlops.
Hence our consumption has rather outstripped our increase in compute.

But it is the explosion in data that is a story of human behaviour. Hand in hand, we have reduced the costs of cloud storage and monetised personal data, with software companies valued based on how many customers, and more importantly how much customer data, they have. Recent advances in AI have proved the value of big data lakes for training models to produce practical ML applications.

Combine that with the problem of induced demand. The more and bigger roads you build, the more traffic you get. Cloud puts a 6 lane highway outside everybody's front door.

How do we measure sustainability?

So within the world of commercial sustainability and carbon offsetting, there is a basic concept of categorising things as scope 1-3 emissions.

  1. Scope 1 covers emissions from sources that a company owns or controls directly.
  2. Scope 2 covers emissions that a company causes indirectly, from the generation of the energy behind the services it purchases and uses.
  3. Scope 3 encompasses everything else - suppliers' energy use, etc.

The assumption is that raw energy consumption is not the issue. It is the generation of climate changing emissions to produce that energy that is the metric.

This includes mining for minerals to build laptops and data centres, etc. But if you run your own green energy solar farm next to your data centre, powering it directly without any significant battery storage, plus feeding energy back to the grid, you can be pretty much carbon neutral. You can also fund renewable energy projects and offset.

So perhaps strangely, given how good cooling can treble hardware lifespan, the biggest data centres in the world are currently built in the world's deserts rather than at the north pole. There, solar and wind can be relied upon to supply more than 100% of their power.
  • Microsoft Azure has been carbon neutral since 2012. Microsoft is aiming for 2030 for its whole business (and then 2050 to remove all its carbon debt since it was founded in 1975)
  • Google Cloud became carbon neutral for all its data centres in 2017, and is also aiming for 2030 for all its business.
  • Amazon is aiming for AWS to be carbon neutral by 2025, and, as a global retail supplier, to do the same for its whole business by 2040.
Of course this is not possible in most European countries, so most carbon neutral data centres in Europe will be so by purchasing carbon neutral generated energy, rather than actually being neutral in themselves. Although some go a long way down that road, partnering with renewable energy suppliers and ticking a number of other sustainability boxes. The problem is that if data centres are buying up lots of the renewable energy supply at a premium, then they are removing it from residential or other uses. So this is hardly helping global sustainability, and in reality means they are far from neutral.

Also, carbon neutral means only that scope 1 is covered. Net zero is a standard above carbon neutral where, to deal with scope 2 and 3, emissions must be taken out of the atmosphere. So in practice only a net zero supplier is actually contributing nothing to climate change. No cloud provider is net zero.

A key point is that the latest enormous scale cloud provider data centres are not the main source of emissions, it is all the older, smaller, more local data centres and machine rooms of servers that are causing the majority of the emissions. In the same way that car pollution is disproportionately down to older vehicles. Of course there is the manufacturing footprint to consider for cars that can last 40 years, but all computer hardware has a much shorter lifespan of 3-5 years. Obsolescence makes increasing the lifespan uneconomic. Another green issue that could fill a blog post on its own.

So moving to cloud provider's services and migrating any remaining on prem to the cloud is the sustainable thing to do, as long as what is moved is suited to cloud, or can be re-architected for the cloud.

What changes, as a developer, could improve sustainability?

Life Style

So the obvious thing that people think of is the nature of their employer's work. Or perhaps, if your company is a B2B one, whether it has green standards wrt. the clients it works with. For example, it may not make sense working for ExxonMobil, as the company with the world's largest emissions. Perhaps the tech industry equivalent would be working on cryptocurrency? But blockchain developers are working on that reputation, even coming up with useful uses for it, such as auditing sustainability usage for scope 2 and 3 verification.

Over half of internet traffic these days is video streaming, so stopping watching Netflix and scrolling on TikTok - and reading or listening to books instead - is maybe a good behaviour 😉
On the plus side porn has dropped from its high of 25% of internet traffic down to around 10%, but it has been more than replaced by cat and side hustle millionaire videos it seems. So if your side hustle is being a prolific social YouTuber, it may not be the most ecological of life choices, since an hour long short story as digital text is 100 Kb whilst the same hour as 4K video is around a hundred thousand times bigger at 10 Gb.

On a personal level, my previous employer was more office orientated. It was keen to encourage people into the office with free food etc., so it encouraged commuting in to work, and the maintenance of offices with permanent desk space for every employee, monitors, heating etc., and all the unnecessary extra emissions that entails. My current one is more remote-first.

In terms of remote work, having experienced pandemic lockdowns in a city, when I was going out for a regular cycle for exercise, I can confirm that the reduction in emissions may have only been measured at 20% across the whole of Britain, but in the cities it felt more like 50% - the air was so much more breathable. Whilst maximising WFH is not equivalent to pandemic lockdowns, it does make a difference. So changing jobs in the tech sector to a full-time remote position is certainly a worthwhile contribution to sustainability.

There is the argument that if we all lived alone in big draughty castles, which could be turned off for the day by packing into an office a walk away, then remote working is not more sustainable. But the reality of IT work today, especially with hybrid working, is that the big fairly empty building you are more likely to be in these days is the office.

So become full time remote if you can. If you have to work for an office based employer, then choosing one that has hot desking, smaller offices and less frequent attendance, and living within walking or cycling distance, are all part of being sustainable wrt. your tech job.

Sustainability for a Cloud SaaS company

I work for a company that produces a cloud marketplace software product, with most engineers working remotely and running no servers at all, just employees' laptops - i.e. everything we run is via cloud providers' services. We have a few offices globally, but only a minority of engineers use them. Since all teams are largely remote, there is essentially no office, no paperwork, no commute and no physical products.

The same applies to all our other services, e.g. from CI to presentations, from LaunchDarkly to our CRM, from expenses to online mental health support, plus Slack and Zoom for comms.

This is a pretty common model - you could call it a server-less company - and it was the same at my previous employer. We sell SaaS and we use it for everything internally too.

Therefore the assumption is that the problem of working out scope 2 and 3 should be solved by those cloud providers, which to some extent it is ... maybe some less than others. But emissions data can be obtained from them for scope 2 and 3.

So that leaves scope 1. This may be hugely affected by how much face to face sales and marketing goes on etc., but that is not my area. So I am purely going to focus on what options there are to improve sustainability wrt. the software architecture, development and deployment practices available for producing a cloud based software service, SaaS. Those are the areas that, as a software engineer, I can influence.

So let's break that down into some basic elements, and work out what the more sustainable practices and approaches are.

Cloud vs. On Prem

So first things first. Is working for a company that runs everything on cloud, and delivers a cloud based product a good thing, versus writing software for running in a local server room or data centre?

Assuming you use one of the big carbon neutral cloud providers, and are using virtualisation to scale capacity efficiently with usage, then it is likely that a Cloud data centre will be run much more sustainably than a local data centre where you house your own servers, and certainly more sustainably than a local machine room. Even if you are running a specialised HPC data centre where the majority of traffic is local, third party providers will be able to offer more sustainable options.

Of course if your software is entirely unsuited to cloud virtualisation (k8s micro-services etc.) or badly designed for it, you could actually be running up way more resources than a local monolithic solution on a few dedicated servers would. So sustainability goes all the way down through the architecture to the lines of code, and what they are written in.

A whole load of legacy software dumped onto the cloud can be less sustainable (and way more costly) to run than running it locally. 

So another sustainable employment decision is to not work for an organisation that either has a lot of legacy software or has its own servers or data centres - or at least only if those are bigger than a soccer pitch (i.e. average DC size or bigger) and have their own adjacent wind farm or other local renewable power source.

But if like my employer, everything is run on the three major cloud providers, and there is very little in the way of scope 2 and 3, then is the sustainable business box ticked already?

Unfortunately not. As mentioned, they are not net zero, and ~2% of global emissions come from running data centres. Whilst that may be disproportionately from ones that are not the self powered giant DCs used by the big cloud vendors, being as efficient as possible wrt. use of Cloud is still the key to being a sustainable tech worker. Especially with the projected growth in Cloud and its emissions being a significant ecological concern.

Choice of software languages


So the reference paper often quoted (and misinterpreted) for software language sustainability is this Portuguese University paper on Energy Efficiency across Programming Languages.

Where we could perhaps regard sustainability as the combined goal of minimising energy, i.e. performance (time) and memory usage (Table 5 in the paper). So leaving out the older / less mainstream languages we have ...
  1. C, Go
  2. Rust, C++
  3. Java, Lisp
  4. C#
  5. Dart, F#, PHP
  6. Javascript, Ruby, Python
  7. TypeScript, Erlang
  8. Lua, JRuby, Perl
So on that basis we should write everything in C or Go or possibly Rust, maybe even Java if we are not that eco-friendly.

Whilst I do use Go for writing Cloud micro-services, I think the paper's focus on executing a few specific algorithmic performance tests is maybe not an entirely representative approach.

I have been a Python developer for 20 years, and Python is ranked almost last for speed - around 75 times slower than C at the top spot. But even if this were the case across all uses, it ignores the fact that for compute heavy tasks where Python is employed in number crunching, it uses high performance libraries for the core processing functions. So NumPy is half C and runs all the big matrix manipulations in C.

Hence the API coding and setup is in Python, but it is not actually running everything 75 times slower than C; it is running maybe, at worst, half the speed of a pure C program. Plus that custom pure C program could well have taken a lot longer to write and be less reusable, so in total use way more energy than a Python version would. Especially for short lived code and Jupyter interactive coding orientated use cases, such as those common in the science and finance sectors.

There are further optimisation approaches such as Numba, which, when Python is being used for fast computational use cases, can compile straight to CUDA machine code for GPUs.
A paper comparing Java, Go and Python for IoT decision making similarly puts Go at the top for efficiency, but places Python above Java (presumably Python was using SciKit, hence C, for performance critical algorithm execution). So clearly the use case, and the methodology of the study, can make a huge difference to the measured efficiency.

The same could probably be said for a number of the other languages languishing at the bottom of the table: if measured executing a real world use case, rather than a pure language implementation, the results can be much improved.
However, for very nimble, lightweight micro-services, a directly compiled language like Go is going to use fewer resources than languages using JIT VMs and/or an interpreter.

Then there is the core point that most applications in the cloud are not highly intensive calculation based ones. The performance of the majority of applications is more likely to be dominated by the data I/O on the network between services and storage, where raw algorithmic performance has little impact.

What does matter is that running up parallel processes is simple and lightweight.
That core feature, along with the simplicity of Go and its small footprint, was designed specifically for cloud computing (see the sketch below). Which means becoming a Go programmer, or at least learning it, is a good choice for the more sustainable programmer.
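
As a rough sketch of what 'simple and lightweight' means in practice (a toy example, not a benchmark), launching ten thousand concurrent workers in Go is a few lines of code, and each goroutine starts with only a few kilobytes of stack:

    package main

    import (
        "fmt"
        "sync"
    )

    // Toy illustration of Go's lightweight concurrency: run 10,000 goroutines,
    // each doing a stand-in piece of work, and wait for them all to finish.
    func main() {
        const workers = 10000
        results := make([]int, workers)

        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                results[n] = n * n // stand-in for real work, e.g. handling a request
            }(i)
        }
        wg.Wait()

        fmt.Println("processed", len(results), "items")
    }

Doing the equivalent with operating system threads or processes would typically cost far more memory and scheduling overhead, which is exactly the kind of waste that adds up at cloud scale.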

It is also why ML/Ops will often use Python at the development and testing stages of ML models, but then switch to Go implementations for production.


Software Architecture


The architecture that is deployed to cloud has a huge impact on the efficiency of a cloud service, and hence its sustainability. Certainly it is going to have much more impact on energy wastage than the raw algorithmic performance of the language used.

The architectural nirvana for cloud services is that they are composed of many micro-services, each managing a discrete component of the service's functionality and each able to scale independently, to provide a demand driven, auto-scaled service that ramps up and down whatever components are required from it at any given time. Morphing itself to always provide just sufficient capacity. Not needing stacks of wasteful hot failover servers running without a job to do. Not getting overloaded at peak and failing to deliver on uptime.
The ideal sustainable use of hardware, always just enough, with virtualisation allowing millions of services to ramp up and down across the Cloud providers' vast shared hardware farms.

Clearly, combined with Big Cloud using the latest carbon neutral DCs, this ideal is much more sustainable than each company running its own servers and machine rooms 24/7 on standard grid non-renewable power, for a geo-local service that only approaches full capacity twice a day, and could probably be happily turned off 6 hours a night with nobody noticing.

From this perspective - one the big cloud vendors are keen to promote - Cloud is the sustainable solution, not the problem.

Unfortunately that ideal is often very far from the reality.

Software that is essentially monolithic in design can end up being lifted and shifted to the cloud with little refactoring. At best the application is maybe chopped up into a few main macro-services: the UI, a couple of backend services and the data store as another. Then some work is done to allow each to be deployed to Kubernetes as pods with 3 or more instances in production. Ideally the replicas are identical in role and have good load balancing implemented, or multi-master replication for the storage. But often primary-replica is as good as it gets.

Essentially an old redundant physical server installation, with a few big boxes to run things, is being re-implemented via k8s. Then that is repeated per customer, usage domain, geo-zone or whatever sharding is preferred. Big customers get big instances - the providers have wide sizing ranges for compute, storage etc.

It's better than just setting up a VM to replace each of your customers' on prem boxes - and basically changing almost nothing from the on prem installs - but any increased sustainability is only that provided by the Cloud vendor's DCs. The solution is not cloud scale with auto-scaling; it's repeated enterprise scale with a lot of fixed capacity in there.

For these cases maybe consider swapping out some elements with a cloud provider scaled service, eg the storage. Whether that is by using the Cloud provider's solution or a third party vendor's market place one.

Even for software that has been freshly written for the cloud, there can be architectures that consume excessive resources and are overly complex, sometimes because of the opposite issue. With the budget to rewrite for cloud, developers can leap too fast to all the cloud scale solutions - when the service has no need of them. For example deploying multi-region Kafka for event streaming and storage, when data could happily have been sharded regionally and put into a small Postgres or MariaDB cluster.

Another example is repeatedly firing up a 'micro-service' k8s job that is very short lived but uses a big fat monolith code base, so that 80% of the time and cost of the job is in the startup. This is where language matters more: the lighter and faster the language, the smaller the binary and its memory usage, the better.

The use of gRPC between micro-services can provide up to 10 times the speed of REST, which can then be reserved just for the whole service's API to the UI and CLI.

One key indicator of waste is the obvious one of cost. If your new cloud deployed application is generating usage costs that work out far more expensive than the TCO for its on prem deployment, then its architecture is not fit for cloud use. You are burning money and generating excess CO2.

Sadly, with architecture it all depends what suits the scale and use cases of a service. So there is no simple fix-it advice here.

Development, Testing & Release practices


Testing and release are probably the most important areas of Cloud software development that could benefit from more sustainable practices. This is perhaps more a pitfall of the rise of Docker and infrastructure as code than of Cloud itself, but the promise of replicable, automatically built software environments has delivered.

What it has delivered is a development to production life cycle where developers can spin up any number of their own development environments - even one per Jira ticket, perhaps automatically built on its creation.
In order to get merged with the release code, your team chooses to run the full E2E suite. It takes a little while, but it can be sped up by running the 5 clusters needed for each test environment case in parallel. These also stand up the whole environment, load it with fixture data and run E2E tests on it - maybe some infrastructure ones too, that fail over the storage and restore from backup.
But at least they should automatically tear down the test clusters at the end, whereas dev clusters can hang around for months without cleanup.
Then once it passes, it goes out to the dev environment, which has its own testing barriers for release to staging. Staging should have the same hardware resourcing as prod so that it properly tests that things will work there, perhaps with some load testing - or maybe that is done in another set of clusters.
Finally it gets to roll out to production, but maybe for safety prod has a set of canary environments it goes to first, for final validation before it can be rolled out to customers.

So to get 20 lines of code into production, we could easily have a process like the above, one that involves spinning up over 10 temporary k8s clusters and uses hundreds of longer lived ones. Just running the E2E and infra tests will take over an hour.

This is seen as good practice in the Cloud world: rigorous testing before release to production. It is pretty common for companies producing a cloud service, and since most software companies now have to produce a cloud version of their product to satisfy the markets, that is a lot of companies. For the first year or so, all this will be run at a cost of millions of dollars, with hardly any customers using it. Because that is what you do. Agile: get the product out, then grow and refine it and the team developing it. Build it and the customers will come.

This is a hugely wasteful process, and it is not far from Crypto in terms of generating emissions, for something that has no practical use yet.

If we do end up with a lot of customers, fine. But for services that are not multi-customer architecture - i.e. big revenue, small customer numbers - there may well be customer specific customisations of the product ❄❄snowflake alert❄❄. So the easy option is to duplicate as many clusters in dev and staging as are in prod, to cater for fully testing for those big clients. A great deal of duplicate resource spend.

So there should be a lot more consideration of sustainability when establishing the above practices for the development to release cycle.

One way to address this issue is to push as much testing as possible down the testing pyramid.
Unit testing is less useful for cloud, since the whole point of Cloud and micro-services is that each one does only one thing, and the full service is knitted together via API calls. Which means there may be very little functionality that can be tested by a unit test, since everything needs to be mocked.

However, that doesn't mean that things cannot be faked; fakes allow fast functional testing of micro-services. Fakes can mean full emulators of services, e.g. the Google Pub/Sub emulator, or running your gRPC services over gRPC's in-memory test fake, bufconn.

But the aim should be to establish a full fake test framework that can run up your service on your laptop - ideally without needing a k8s fake like kind to stand it up, since we don't want to fake the deployment environment, just the running code. Functional tests can then be written that can be used like unit tests to check PRs pass in seconds as part of a git workflow. Running those same tests at regular intervals against full deployments can validate that they correctly mimic them.
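
As a minimal sketch of the bufconn approach (using the stock gRPC health service as a stand-in for one of your own services, so the example compiles without any generated proto stubs), a functional test can run a real gRPC server and client entirely in memory:

    package faketest

    import (
        "context"
        "net"
        "testing"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        "google.golang.org/grpc/health"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
        "google.golang.org/grpc/test/bufconn"
    )

    // TestServiceOverBufconn starts a gRPC server and dials it over an in-memory
    // listener - no network, no containers, no cluster - then makes a real RPC.
    func TestServiceOverBufconn(t *testing.T) {
        lis := bufconn.Listen(1024 * 1024)

        srv := grpc.NewServer()
        healthpb.RegisterHealthServer(srv, health.NewServer())
        go srv.Serve(lis)
        defer srv.Stop()

        ctx := context.Background()
        conn, err := grpc.DialContext(ctx, "bufnet",
            grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
                return lis.DialContext(ctx)
            }),
            grpc.WithTransportCredentials(insecure.NewCredentials()),
        )
        if err != nil {
            t.Fatalf("failed to dial bufconn: %v", err)
        }
        defer conn.Close()

        resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
        if err != nil {
            t.Fatalf("health check failed: %v", err)
        }
        if resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
            t.Fatalf("expected SERVING, got %v", resp.GetStatus())
        }
    }

Tests like this run in milliseconds as part of a normal go test run, which is what lets most of the functional coverage live below the E2E layer.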

There should be layers of tests that validate code before the E2E test layer, not just unit and E2E, since then the validity of the code relies on full deployment testing. Full deployment testing should only be run as part of the release process; it should never be run at the PR validation level, as it takes too much time and energy.

Developers can have reasonably long lived personal dev clusters, not one per PR - maybe even resorting to shared dev clusters per team - to reduce spinning up excessive amounts of cloud resource for development.
Automated shutdown based on inactivity should be the norm.

Time should be invested in developing good sample data for non production environments. They should not resort to duplicating all customers, regions or whatever the sharding is, plus a bunch of test versions of them. If you have more things running in dev than in prod, you are doing things wrong.

Another route to take is to only have long lived deployed clusters in production, with temporary clusters for automated testing, and the use of feature flags to cater for final stage testing in production, in sandboxed feature-enabled clusters, prior to full release. This separates deployment from release - the latter can then be moved outside of engineering, once a flag has passed testing and validation.

Temporary clusters can use tools such as vcluster for automated short-lifespan k8s clusters, significantly reducing the resource usage and speeding up the spin-up time for dev clusters. Hundreds of pseudo-separate k8s clusters for dev and testing can be run within a single real k8s cluster.

Anything else?

The explosion in data is not just all video streaming. Observability is a huge topic; the amount of telemetry and logging that a well SRE-engineered service needs can be overwhelming. Clear management of that, and limits on retention (at least outside of cold - i.e. tape / optical - storage), are essential. That means things such as the ability to turn on higher debug logging levels for very restricted sets of environments only, and providing valid ML training data sets without filling up data lakes of hot storage, etc.
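
As one small, concrete sketch of that (a toy example using Go's standard log/slog package; the DEPLOY_ENV and DEBUG_ENVS variable names are made up for illustration), debug level logging can be gated on an explicit allow-list of environments, so only a handful of dev or canary deployments ever emit the expensive verbose telemetry:

    package main

    import (
        "log/slog"
        "os"
        "strings"
    )

    // newLogger enables debug logging only for environments on an explicit
    // allow-list, so production fleets do not flood hot storage with debug output.
    func newLogger() *slog.Logger {
        level := slog.LevelInfo
        env := os.Getenv("DEPLOY_ENV")       // e.g. "dev-eu-1", "staging", "prod-us-2"
        allowList := os.Getenv("DEBUG_ENVS") // e.g. "dev-eu-1,canary-us-1"

        for _, allowed := range strings.Split(allowList, ",") {
            if allowed != "" && allowed == env {
                level = slog.LevelDebug
                break
            }
        }
        return slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
    }

    func main() {
        log := newLogger()
        log.Debug("only emitted in allow-listed environments")
        log.Info("always emitted", "env", os.Getenv("DEPLOY_ENV"))
    }

The same gating idea applies to sampling rates and retention for traces and metrics as well.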

There are still so many more things that impact Cloud sustainability in terms of Cloud applications ... however this blog post is already unsustainably long 😀. So I think I should end it here.

The main point is that Cloud can be the sustainable option, but only if cloud engineers put in the effort to make it so, by pushing for the most sustainable architecture, development and release practices in our every day work.





Saturday 13 May 2023

Ten Software Engineering Managers

Engineering management from the perspective of the managed.

I have worked in the software industry for many years. Working in both the public and private sectors as an individual contributor software developer, SRE and cloud engineer. 
Along the way I have been managed by 15 different managers, along with having work interactions with around another 100. So naming no names I thought maybe it is worth distilling my managed life into a set of software manager caricatures.
To illustrate what makes a good software manager (and a bad one).

I accept that, since I have never chosen to become one, I am criticising a skill I have shown no interest in acquiring myself. I have only partially dabbled in it, via senior IC roles, i.e. Staff Engineer technical advocacy. However it is a good chance to let off steam ... and possibly a software manager may read this and reflect on which characteristics they may want to work on. So guys, it's time for your 360!

Remote managers are better

Having worked for many years in offices with my manager sat at the desk behind me, literally looking over my shoulder, I should probably also admit that in recent years I have chosen to work full time remote, ideally with my manager in another time zone, or at least another country. Luckily I got on with those managers that were literally breathing down my neck. But it certainly didn't help me get the job done.

Full time remote is probably not so good an option for those who are just starting their career. However for established engineers it does tend to insulate you from the various forms of toxic management out there and lets you get on with the job. It also requires you to be more explicit about the engineering process, collaboration and documentation, and hence be more productive in a more maintainable manner.

Bullying, micro-management, undermining, getting overly personal etc. can all still happen in front of colleagues on Slack and Zoom. But it is easier to shut down a conversation and walk away virtually, so that the manager has time to calm down and control their behaviour.

Managers skilled at managing remote international teams have to be good at targeted, succinct and effective communication. Especially if they only have a few hours in the day when the team members' time zones overlap.

So on average I have found remote managers to be better managers. Although that may just be that the UK is renowned for bad management - both anecdotally and in terms of the UK's productivity ranking. So managers that are from more productive countries than the UK are likely to be from a better management culture - hence better managers.

Peer level managers are better

In a traditional hierarchical organisation, the CTO is the most senior manager, and so remunerates people who are managers like themselves more. So, particularly in the public sector, there is an unwritten rule that even the most junior manager must be paid more than the most senior engineer.

This approach naturally leads to some rancour amongst senior technical staff who want to stay technical. It also devalues technical skills. Since to increase their salary technical staff must eventually give up all their years of technical skills and somehow gain 10 years of skill in people management overnight. Of course this doesn't happen so instead you get very novice people managers with a lot of largely irrelevant technical skills, and perhaps a personality totally unsuited to enabling their team.

It is easy to filter out such an organisation. Just get on one of the company feedback sites, e.g. Glassdoor, Fishbowl etc. Check what the top IC engineers' salary range is and check that it is higher than the junior management range. Ideally you should expect the top IC grade, e.g. senior technical architect, to be paid around 50% more than an engineering manager. But take it as a red flag if there is no technical grade that is paid more than any management grade.

Because it means the organisation doesn't value your engineering skills, and you are likely to be managed by someone who doesn't value them either and may regard themselves as your superior. So why would you work there? Surely better to work for a manager who treats you as a peer in an organisation that values your skillset.

Most people quit their job because they have a bad manager

The surveys tell us 43% of people leave their job because of a bad manager, with the next most important reason being a generally toxic culture / under-appreciation, whilst pay and progression comes in third.
We all want to get the number 1 manager, but unfortunately we often end up with number 10.
So most people leave their jobs because of getting one of the bottom ranked manager types.

HR's job is to minimize disruption to the company, so they will tend to take the side of the more senior employee unless that employee's behaviour is clearly proven to be detrimental to the organisation on a wider scale.

A few junior employees reporting them for incompetence or abusive behaviour does not usually qualify.
So if you do complain, it is unlikely to improve the situation, and may well make it worse if the manager is informed that you complained about them. Sadly I have not personally heard of any case where complaints about bad management resolved a problem, but I definitely have heard of cases where it made things a lot worse for the remaining time before the complainant's notice period was done.

So unless a more senior member of your organisation is on your side and decides to deal with the problem manager, it is best to leave your job if you have a bad manager. For a big company that may just mean changing departments. But leaving your job is safest in terms of removing yourself from a toxic environment, plus you can honestly report the manager as being the reason for leaving in your exit survey. It is a chance for you to help your ex-employer, though you shouldn't expect them to act upon it immediately. But given a sufficiently high attrition rate for a manager's team, the higher level managers should hopefully have enough sense and self interest to deal with their failing colleague.

Ten Software Managers

1. The Enabler 

The ideal technical manager is the enabler. They have better people skills than software skills.
They are most likely either to have never been an engineer, i.e. they entered the industry as a professional technical manager, or to have been an engineer some years ago who was more interested in the people than the code, and so changed direction fairly early in their career.
They will be well aware of the wider environment of the organisation and the stakeholders and drivers for the technical work. A great communicator, with knowledge of all the systems and processes, and so of how to unblock things and get stuff done procedurally. Plus a talented scrum master.
 
Most likely to say:  How should we fix that process for you?
Catchphrase: Thanks for your contribution 😊

2. The Big Cheese

You may find at some point in your career you get a manager who is actually a much more senior person in the organisation than their surface role as an engineering team manager. Maybe they founded the whole company, or they are the head of large department with other non-software engineering work.

To be high in the organisation they are likely to be a better manager than the average manager you might expect, with great people skills.

(Obviously this rule may not hold once you get to CEO level, I hope nobody who ended up with Elon Musk as their direct line manager is reading this ... although I don't think he was interested enough in software to directly manage software engineers, given he gave up coding at 20 before getting any formal training in it.)

They are likely to be more of a leader than a manager, but likely to be particularly good at fostering the development of their engineers. They are also going to have all the contacts to unblock any issue that may arise, plus have their finger on the pulse of high level changes and strategy that may impact your job.

They may still have a surprisingly high level of technical understanding of the company's software, but as a senior manager they also understand that their job is all about enabling an even higher level of understanding, and technical decision making about it, in their engineers.

On the down side, they probably don't actually have that much time to devote to you personally, so don't be surprised if they fail to act on things you have suggested to them. If there is any issue they make damn sure somebody in the team takes responsibility for sorting it out. So be prepared to be volunteered by them.

Most likely to say: How are you aiming to progress in the company?
Catchphrase: Isn't it amazing what we are building at (fill in organisation name)? 🏆

3. The Bro 

They were happily doing their engineering job when the manager for their team left. They were not necessarily the most technical member of the team, but they were the one who got on with everyone and the one that everyone in the team was happy enough to have as a manager. So they took the job.

They want to be your best friend and genuinely aim to protect you, as a member of their team, from problems or issues that are coming down from higher up the management hierarchy.

They are reasonably technically aware and skilled, but don't try to make the technical decisions or deal with issues unless nobody in the team steps up to the plate, then they take the task on themselves.

They are just one of the gang, but your manager too.

Most likely to say: Let me buy you a beer ... umm, sorry, everyone in the team has been made redundant.
Catchphrase: Yeah, what the hell are management up to 🍻

4. The Middle Manager

They used to be a techie, many years ago, but they weren't really interested in tech and hence were probably pretty mediocre at their job. But thankfully they got on the management track, result! They love the politics and intrigue of management way more than technical details. They have good people skills but find it hard to hide the fact they have absolutely no emotional investment or interest in technology.

They literally couldn't care less if the head of the organisation declared that from this day forth all software in the company will be written in Cobol and all open source was banned from use. If that is what their boss says, then their job is to listen to their team moan and complain about it. Then tell them that is what they will be using. Since any technical objections the team gives are meaningless to them.

On the plus side they do not micro-manage and they appreciate their team members skills, and are good at bringing those skills out.

The middle manager likes people and wants to please them. But knows that their job is only to please their superiors. From their team they just need compliance, and being good at their job is all about bringing their team onside, with whatever the higher ups require. However outlandish.

Most likely to say: Sorry it wasn't my choice, but come on, we need to get on with it.
Catchphrase: I raised your issue with senior management, but no go. 😢

5. The Invisible Man

The invisible man works in a big organisation and knows the value of his super power. He used to do a bit of useful work, but that was years ago, when he still took enough interest in his job to get to the bottom rung of the management ladder. But over the years he realised he could get away with doing less and less actual work. He mastered quiet quitting years ago, before it became a thing. 

Since then he (and his boss) has worked out that if he gets a team of reasonably senior, self starter engineers, then they don't actually like or want to be managed. So a difficult, opinionated team for some managers is actually perfect for the invisible man. Ideally if they are a distributed team, then he can "manage" them remotely, with the minimum level of work. Send the odd email maybe, do one or two Zoom calls a week, and his work is done.

His team may not respect him, they may even play jokes on him. But he so doesn't give a crap about work that he won't even notice them doing it. As a manager the invisible man is the mid-point of the top 10.
He is neither a good nor a bad manager; he is like having no manager at all. He will never support, challenge or rebuke you. At least nobody ever quits their job because they have the invisible man as their manager.

Most likely to say: Sorry just had to step out for a minute.
Catchphrase: Keep up the good work guys. 👻

6. The Over Employed

The over employed is a people pleaser. They like to say yes to everyone, including their managers and their team. They like to be seen dashing about doing things. So sure they will do that for you tomorrow ... but tomorrow never comes! Because the over employed is too busy. They may even have got themselves a second job on the sly, thinking they can juggle both at once. They are such an optimist, of course it will work out. 

They carry on saying yes to all those tasks that you need them to get done, to unblock your work. Just as they do to everybody else. So sure, they will sort out your performance review. Talk to the other manager that you need info from. But somehow they never seem able to deliver on time, if at all.

They will be there at 3am the night before a major deadline, chucking something together that is not quite finished and misses some vital component. But it is good enough, should do what is needed.

Poor people pleaser working their arse off to please people, so why is nobody that pleased with them?
But they stay cheerful,  not going to let those moaners drag them down.
Oh well if colleagues get too annoying they can always bail out. Get a job somewhere else and leave behind all those trivial little tasks that people kept bugging them with.

Most likely to say: Yes sure, I will do that.
Catchphrase: Hiya guys, why the long faces? 🤯


7. The Team Player

For the engineers in their team, the Team Player is one of the best bosses they have had. They always have their backs, support them and persuade those above them to funnel more resources and authority to their team.

They are ambitious and aiming to rise higher up the management tree, but loyal to their guys. They know their team is really the only one in engineering that is run properly. It is also the one doing the work that matters. They make sure they sweet talk those above them, and dedicate a reasonable portion of their time to making sure those above know that they and their team are the keystone keeping the organisation running.

They know that anybody looking to advance their career should be spending a good portion of each working day working on their own personal advancement. Don't be the fool who spends all day every day just doing the organisation's work.

Unfortunately their highly competitive nature and self-belief can lead to self-delusion. They start to believe their own self-promotional narrative. This leads them to be contemptuous of those annoying flakes in other teams who are not doing anything of real value. Though generally energetic and positive, those outside their circle and below their grade get the aggressive, bullying, dismissive and unpleasant side.

This behaviour also distorts the true importance and funding that the organisation should be devoting to their team's remit, to the detriment of other areas. So it can cause problems for the company as a whole, as well as for morale outside of their team.

Most likely to say: My team has got that, we will save the day.
Catchphrase: Get out of my way, unlike you, I have a real job to do. 💪


8. The Bureaucrat

Once upon a time, long long ago, some software companies decided they wanted to sell their wares to the ancient hierarchical institutions of government and learning. Those institutions believed in traditions and rules and processes and making things quantifiable. So something that did that for software management was perfect. So the companies came up with traditional names signifying regal wealth and power - Prince. Along with naming their software Enterprise software, signifying new, wealth-generating and boldly taking the initiative. That was in the 1980s.

It was the perfect sales pitch for these outdated institutions and they bought into it wholesale. Although it took them about 25 years to get around to it, old institutions are like that.

So their procurement processes and software and its life cycle and management were bound into reams and reams of bureaucratic processes. The IT managers in those institutions were groomed in the ways of Prince 2 and ERP and ITIL and all the rest of the snake oil the companies had invented. They devoted all their money and time to the training and meetings and processes around it. 

As far as the engineers in those institutions were concerned, a few of the processes were useful, but that was far outweighed by the whole bureaucratic burden and costs they were wrapped in.

The manager spoke at length for years, at far too many meetings, about the process and the newly procured systems, but unfortunately the quality and features of the institution's systems seemed to get worse, whilst their cost became much, much greater.
The institution employed more and more managers, although they just managed projects, not people. But eventually there were so many that they needed managers for them too.
But they hired no more engineers.
Eventually the institution decided it didn't really need any in-house software engineers at all; why were they writing software when they should be buying proper Enterprise software from the companies?

So the engineer realised they had to leave the kingdom of the Prince 2 and its manager and go to live in a different place altogether, where their business was making software. Strangely in the software republic they had never even heard of Prince 2. They vaguely knew of PMP, but nobody in their right mind would use it to make or manage software.  

Most likely to say: I am sorry I cannot talk to you, until you have filed a change request.
Catchphrase: Our KPIs show that we are on course for all our CSFs 👑


9. The Technical Superior

The technical superior is at least a grade or two above you and always interacts with you as your superior.
They were an engineer and still secretly prefer being an engineer to their current job. They were never the best technical engineer in a team, so they compensated by imagining they were the best at seeing the big picture, engineering-wise - and still are. So they decided to become a manager to make sure the right technical decisions are made.
 
They probably preferred their old job because they didn't have to spend so much time on politics and relating to people. They have been a full time manager for at least 5 years and their technical knowledge and judgement have dated badly. However they still see themselves as the person most qualified to make technical decisions. The more senior they become, the more out of date their technical knowledge becomes, yet the bigger and more expensive are the technical choices they make for their organisation. 

Most likely to say: Never ask a bunch of developers to decide on technology, ask 5 and you get 5 different answers.
Catchphrase: We really need a big monolithic Java XML SOAP web service to do that. 🙈


10. The Rockstar Techie

The very worst engineering manager is the rockstar techie.

Rockstar techies have great technical skills and may have technically saved the day a few times for senior management. So their lack of people skills is overlooked, and they tend to behave better towards people above them in the management hierarchy anyway.

But in the long term they are damaging to the quality of your codebase; the more senior they become, the more damage they do. The common technical issues they cause are blocking the devolution of architectural decisions and diverse input into them, possessiveness over code or technical knowledge, and wanting to be the saviour for technical problems, outages etc.

However the damage they do to the code is minor compared to the huge damage they do to the engineering team and culture if they are put in a management position.

They were often the most senior technical engineer in their part of the company. They have probably been there a while, and to justify getting another pay rise they got lumped with doing management too. They regard management as an annoying burden tacked on to the side of their real engineering job. They aim to remain the lead engineer in the team and make all the final technical decisions. They cannot devolve technical authority and have no interest in picking up any management skills. So they are most likely to exhibit basic-level fails in interpersonal skills: a technical superiority complex, being rude, moody, bullying, micro-managing etc.
 
As time goes on they must devote more and more time to management and yet cannot accept no longer being the most technically adept guy in the room. A paradox that can only be solved by driving away any members of staff who challenge them technically and effectively dumbing down the technical skills of the whole team.

Most likely to say: You Fffing broke it, you moron - when coming across a feature change implemented in a way they didn't expect.
Catchphrase: You are wrong, this is how it must be done, idiot 👹


(Any resemblance to persons living or dead is purely coincidental)

Friday 27 January 2023

Tech sector lay offs

INTRO: Having failed to post any technical articles for a few years I feel that my blog is at risk of dying from lack of attention. So to avoid that,  I have decided to mix up its content a bit and diversify from long Technical HOWTOs to more casual short posts whose tech content may vary (or not exist at all) ... so to kick off this one is a short rant about tech news!

Fake News about Tech Industry Collapse

A number of bigger tech companies are laying off staff at the moment. The press reports this as related to them making terrible errors in misreading trends post-pandemic and suddenly hiring way more staff than they normally would over the last year, 2022. Now reality has hit and the tech sector is awash with newly redundant workers as big tech desperately tightens its belt to survive. But is there any truth in either the premise of this argument or its conclusions?

A random recent example that repeats these ubiquitous assumptions ... https://fortune.com/2023/01/23/big-tech-layoffs-15-20-percent-next-six-months-top-analyst-says/ ... Citing the not for profit basket case, Twitter, being turned into a uniquely loss making zombie company by Elon Musk as though Google, Amazon and Microsoft had something in common with it as a business!

If you look at the graph of the employee count of these companies year on year, then only Microsoft hired more than usual last year, 2022, and Amazon did so in 2020 to cope with its pandemic boom. Google last had an uncharacteristic hiring spree in 2012. Similarly, Facebook's and other big tech companies' growth curves in recent years exactly followed those of the last decade. There was no extra hiring.

So why the layoffs? Nothing to do with over-hiring. Simple - look at the share price curves instead.
The market is valuing most big tech lower and recession looms, so they need to chop staff to chop costs and make their finances look better, to reduce that drop for their shareholders - the biggest of whom are the CEOs of those companies.

The big companies are still making billions in profit (they are not loss makers), so over the long term it would cost them less to retain talent, and they can afford to. However the CEOs' personal short-term loss in wealth is something they can't stomach, and it is a good excuse for a clear out.


Obviously there are large loss makers, the most prominent of which is Twitter, but they are special cases with failing business models - Twitter only ever made a profit in the run up to Trump's election as it became a huge engine churning up ideological conflict with political and conspiracy fictions. Without a politically polarised USA to drive an explosion of lies and social media wars, it was always a loss maker. It has nothing to do with big tech trends. 


Apple is a real exception to the trend, currently, but only because its share price hasn't dropped significantly yet. Hence it hasn't done its layoffs yet.

So the real reason big, profitable tech is laying people off is as a temporary fix to save a few billion from the bear market's current swipe at the personal wealth of their CEOs. Even though being a slave to short-term, market-driven fire / rehire cycles will cost the company more in the long term. It is purely a personal choice to save personal wealth: there was no over-hiring, there is no need for redundancy, there is no downturn in the growth of demand from the tech sector, there are just fewer cheap loans around to fund it.

 
The jump in the cost of borrowing is due to what looks to be a short term hike in interest rates, nothing to do with the tech sector itself. Largely due to Putin starting a war of revenge against his long dead ghosts from 35 years ago, when the Soviet Union fell as he presided over the KGB. A war against ghosts can never be won ... but unfortunately its far more terrible human cost will carry on for years.


This becomes clear on the other side of the market coin, late stage startups are often making no profit at all - because all profits along with loans are ploughed into growth. Because they must never be seen to be shrinking - to keep growing their valuation for IPO.


So a lot of software companies are hiring even though strictly they don't need to, whilst the big boys are firing when they don't need to.


Meanwhile the number of software jobs as a whole keeps growing at 10% year on year. The demand continues to outstrip supply and wage inflation follows. So if you are a talented (if overpaid) software engineer ... I wouldn't worry too much about the layoffs. It is just a chance to take a redundancy bonus and a 20% pay rise to try something new. Unfortunately for those that were used as a disposable foreign human resource by big tech, via job dependent visas, it is a different story. They may well not have time to stop their CEO's thoughtless greed needlessly disrupting their lives.

Many predictions are that interest rates will drop and the bear market end in a year's time. At which point the mass layoffs will be reversed. But the big tech companies will have lost a lot of money, and a great deal more trust, by following the market and each other so closely. The lesson employees will have learnt is that loyalty to such companies will never be rewarded or returned.


Thursday 13 February 2020

K8s Golang testing - Mocks, Fakes & Emulators

A lot of the Go code I write is developed against Google's Kubernetes API.
The API is fairly large and, given that the code is mostly calling K8s, it inherently has a set of complex dependencies. These dependencies have time and costs associated with running them up for real as K8s clusters in cloud providers' data centres.

So how can we test K8s Go code ... or any Go code with significant dependencies? We must use substitute objects that simulate the dependency objects. There are three common terms for these substitutes, known collectively as test doubles: mocks, stubs and fakes. Unfortunately these terms are often used pretty interchangeably. So before I start bandying them about, I had better define what I mean by these and related terms for this blog post ...

Stub = A function, method or routine that doesn't do anything other than provide a shim interface. If a stub returns values then they are dummy values (possibly dependent on calling args, or call sequence) that are either fixed or generated for a fixed range.

Mock = An object which replicates all or part of the interface of another object, using stub methods.

Fake = An object that replicates all or part of the interface of another object, and has methods which are not all stubs, so some methods perform actions that simulate the actions of the real object's method.

Emulator (full fake) = A package that has a significant amount of faked methods (rather than stubbed ones). For example a database server will normally provide an in memory database configuration that will completely replicate the core functionality of the database but not persist anything after the test suite is torn down.
Normally an emulator is not part of the code base and may require service setup and teardown via the test harness. As such, use of emulators tends to be for integration tests rather than unit tests.

Spy = A stub, mock or fake that records any calling arguments made to it. This allows for subsequent test assertions to be made about the recorded calls.
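
As a rough illustration of the difference between a stub and a spy (the Notifier interface and all names here are made up purely for the example):

package doubles

// Notifier is the dependency interface that the code under test calls.
type Notifier interface {
	Send(msg string) error
}

// stubNotifier is a stub: it only satisfies the interface and returns a fixed dummy value.
type stubNotifier struct{}

func (stubNotifier) Send(msg string) error { return nil }

// spyNotifier is a spy: a stub that also records its calling arguments,
// so a test can assert on what was sent and in what order.
type spyNotifier struct {
	sent []string
}

func (s *spyNotifier) Send(msg string) error {
	s.sent = append(s.sent, msg)
	return nil
}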

K8s Go Unit Testing

Given that unit testing by definition should not need any dependencies, then the assumption might be that for dependency heavy code, most unit tests will require test doubles ... the follow on assumption is that double == mock.

Mocks

Hence a standard approach to this is to use one of Go's many generic mocking libraries, such as https://github.com/vektra/mockery, https://github.com/stretchr/testify or https://github.com/golang/mock.

There are numerous tutorials and explanations available to get Gophers started with them, for example this walk through of testify and mock.

These mock tools all offer test generation from introspection of your API calls to the dependency, to reduce the maintenance overhead.
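
As a rough sketch of the approach, here is a minimal hand-written mock in the testify/mock style (the UserStore interface and names are made up for illustration; mockery or gomock would generate something equivalent for you):

package store_test

import (
	"testing"

	"github.com/stretchr/testify/mock"
)

// UserStore is a hypothetical dependency interface used by the code under test.
type UserStore interface {
	GetName(id string) (string, error)
}

// MockUserStore replicates the interface with stub methods driven by recorded expectations.
type MockUserStore struct {
	mock.Mock
}

func (m *MockUserStore) GetName(id string) (string, error) {
	args := m.Called(id)
	return args.String(0), args.Error(1)
}

func TestGetName(t *testing.T) {
	m := new(MockUserStore)
	m.On("GetName", "42").Return("Ada", nil)

	// The code under test would normally receive a UserStore; it is called directly here.
	name, err := m.GetName("42")
	if err != nil || name != "Ada" {
		t.Fatalf("got %q, %v", name, err)
	}

	// Fails the test if any expectation set with On() was not met.
	m.AssertExpectations(t)
}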

So for everything we write tests for, we generate mocks that reflect our code's API and confirm that it works as we expect. However there is a problem with this; the problem is described in detail in this blog post by Seth Ammons, or in summary by Testing on the Toilet.

The issues with mocks are:
  1. Mocking your code's calls to an API only models your usage and assumptions about it, it doesn't model the dependency directly. It makes your tests more brittle and subject to change when you update the code.
  2. Mocks have no knowledge of the dependencies they are mocking so for example as Google's APIs change - your real code will fail, but your mocked tests will still pass.
  3. Mocks may use call-sequenced response values, making them procedurally fragile, i.e. changing the order of your test calls can break mocked tests.
  4. If you want to swap out one library for another for a component, then because your mocks specifically validate that library's API, your mocks of it will need to be regenerated or rewritten.
So what is the alternative... 


Fakes

Refactor your code to be testable by using interfaces.

Break things down into simpler interfaces and create fakes that implement the minimum methods needed for testing. Those methods should perform some of the business logic in a simulated manner, so they test the code, and the relationships between method calls, better than pure stubs would. Your model of the dependency is direct rather than based just on your calls to its API, so it is arguably easier to debug when that model and the dependency (and your evolving use of it) diverge.
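
Something like the following minimal sketch, where the PodNamer interface and its fake are made up for illustration:

package podinfo

import (
	"context"
	"fmt"
)

// PodNamer is a hypothetical narrow interface that the code under test depends on,
// instead of depending on the whole Kubernetes clientset.
type PodNamer interface {
	PodNames(ctx context.Context, namespace string) ([]string, error)
}

// fakePodNamer is a fake: it simulates the dependency with an in-memory map,
// rather than returning canned values for specific expected calls.
type fakePodNamer struct {
	pods map[string][]string // namespace -> pod names
}

func (f *fakePodNamer) PodNames(ctx context.Context, namespace string) ([]string, error) {
	names, ok := f.pods[namespace]
	if !ok {
		return nil, fmt.Errorf("namespace %q not found", namespace)
	}
	return names, nil
}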

Use ready made Fakes

But back to K8s and Google APIs ... in some cases the Google component libraries already have fakes as part of the library. For example pubsub has pstest. So you can just add the methods required so that things work for your test. In which case faking can be simple ...
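
For instance, a rough sketch of using pstest, which spins up an in-memory Pub/Sub server that the real pubsub client can be pointed at (the project and topic names here are made up):

package pubsubtest

import (
	"context"
	"testing"

	"cloud.google.com/go/pubsub"
	"cloud.google.com/go/pubsub/pstest"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
)

func TestPublish(t *testing.T) {
	ctx := context.Background()

	// Start the in-memory Pub/Sub fake that ships with the client library.
	srv := pstest.NewServer()
	defer srv.Close()

	// Point a real pubsub.Client at the fake server over a local gRPC connection.
	conn, err := grpc.Dial(srv.Addr, grpc.WithInsecure())
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	client, err := pubsub.NewClient(ctx, "test-project", option.WithGRPCConn(conn))
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	topic, err := client.CreateTopic(ctx, "my-topic")
	if err != nil {
		t.Fatal(err)
	}

	// Publish a message; the fake stores it in memory for the lifetime of the test.
	res := topic.Publish(ctx, &pubsub.Message{Data: []byte("hello")})
	if _, err := res.Get(ctx); err != nil {
		t.Fatal(err)
	}
}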


The client-go library has almost 20 fakes covering most of its components (a sketch using its typed clientset fake follows the list below), but the only other fakes already in the K8s Go libs (that I could find!) are for pubsub and helm:

cloud.google.com/go/pubsub/pstest/fake.go
k8s.io/helm/pkg/helm/fake.go
k8s.io/client-go/tools/record/fake.go
k8s.io/client-go/discovery/fake
k8s.io/client-go/kubernetes/typed/core/v1/fake
... etc
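
As a rough sketch of the typed clientset fake mentioned above (the namespace and pod names are made up, and note the List signature varies slightly between client-go versions):

package k8stest

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestListPods(t *testing.T) {
	// Seed the fake clientset with a pod, as if it already existed in the cluster.
	clientset := fake.NewSimpleClientset(&corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "web-0", Namespace: "default"},
	})

	// Recent client-go versions take a context here; older ones omit it.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(pods.Items) != 1 {
		t.Fatalf("expected 1 pod, got %d", len(pods.Items))
	}
}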

However there are also third party libs for fakes such as 
https://github.com/fsouza/fake-gcs-server

Use custom built Fakes
If there is not an existing fake, or it doesn't fake what you need, then for the Google libs the APIs you need to replicate are not simple and you may want to simulate a number of methods for your tests. So manually creating the fake, and maintaining its API against the real Google API, becomes too much work compared with auto-generating mocks.

Google have sensibly anticipated this and hence released google-cloud-go-testing.
This package provides the full set of interfaces of the Google Cloud Client Libraries for Go, so there is no need to generate mock partial interfaces or maintain your own fake versions of its APIs.

As an example it can be used to create a fake GCS service, where data is just written to memory (in the global bucketStore variable)

The test substitutes the FakeClient for the real storage client. In order for the code to accept the real or fake client as the same type the library provides an AdaptClient method so both conform to the storage interface (stiface).
c, err := storage.NewClient(ctx, option.WithCredentialsFile(apiCredsFilename))
client = stiface.AdaptClient(c)
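
So production code can be written against the stiface interfaces rather than the concrete client, and a test can then pass in either the adapted real client or a hand-written fake that embeds stiface.Client and overrides only the methods it needs. A minimal sketch, with a made-up saveReport function (the import path is the google-cloud-go-testing repo):

import (
	"context"

	"github.com/googleapis/google-cloud-go-testing/storage/stiface"
)

// saveReport is a hypothetical production function written against the stiface
// interfaces instead of the concrete *storage.Client, so either the adapted real
// client or a fake can be injected by tests.
func saveReport(ctx context.Context, client stiface.Client, bucket, object string, data []byte) error {
	w := client.Bucket(bucket).Object(object).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}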


K8s Go integration Testing

For integration tests you ideally want to use the real dependencies, but if they are too slow or costly then they may well be best replaced by emulators.

Using gcloud emulators

Google also provides a number of full emulators to cater for speedy local integration testing,
https://cloud.google.com/sdk/gcloud/reference/beta/emulators which cover bigtable, datastore, firestore and pubsub.
So as part of your integration test setup you can fire up the datastore emulator, for example:


> export DATASTORE_EMULATOR_HOST=localhost:17067
> gcloud beta emulators datastore start --no-store-on-disk --consistency=1.0 
                                        --host-port localhost:17067 --project=my-project
The datastore client can then just be connected to the emulator for testing: the client library automatically detects the DATASTORE_EMULATOR_HOST environment variable and routes its calls to the emulator instead of the real service.
client, err := datastore.NewClient(context.Background(), "my-project")
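
For example, a rough sketch of a test run against the emulator, assuming DATASTORE_EMULATOR_HOST has been exported as above (the Task kind and key name are made up):

package datastoretest

import (
	"context"
	"testing"

	"cloud.google.com/go/datastore"
)

// Task is a hypothetical entity kind used only for this test.
type Task struct {
	Description string
}

func TestPutGet(t *testing.T) {
	ctx := context.Background()

	// NewClient targets the emulator automatically when DATASTORE_EMULATOR_HOST is set.
	client, err := datastore.NewClient(ctx, "my-project")
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	key := datastore.NameKey("Task", "task-1", nil)
	if _, err := client.Put(ctx, key, &Task{Description: "write more tests"}); err != nil {
		t.Fatal(err)
	}

	var got Task
	if err := client.Get(ctx, key, &got); err != nil {
		t.Fatal(err)
	}
}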
Using EnvTest and a local K8s API server

The EnvTest package creates a Kubernetes test environment that will start / stop the K8s control plane and install extension APIs. The K8s API server (and its etcd store) is by default a local emulator service (although it can also be pointed to a real K8s deployment for testing if desired).

EnvTest is wrapped up as part of Kubebuilder which is the primary SDK for rapidly building and publishing K8s APIs in Go. 

EnvTest caters for testing complex Kubernetes API calls of the type that might be required for testing a K8s operator, for example. Hence when generating the code for building an operator, kubebuilder uses the controller-runtime in its boilerplate to run this up for a template integration test.
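
As a minimal sketch of what an envtest based test looks like (assuming the control plane binaries envtest needs, kube-apiserver and etcd, are installed, and with a hypothetical CRD path):

package controllers_test

import (
	"testing"

	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func TestReconcileAgainstEnvTest(t *testing.T) {
	// Start a local kube-apiserver and etcd; CRDDirectoryPaths can point at any
	// CRDs the code under test needs (this path is hypothetical).
	testEnv := &envtest.Environment{
		CRDDirectoryPaths: []string{"config/crd/bases"},
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatal(err)
	}
	defer testEnv.Stop()

	// A controller-runtime client pointed at the test control plane.
	k8sClient, err := client.New(cfg, client.Options{Scheme: scheme.Scheme})
	if err != nil {
		t.Fatal(err)
	}

	_ = k8sClient // exercise the operator / reconciler code under test with k8sClient here
}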



Summary

So if your K8s Go code is tested only by mocks for unit tests and by running up a real Kubernetes cluster for integration tests, then maybe it's time to re-evaluate your testing approach and start using the tools for fakes and emulators that are available. The only issue is that they are quite numerous, with a mix of sources, so picking the right mix of Google internal lib, Google or third-party test package, or custom-built fakes and emulators becomes part of the task.