Wednesday 5 July 2023

Sustainable Coding, and how do I apply it to myself as a Cloud engineer?

 I work as a developer of a Cloud service, Big Animal - EDB's Cloud Postgres product. So I went along to a meetup the other day, a panel discussion on Leveraging Cloud Computing for Increased Sustainability

It got me thinking about this whole issue, and how in a practical sense I could do anything that might reduce the carbon footprint of the service I work on. 

The conclusion I came to was that I don't really know ... and to some extent neither did the panel, so Cloud computing may give you some fancy tools to help assess these things, such as Microsoft Sustainability manager. But there are no black and white answers as to what would make something more sustainable - even the basic one of - run it in the cloud or on prem,  very much depends on what you are running and how. For one or other to work out as the more sustainable.

So on a global scale just how significant is computing as a percentage of global energy consumption and emissions?

The Cloud Climate Issue

Comparing today with 30 years ago is useful in terms of seeing where we are going...


1990s vs 2020s IT as a proportion of global energy and emissions

  • 1990s Energy 5% (Most from Office desktop computers and CRTs)   - 2% emissions
  • Today 8% (Most personal devices, laptops and mobile, includes 2% data centres) - 3% emissions

  • Compute power / storage  is around 30,000 times greater (by Moore's Law)
  • Data 16 Exabytes (EB) has grown to 10,000 EB so 600 times, with the majority in the last 3 years

Today data centres (hence the Cloud) is causing 2% emissions, as much as the whole of IT in 1990 and as much as today's aviation industry.

So working as a cloud engineer looks like a poor choice for someone concerned about climate change!

But on the face of it we have been pretty efficient, our compute and storage has massively increased, yet consumption + emissions only by 50%. But the issue is the acceleration in usage, which means we could double energy and emissions in 20 years, if nothing was done to improve sustainability.

The increase in compute power has remained fairly consistent since the advent of the transistor making Moore's Law, more a law of Physics than of human behaviour. Although of course that technology is now at its limits of miniaturisation. So the energy and emissions consumed per Gigaflop of compute has drastically dropped - but now everyone has the compute power of a supercomputer in their pockets.
The first supercomputer to reach 1 GFlop was the Cray in the 80s, by the 90s an IBM 300 GFlops supercomputer beat Gary Kasparov at chess - today a Google Pixel 7 phone is 1200 GFlops.
Hence our consumption has rather outstripped our increase in compute.

But it is the explosion in data that is a story of human behaviour.  Hand in hand we have reduced costs of cloud storage and monetised personal data. With software companies valued based on how many customers, and more importantly how much customer data they have. Recent advances in AI have proved the value of big data lakes for training models to produce practical ML applications. 

Combine that with the problem of induced demand. The more and bigger roads you build, the more traffic you get. Cloud puts a 6 lane highway outside everybody's front door.

How do we measure sustainability

So within the world of commercial sustainability, and carbon off setting, there is a basic concept  to categorize things as scope 1-3 emissions.

  1. Scope 1 covers emissions from sources that a company owns or controls directly.
  2. Scope 2 are emissions that a company causes indirectly and come from where the energy for the services it purchases and uses is produced.
  3. Scope 3 encompasses everything else. So suppliers energy use, etc.
The assumption is that raw energy consumption is not the issue. It is the generation of climate changing emissions to generate that energy that is the metric.  

This includes mining for minerals to build laptops and data centres, etc. But if you run your own green energy solar farm next to your data centre, and that is direct powering, without any significant battery storage. Plus feeding energy back to the grid, you can be pretty much carbon neutral. You can also fund renewable energy projects and offset. 

So strangely perhaps, given how full cooling can treble hardware life span. The biggest data centres in the world are currently built in the world's desserts rather than at the north pole. Solar and wind can be relied upon for more than 100% power.
  • Microsoft Azure was carbon neutral in 2012. It is aiming for 2030 for its whole business (then for 2050 to removing all its carbon debt since it was founded in 1975)
  • Google Cloud became carbon neutral for all its data centres in 2017, also aiming for 2030 for all its business.
  • Amazon is aiming for AWS cloud neutral by 2025, and as a global retail supplier to do the same for its whole business by 2040.
Of course this is not possible in most European countries, so most carbon neutral data centres in Europe will be from purchasing carbon neutral generated energy, rather than actually being neutral in themselves. Although some go a long way down that road, to partner with renewable energy suppliers and tick a number of other sustainability boxes. The problem is if data centres are buying up lots of the renewable energy supply at a premium, then that means they are removing it from residential or other uses. So this is hardly helping global sustainability and in reality means they are far from neutral.

Also carbon neutral means only that scope 1 is covered. Net zero is a standard above carbon neutral where to deal with scope 2 and 3, emissions must be taken out of the atmosphere. So that in practise only a net zero supplier is actually contributing nothing to climate change.  No cloud provider is net zero.

A key point is that the latest enormous scale cloud provider data centres are not the main source of emissions, it is all the older, smaller, more local data centres and machine rooms of servers that are causing the majority of the emissions. In the same way that car pollution is disproportionately down to older vehicles. Of course there is the manufacturing footprint to consider for cars that can last 40 years, but all computer hardware has a much shorter lifespan of 3-5 years. Obsolescence makes increasing the lifespan uneconomic. Another green issue that could fill a blog post on its own.

So moving to cloud provider's services and migrating any remaining on prem to the cloud is the sustainable thing to do, as long as what is moved is suited to cloud, or can be re-architected for the cloud.

What changes, as a developer, could improve sustainability?

Life Style

So the obvious thing that people think of, is the nature of their employer's work. Or perhaps if your company is a B2B one, do they have green standards wrt. the clients that they work with. For example it may not make sense working for ExxonMobil as the company with the world's largest emissions. Perhaps the tech industry equivalent would be working on Crypto currency? But Blockchain developers are working on that reputation, even coming up with useful uses for it!, such as auditing sustainability usage for scope 2 and 3 verification. 

Over half of internet traffic these days is video streaming, so stop watching Netflix and scrolling on TikTok - and read or listen to books instead is maybe a good behaviour 😉
On the plus side porn has dropped from its high as 25% of internet traffic down to around 10%, but it has been more than replaced by cat and side hustle millionaire videos it seems. So if your side hustle is being a prolific social YouTuber, it may not be the most ecological of life choices. Since an hour long short story of digital text is 100 Kb whilst the same as a 4k video is a million times bigger at 10 Gb.

On a personal level, my previous employer was more office orientated. So it was keen to encourage people into the office with free food etc. it encouraged commuting in to work, and the maintenance of offices with permanent desk space for every employee, monitors, heating etc. and all the unnecessary extra emissions that entails. My current one is more remote-first.

In terms of remote work, having experienced pandemic lock downs in a city. I was going out for a regular cycle for exercise, so I can confirm that the reduction in emissions may have only been measured at 20% across the whole of Britain, but in the cities it felt more like 50% - the air was so much more breathable.Whilst maximising WFH is not equivalent to pandemic lock downs, it does make a difference. So changing jobs in the tech sector, to a full-time remote position, is certainly a worthwhile contribution to sustainability.

There is the argument that if we all lived alone in big drafty castles which could be turned off for the day by packing into an office a walk away. Then remote working is not more sustainable. But the reality of IT work today, especially with hybrid working, is that the big fairly empty building you are more likely to be in these days, is the office.

So become full time remote if you can. If you have to work for an office based employer then choosing one that has hot desking, smaller offices, less frequent attendance and live within walking or cycling distance, are all part of being sustainable wrt. your tech job.

Sustainability for a Cloud SaaS company

I work for a company that produces a cloud market place software product, with most engineers working remotely and running no servers at all, just employees laptops, ie everything we run is via cloud providers services. We have a few offices globally but only a minority of engineers use them. Since all teams are largely remote, there is no office, no paperwork, commute or physical products.

So the same applies to all our other services, eg. from CI to presentations. From LaunchDarkly to our CRM. From expenses to online mental health support etc. Plus Slack and Zoom for comms.

This is a pretty common model, your could call it a server-less company, it was the same at my previous employer. We sell SaaS and we use it for everything internally too

Therefore the assumption is that the problem of working out 2 and 3 should be solved by those cloud providers, which to some extent it is ... maybe some less than others. But emissions data can be obtained for scope 2 and 3 from them.

So that leaves scope 1. This may be hugely affected by how much face to face sales and marketing goes on etc, but that is not my area. So I am purely going to focus on what options are there to improve sustainability wrt. to the software architecture, development and deployment practises available for producing a cloud based software service, SaaS. Since those are the areas that as a software engineer I can influence.

So lets break that down to some basic elements, and work out what are the more sustainable practises and approaches.

Cloud vs. On Prem

So first things first. Is working for a company that runs everything on cloud, and delivers a cloud based product a good thing, versus writing software for running in a local server room or data centre?

Assuming you use one of the big carbon neutral cloud providers and are using virtualisation to scale capacity efficiently with usage. Then it is likely that a Cloud data centre will be run much more sustainably than a local data centre where you may house your own servers and certainly a local machine room. So even if you are running a specialised HPC data-centre where the majority of traffic is local ... third party providers will be able to offer more sustainable options.

Of course if you software is entirely unsuited to cloud virtualisation (k8s micro-services etc) or badly designed for it, you could actually be running up way more resources than a local monolithic solution on a few dedicated servers would. So sustainability goes all the way down through the architecture to the lines of code, and what they are written in.

A whole load of legacy software dumped onto the cloud can be less sustainable (and way more costly) to run than running it locally. 

So another sustainable employment decision, is to not work for an organisation that either has a lot of legacy software or has its  own servers or data-centres, or at least only if they are bigger than a soccer pitch (ie average DC size or bigger) and have their own adjacent wind farm or other local renewable power source.

But if like my employer, everything is run on the three major cloud providers, and there is very little in the way of scope 2 and 3, then is the sustainable business box ticked already?

Unfortunately not, as mentioned, they are not net zero and ~2% of global emissions are from running data centres, so whilst that may be disproportionately from ones that are not the self powered giant DCs used by the big cloud vendors. Being as efficient as possible wrt. use of Cloud is still the key to being a sustainable tech worker. Especially with the projected growth in Cloud and its emissions being a significant ecological concern.

Choice of software languages


So the reference paper often quoted (and misinterpreted) for software language sustainability is this Portuguese University paper on Energy Efficiency across Programming Languages.

Where we could perhaps regard sustainability as the combined goal of minimising energy = performance, time and memory usage (table 5. in the paper). So leaving out the older / less mainstream languages we have ...
  1. C, Go
  2. Rust, C++
  3. Java, Lisp
  4. C#
  5. Dart, F#, PHP
  6. Javascript, Ruby, Python
  7. TypeScript, Erlang
  8. Lua, JRuby, Perl
So on that basis we should write everything in C or Go or possibly Rust, maybe even Java if we are not that eco-friendly.

Whilst I do use Go for writing Cloud micro-services, I think the paper's focus on executing a few specific algorithmic performance tests is maybe not an entirely representative approach.

I have been a Python developer for 20 years and Python is ranked almost last for speed. 75 times slower than C at the top spot. But even if this were the case across all uses, then it ignores the fact that for compute heavy tasks where Python is employed in number crunching, it uses high performance libraries for the core processing functions. So Numpy is half C and runs all the big matrix manipulations in C.

Hence the API coding and setup is in Python but it is not actually running everything 78 times slower than C, it is running maybe at worst, half the speed of a pure C program. Plus that custom pure C program could well have taken a lot longer to write and be less reusable, so in total use way more energy than a Python version would. Especially for short lived code and Jupyter interactive coding orientated use cases such as used in the science and finance sector. 

There are further optimisation approaches such as Numba, when Python is being used for fast computational use cases which can compile straight to CUDA machine code for GPUs. 
A paper comparing Java, Go & Python for IoT decision making. Similarly puts Go at the top for efficiency, but places Python above Java (presumably Python was using SciKit hence C for performance critical algorithm execution). So clearly the use case and the methodology of the study, can make a huge difference in the measured efficiency.

The same could probably be said for a number of the other languages languishing at the bottom of the table. If measured for executing a real world use case rather than a pure language implementation, the results can be much improved.
However for very nimble light weight micro-services then a directly compiled language like Go is going to use less resources than languages using JIT VMs and/or an interpreter. 

Then there is the core point that most applications in the cloud are not highly intensive calculation based ones. The performance of the majority of applications are more likely to be due to the data I/O on the network between services and storage. Where raw algorithmic performance has little impact.

What does matter is that running up parallel processes is simple and lightweight.
That core feature, along with the simplicity of Go and its small footprint were designed specifically for cloud computing. Which means, becoming a Go programmer, or at least learning it. Is a good choice for the more sustainable programmer.

It is also why ML/Ops will often use Python at the development and testing stages of ML models, but then switch to Go implementations for production.


Software Architecture


The architecture that is deployed to cloud has a huge impact on the efficiency of a cloud service, and hence its sustainability. Certainly it is going to have much more impact on energy wastage than the raw algorithmic performance of the language used.

The architectural nirvana of cloud services are that they are composed of many micro-services, each managing a discrete component of the service's functionality and each able to scale independently to provide a demand driven, auto-scaled service that ramps up and down whatever components are required from it at any given time. Morphing itself to provide always sufficient capacity. Not needing stacks of wasteful hot failover servers running without a job to do. Not getting overloaded at peak and failing to deliver on uptime.
The ideal sustainable use of hardware, always just enough. Virtualisation allowing millions of services to ramp up and down across the shared Cloud provider DCs vast hardware farms.

Clearly, combined with Big Cloud using the latest carbon neutral DCs, this ideal is much more sustainable than each company running its own servers and machine rooms 24/7 on standard grid non-renewable power, for a geo-local service that only approaches full capacity twice a day, and could probably be happily turned off 6 hours a night with nobody noticing.

From this perspective, one the big cloud vendors are keen to promote, Cloud is the sustainable solution not the problem.

Unfortunately that ideal is often very far from the reality.

Software that is essentially monolithic in design can end up being lifted and shifted to the cloud with little refactoring. At best the application is maybe chopped up into a few main macro-services. UI, a couple of backend services and data store as another. Then some work done to allow each to be deployed to Kubernetes as pods with 3 or more instances in production. Ideally the replicas are identical in role and have good load balancing implemented, or multi-master replication for the storage. But often primary-replica is as good as it gets.

Essentially an old redundant physical server installation with a few big boxes to run things is being re-implemented via k8s. Then repeat that per customer, usage domain, geo-zone or whatever sharding is preferred. Big customers get big instance's - the providers have wide sizing ranges for compute, storage etc.

Its better than just setting up a VM to replace each of your customer's on prem boxes - and basically changing almost nothing from on prem installs, but any increased sustainability is only that provided by the Cloud vendor's DCs. The solution is not cloud scale with auto-scaling, its repeated enterprise scale with a lot of fixed capacity in there.

For these cases maybe consider swapping out some elements with a cloud provider scaled service, eg the storage. Whether that is by using the Cloud provider's solution or a third party vendor's market place one.

Even for software that has been freshly written for the cloud there can be architectures that consume excessive resources and are overly complex, some times because of the opposite issue. So with the budget to rewrite for cloud, then developers can leap too fast for all cloud scale solutions - when the service has no need of them. For example deploying multi-region Kafka for event streaming and storage, when data could happily have been sharded regionally and put into a small Postgres or MariaDB cluster. 

Repeatedly firing up a 'micro-service' k8s job that is very short lived but uses a big fat monolith code base, so that 80% of the time and cost of the job is in the startup. This is where language matters more, the lighter and faster the language, the smaller the binary and its memory usage, the better.

The use of gRPC between micro-services provides 10 times the speed of REST, which can be reserved just for the whole service API to the UI and CLI.

One key indicator of waste is the obvious one of cost. If your new cloud deployed application is generating usage costs that work out far more expensive than the TCO for its on prem deployment. Then its architecture is not fit for cloud use. You are burning money and generating excess C02.

Sadly with architecture it all depends what suits the scale and use cases of a service. So there is no simple fix it advice here.

Development, Testing & Release practises


Testing and release are probably the most important area of Cloud software development that could benefit from more sustainable practises. This is perhaps more a pitfall of the rise of Docker and infrastructure as code, rather than Cloud itself, but the promise of replicable automated built software environments has delivered. 

What it has delivered is a development to production life cycle where developers can spin up any number of their own development environments - even one per Jira ticket, automatically built on its creation perhaps.
In order to get merged with the release code your team choose to run the full E2E suite. It takes a little while, but we can speed it up by running the 5 clusters we need in parallel for each test environment case. These also standup the whole environment, load it with fixture data and run E2E tests on it, maybe some infrastructure ones too, that failover the storage and restore from backup.
But at least they should automatically teardown the test clusters at the end, where as dev clusters can hang around for months without cleanup.
Then once it passes it goes out to the dev environment which has its own testing barriers for release to staging. Staging should be the same hardware resourcing as Prod so that it properly tests that it is working for it, perhaps with some load testing or maybe that is done in another set of clusters.
Finally it gets to roll out to production, but maybe for safety prod has a set of canary environments it goes to first, for final validation before it can be rolled out to customers.

So to get 20 lines of code into production. We could easily have a process like the above, that involves spinning up over 10 temporary k8s clusters and uses hundreds of longer life ones. Just running the E2e and infra tests will take over an hour.

This is seen as good practise in the Cloud world. Rigorous testing before release to production. It is pretty common for companies producing a cloud service. Since most software companies now have to produce a cloud version of their product to satisfy the markets then that is a lot of companies. For the first year or so, all this will be run at the cost of millions of dollars, with hardly any customers using it. Because that is what you do. Agile, get the product out, then grow and refine it and the team developing it. Build it and the customers will come.

This is a hugely wasteful process, and it is not far from Crypto in terms of generating emissions, for something that has no practical use yet.

If we do end up with a lot of customers fine, but for services that are not multi-customer architecture, ie big revenue small customer numbers, there may well be customer specific customisations of the product ❄❄snowflake alert❄❄ So the easy option is the duplication of as many clusters in dev and staging as are in prod, to cater for fully testing for those big clients. So a great deal of duplicate resource spend.

So there should be a lot more consideration of sustainability when establishing the above practises for the development to release cycle.

One way to address this issue is to push as much testing as possible down the testing pyramid.
Unit testing is less useful for cloud since the whole point of Cloud and micro-services is to do only one thing and knit together via API calls the full service. Which means there may be very little functionality that can be tested by a unit test, since everything needs to be mocked.

However that doesn't mean that things cannot be faked, fakes allow fast functional testing of micro-services. Fakes can mean the full emulator's of services, eg. Google pub sub. Or running your gRPC services over its test fake, local memory, bufconn.

But the aim should be to establish a full fake test framework that can run up your service on your laptop. Ideally without the need of a k8s fake like kind to stand it up. Since we don't want to fake the deployment environment - just the running code. Functional tests can then be written that can be used like unit tests to check PRs pass in seconds as part of a git workflow. Running those same tests at regular intervals against full deployments can validate that they correctly mimic them.

There should be layers of tests that validate code before the E2E test layer and do not just have unit and E2E, since then the validity of the code relies on full deployment testing. Full deployment testing should just be run as part of the release process,  it should never be run at the PR validation level, it takes too much time and energy.

Developers can have reasonably long lived personal dev clusters not one per PR, maybe even resort to shared dev clusters per team, to reduce spinning excessive amounts of cloud resource for development.
Automated shutdown based on inactivity should be the norm.

Time should be invested in developing good sample data for non production environments. They should not resort to duplicating all customers, regions or whatever sharding. Plus a bunch of test versions of them. If you have more things running in dev than in prod, you are doing things wrong.

Another route to take is to only have production for long lived deployed clusters. With temporary clusters for automated testing and the use of feature flags to cater for final stage testing in production sandboxed feature enabled clusters, prior to full release. So this separates deployment from release - the latter can then be moved outside of engineering, once a flag has passed testing and validation.

Temporary clusters can use tools such as vcluster for automated short lifespan k8s clusters, significantly reducing the resource usage and speeding up the spin up time, for dev clusters. Hundreds of pseudo separate k8s clusters for dev and testing can be run in a single k8s cluster.

Anything else?

The explosion in data is not just all video streaming. Observability is a huge topic, the amounts of telemetry and logging that a well SRE engineered service needs can be overwhelming. Clear management of that, and limits on retention (at least outside of cold - ie tape / optical - storage) are essential. Such things as the ability to turn on higher info debugging levels for very restricted sets of environments. Provide valid ML learning data sets without filling up data lakes of hot storage, etc.

There are still so many more things that impact Cloud sustainability in terms of Cloud applications ... however this blog post is already unsustainably long 😀. So I think I should end it here.

The main point is Cloud can be the sustainable option, but only if cloud engineers put in the effort to make it so, by pushing for the most sustainable architecture, development and release practises in our every day work.





1 comment:

  1. Thank you for this! Very thoughtful and detail oriented. It makes so much sense yet news articles keep posting that Python is so slow so people should not use it...

    ReplyDelete