Optimising cloud costs with excellent engineering

March 3, 2021 Dirk Anderson

I have listened to cloud versus data centre TCO debates play out for over a decade now, and they can be futile. In many scenarios, those advocating one platform over another are not really drawing a full comparison between cloud and data centre costs. You may be paying for every unit of cloud cost by the minute, but those costs include every aspect of the platform that those services are deployed on: the compute, storage, network, storage fabric, data centre racks, DC cooling, DC fire suppression, DC specialised power, security systems, the bricks and mortar, the automation, the self-service portals, and the life cycle management processes and teams that manage the environments end to end. When I see comparisons being made, I rarely see a complete picture representing the true cost of owning traditional data centre facilities. That's no one's fault, because running global data centres at hyper-scale and providing them for everyone to use is simply not what most in-house technology teams do for a living.

Having said all that, I also realise that there are many scenarios where lifting and shifting workloads from traditional data centres to cloud won't stack up unless you can shed all the costs of running the platforms those workloads are being migrated from, and that can be very difficult. So one answer is to optimise cloud costs through elegant engineering. When building out cloud platforms and planning to migrate workloads and services, we all know the "low hanging fruit": turn off VMs when not in use, right-size VMs, make smart use of reserved instances, use scale sets and so on. This gets you some way. But the long-term answer is to make broader use of PaaS, containers, serverless and microservices.
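To put a rough number on the "turn off VMs when not in use" item, here is a minimal Python sketch of the arithmetic. The hourly rate and the 12-hours-a-day, 5-days-a-week schedule are hypothetical figures chosen for illustration, not real prices, and the function name is my own.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def schedule_saving(hourly_rate: float, hours_on_per_week: float) -> dict:
    """Compare an always-on VM with one that only runs on a schedule.

    hourly_rate is a hypothetical per-hour compute price; only compute
    that is billed per hour of runtime is modelled here.
    """
    if not 0 <= hours_on_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_on_per_week must be between 0 and 168")
    always_on = hourly_rate * HOURS_PER_WEEK
    scheduled = hourly_rate * hours_on_per_week
    return {
        "always_on": always_on,
        "scheduled": scheduled,
        # Fraction of weekly compute spend avoided by the schedule.
        "saving_fraction": 1 - hours_on_per_week / HOURS_PER_WEEK,
    }


# Assumed example: a dev VM needed 12 hours a day, Monday to Friday.
result = schedule_saving(hourly_rate=0.20, hours_on_per_week=12 * 5)
print(f"Compute saving: {result['saving_fraction']:.0%}")
```

With these assumed numbers the VM runs 60 of 168 hours, so compute spend falls by roughly 64%. Storage and other charges usually continue while a VM is stopped, so treat the figure as an upper bound rather than a forecast.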

Yes, I understand that this brings its own challenges. To name a couple: your business may have a strategy to avoid some of this due to "vendor lock-in", or you may be a large regulated bank that must be able to demonstrate a defensible exit strategy and approach to your regulators. Elegant cloud, PaaS and container platform engineering is a big part of the answer. Make that investment up front and you avoid the long-term operational overheads of not doing so, not to mention the business risks. So what does good cloud, container and PaaS engineering look like? Here are just a few attributes you should look out for:

Make sure you have built rock solid cloud environment foundations.
Build cloud, container and PaaS infrastructure with code where possible, especially for frequent, routine processes (toil).
Make sure you can build your cloud platforms in other viable cloud regions using the same code base.
Plug your IaC codebase into CI/CD pipelines and manage your code the way enterprise-grade developers manage theirs.
Employ hardened builds and configurations where you can.
If using IaaS then use cloud-native IaaS.
Monitor those builds and configurations for a compliant desired state and automatically remediate non-compliant configurations where possible.
Deploy apps and services onto the platform with code.
Ensure excellent test coverage of your code when building applications and services using the broad array of cloud services.
If you need to show you are not “locked in” to a cloud hyper-scaler, then use the rich array of PaaS services on offer but deploy them using code, and think through how you would refactor and redeploy elsewhere using alternative cloud or data centre services. Container platforms provide an abstraction layer that becomes your ally here.
Keep your data out of these platforms so they remain stateless, and manage the storage and data layer with its own software-defined characteristics.
Manage the lifecycle of these platform services using code where possible.
And so on.
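The "monitor for a compliant desired state and automatically remediate" attribute in the list above can be sketched in a few lines of Python. This is only an illustration of the idea: the resource name, the setting names and the in-memory dictionaries are all hypothetical, and a real platform would read and write configuration through its cloud provider's policy or management tooling rather than plain dictionaries.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Drift:
    resource: str
    setting: str
    expected: Any
    actual: Any


def find_drift(desired: dict, actual: dict) -> list[Drift]:
    """Return every setting whose actual value differs from the desired one."""
    drifts = []
    for resource, settings in desired.items():
        current = actual.get(resource, {})
        for setting, expected in settings.items():
            if current.get(setting) != expected:
                drifts.append(
                    Drift(resource, setting, expected, current.get(setting))
                )
    return drifts


def remediate(actual: dict, drifts: list[Drift]) -> dict:
    """Return a copy of the actual state with each drifted setting corrected."""
    fixed = {name: dict(settings) for name, settings in actual.items()}
    for d in drifts:
        fixed.setdefault(d.resource, {})[d.setting] = d.expected
    return fixed


# Hypothetical hardened baseline and an environment that has drifted from it.
desired_state = {"storage-account-1": {"https_only": True, "public_access": False}}
actual_state = {"storage-account-1": {"https_only": True, "public_access": True}}

drifts = find_drift(desired_state, actual_state)
remediated_state = remediate(actual_state, drifts)
```

Keeping detection separate from remediation means the same check can run in report-only mode first, which is often a sensible step before letting automation change a live environment.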

Depending on the size and availability of your technology team, this can seem like a mountain to climb, but getting this stuff right at the start will pay huge dividends in the long run. As part of our standard Azure and container readiness assessment, we include some cost optimisation automation checks alongside the CIS benchmark and general Azure platform health checks that we run. This automation gives us quick insight into the overall platform health posture and allows us to make a quick and valuable impact on our customers' existing deployments. These checks are geared towards identifying key gaps, so that we can make plans to work on quickly with customers and help optimise both the platform architecture and the associated costs.