BlakYaks

Journey to cloud native

This blog shares key considerations for organisations, particularly technology infrastructure teams, as they navigate the transition to cloud-native technologies. There is pent-up demand across the industry from application development teams who want their infrastructure colleagues to provide the tools and services that enable them to build microservice architectures. These architectures depend on containers, platform-as-a-service (PaaS) and serverless technologies. Business leaders want new and improved digital and online capabilities, and they want technology features and functions released at a higher cadence.

A cloud-native strategy, when executed well, has the potential to deliver significant advantages including, but not limited to:

  • Significant business cost savings versus traditional "server hosting" methods.

  • Faster, higher-cadence application releases, delivery and upgrades.

  • Improved time to market for digital business services.

  • Reduced errors through embracing small, rapid, frequent code and platform changes.

  • Improved test coverage leading to more stable and predictable platform and application releases.

  • Modular, de-coupled architectures that are easier to manage at scale leading to additional cost savings and more agile change/release processes.

  • Blending cloud provider PaaS services and open-source technologies to deliver rich customer experiences at the right cost profile.

An adjustment for Infra teams

There is, however, a significant challenge that infrastructure teams must prepare for along this journey. Senior leaders must also recognise that the industry sea change we are seeing now is one that arrives roughly every 15 years. We are making a slow but sure shift away from Infrastructure as a Service (IaaS) as the preferred hosting method, towards PaaS, containers, serverless and SaaS.

Around 30 years ago, mainframes started to be replaced by distributed physical client-server platforms. Approximately 15 years later, virtualisation and hypervisors began replacing physical deployment methods to improve consolidation ratios, centralisation and management. While containers, PaaS and serverless technologies are not new to many in our industry, especially application development teams, IaaS has remained the dominant deployment method, particularly on-premises. Even in cloud environments like Microsoft Azure and AWS, both of which offer a huge array of PaaS services, IaaS has been prevalent. There are several reasons for this, including but not limited to:

  • Lift and shift - many enterprises have adopted a “lift and shift” approach for migrating services to cloud (IaaS to IaaS).

  • Before Kubernetes came along, container technologies were notoriously difficult to manage at scale, so while they became very popular with application development teams, they were naturally less appealing to IT operations teams.

  • Enterprise IT operations teams have been getting to grips with managing large, sprawling IT estates that straddle on-premises and cloud, sometimes across multiple cloud locations and vendors. Being able to manage this variety of deployment venues is a foundational, prerequisite step towards managing a wider array of new cloud-hosted technologies.

  • The ecosystem for these emerging technologies has been maturing slowly but surely, and the industry needed to reach a point of maturity that enabled large-scale deployments that could be managed efficiently throughout their whole life cycle.

Many business leaders now realise that embracing these technologies is critical to their ongoing success when competing in the digital marketplace. The industry is reaching an inflection point where many of the major barriers have been overcome and technology teams are more prepared than ever to adopt new platform hosting methods at scale. Leading industry analysts are predicting a significant shift towards containers, PaaS and other microservice-enabling technologies over the next two to three years. Many large businesses are already managing production implementations of general-purpose cloud platforms, and IT operations teams have re-skilled accordingly to manage hybrid foundations. Technology teams are now planning to deliver and manage integrated container, PaaS and serverless technologies.

Organisations are at different levels of maturity, but generally the investment and re-skilling to prepare for the next phase of the journey is underway within the sector.

Now that many businesses have mastered running foundational hybrid cloud services, the next hurdle for enterprise IT teams is how they fully prepare to support PaaS, serverless and containers in a variety of deployment scenarios at scale.

MILESTONE 1 - MASTERING NEW TECHNOLOGIES AND METHODS

The shift to integrated containers, serverless and PaaS requires new thinking when it comes to operating at scale. There are some “musts” that infrastructure teams should prepare for:

Hybrid, cloud and multi-cloud platform management - ensuring that IT operations teams are ready to manage large hosting environments that straddle existing data centres and cloud data centres. This means establishing an operational wrapper around these environments so that deploying and managing workloads and services across multiple deployment venues is routine. This foundational step goes beyond simply establishing cloud landing zones and should already be (or be about to become) “business as usual” for IT operations functions.
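
To make this more concrete, here is a minimal Terraform sketch of one code base addressing several deployment venues through aliased providers. It assumes the azurerm and aws providers, and the subscription IDs, regions and names are placeholders; it illustrates the pattern rather than prescribing a design.

    # One code base, several deployment venues, declared via provider aliases.
    terraform {
      required_providers {
        azurerm = { source = "hashicorp/azurerm" }
        aws     = { source = "hashicorp/aws" }
      }
    }

    variable "prod_subscription_id"    { type = string }
    variable "nonprod_subscription_id" { type = string }

    provider "azurerm" {
      alias           = "prod"
      subscription_id = var.prod_subscription_id
      features {}
    }

    provider "azurerm" {
      alias           = "nonprod"
      subscription_id = var.nonprod_subscription_id
      features {}
    }

    provider "aws" {
      alias  = "secondary"
      region = "eu-west-1" # placeholder region
    }

    # Each resource states which venue it belongs to.
    resource "azurerm_resource_group" "platform" {
      provider = azurerm.prod
      name     = "rg-platform-prod" # placeholder name
      location = "uksouth"
    }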

Establishing “Infrastructure as Code” excellence. IaaS platforms (even cloud-native IaaS) are relatively static in nature, whereas container and PaaS services are very fluid by comparison and are designed to scale vertically and horizontally. While this scaling capability is often built into the technologies themselves, it’s important for IT operations, IT engineering and site reliability engineering (SRE) teams to master the ability to exercise it elegantly with quality automation. The sheer number of service commissions, de-commissions and routine upgrade events will rise significantly, and infrastructure teams will need to handle as much of this as possible with good automation.
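
As a hedged illustration of codifying that elasticity, the Terraform sketch below enables the cluster autoscaler on an AKS system node pool, assuming an already-configured azurerm provider. The names, VM size and node limits are placeholders, and attribute names can differ between provider versions.

    # Illustrative AKS cluster whose node pool scales between 3 and 9 nodes.
    resource "azurerm_resource_group" "aks" {
      name     = "rg-aks-demo" # placeholder
      location = "uksouth"
    }

    resource "azurerm_kubernetes_cluster" "demo" {
      name                = "aks-demo"
      location            = azurerm_resource_group.aks.location
      resource_group_name = azurerm_resource_group.aks.name
      dns_prefix          = "aksdemo"

      default_node_pool {
        name                = "system"
        vm_size             = "Standard_D4s_v5"
        enable_auto_scaling = true # attribute renamed in newer azurerm releases
        min_count           = 3
        max_count           = 9
      }

      identity {
        type = "SystemAssigned"
      }
    }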

Cloud security, identity and access management. As an industry that has evolved over decades, our modus operandi was generally to build “hard security shells” around our infrastructure platforms, focusing primarily on keeping external threats outside corporate walls. The era of cloud in all its forms has driven us to tackle the security and identity challenge in new (and often improved) ways. Multi-layered security processes, together with a clear imperative to manage identity, authentication, authorisation and secret management far more robustly, are key. Single sign-on and identity and access management tools and processes are increasingly important for businesses to get right.
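
The hedged Terraform sketch below shows one way of treating secrets and access as code: an Azure Key Vault with RBAC authorisation and a role assignment granting a workload identity read access. It assumes an already-configured azurerm provider; the vault name, resource group and object ID are placeholders, and the RBAC flag name varies by provider version.

    # Illustrative secret store plus least-privilege access for one identity.
    data "azurerm_client_config" "current" {}

    variable "workload_identity_object_id" {
      type        = string
      description = "Object ID of the workload's managed identity (placeholder)."
    }

    resource "azurerm_key_vault" "platform" {
      name                      = "kv-platform-demo" # must be globally unique
      location                  = "uksouth"
      resource_group_name       = "rg-security-demo"
      tenant_id                 = data.azurerm_client_config.current.tenant_id
      sku_name                  = "standard"
      enable_rbac_authorization = true
    }

    resource "azurerm_role_assignment" "app_secret_reader" {
      scope                = azurerm_key_vault.platform.id
      role_definition_name = "Key Vault Secrets User"
      principal_id         = var.workload_identity_object_id
    }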

Software-defined networking. Historically we have built physical networks comprising core, distribution, access, WAN and edge layers, all “wired” together with definitions of virtual LANs, subnets, firewalls, gateways, routers, proxies and load balancers. As we move services to cloud and distribute our services, data and networks more widely, increased levels of separation and control will exist between locations, regions, environments, business units, applications and services. This increased granularity in the networking architecture demands improved software-defined network engineering and operations to manage these environments efficiently.
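
As a small, hedged example of the same constructs expressed as code, the Terraform sketch below defines a spoke virtual network, a workload subnet and a network security group that only admits HTTPS from within the virtual network, assuming an already-configured azurerm provider. Names and address ranges are placeholders.

    # Illustrative software-defined network segment on Azure.
    resource "azurerm_virtual_network" "spoke" {
      name                = "vnet-spoke-demo"
      location            = "uksouth"
      resource_group_name = "rg-network-demo"
      address_space       = ["10.10.0.0/16"]
    }

    resource "azurerm_subnet" "workloads" {
      name                 = "snet-workloads"
      resource_group_name  = "rg-network-demo"
      virtual_network_name = azurerm_virtual_network.spoke.name
      address_prefixes     = ["10.10.1.0/24"]
    }

    resource "azurerm_network_security_group" "workloads" {
      name                = "nsg-workloads"
      location            = "uksouth"
      resource_group_name = "rg-network-demo"

      security_rule {
        name                       = "allow-https-from-vnet"
        priority                   = 100
        direction                  = "Inbound"
        access                     = "Allow"
        protocol                   = "Tcp"
        source_port_range          = "*"
        destination_port_range     = "443"
        source_address_prefix      = "VirtualNetwork"
        destination_address_prefix = "*"
      }
    }

    resource "azurerm_subnet_network_security_group_association" "workloads" {
      subnet_id                 = azurerm_subnet.workloads.id
      network_security_group_id = azurerm_network_security_group.workloads.id
    }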

Software-defined storage. Managing storage efficiently is critical to the stability and performance of these new architectures. A key step in the design process is gaining a full appreciation of the resilience, availability, recovery and performance characteristics required by the services that will rely on these cloud-based storage layers. Unlike virtual machines, traditional shared storage platforms (e.g., block and NAS) cannot simply be lifted into cloud in their current form. Cloud-native storage exists in a variety of constructs, from low-end capacity disks to high-performance SSD and NVMe, and all of it is best deployed, scaled and managed in a software-defined way to optimise the cost/performance/availability mix.
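
To illustrate what codifying that cost/performance mix can look like for container storage, the hedged Terraform sketch below (using the kubernetes provider) declares two storage classes backed by different Azure disk SKUs. The class names, SKUs and reclaim policies are example choices, not recommendations.

    # Illustrative storage tiers exposed to application teams as storage classes.
    provider "kubernetes" {
      config_path = "~/.kube/config" # placeholder; point at your own cluster
    }

    resource "kubernetes_storage_class" "app_standard" {
      metadata {
        name = "app-standard"
      }
      storage_provisioner    = "disk.csi.azure.com"
      reclaim_policy         = "Delete"
      allow_volume_expansion = true
      parameters = {
        skuName = "StandardSSD_LRS" # lower cost, moderate performance
      }
    }

    resource "kubernetes_storage_class" "app_premium" {
      metadata {
        name = "app-premium"
      }
      storage_provisioner    = "disk.csi.azure.com"
      reclaim_policy         = "Retain"
      allow_volume_expansion = true
      parameters = {
        skuName = "Premium_LRS" # higher cost, higher performance
      }
    }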

Container technology ecosystem integration (e.g., Kubernetes, Rancher, Mirantis Docker EE, OpenShift) – if these technologies are part of the technology strategy, there is a significant ecosystem of ancillary technologies that make these environments more manageable at scale; examples include Portworx, Calico, Prometheus, Grafana and HashiCorp Vault. These technologies will need to be elegantly integrated, deployed and managed with code.
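
As an example of deploying part of that ecosystem with code, the hedged Terraform sketch below installs the community Prometheus/Grafana monitoring stack through the helm provider. The chart, repository URL and version shown are examples and should be pinned to releases you have tested.

    # Illustrative monitoring stack deployment managed as code.
    provider "helm" {
      kubernetes {
        config_path = "~/.kube/config" # placeholder; point at your own cluster
      }
    }

    resource "helm_release" "monitoring" {
      name             = "kube-prometheus-stack"
      repository       = "https://prometheus-community.github.io/helm-charts"
      chart            = "kube-prometheus-stack"
      namespace        = "monitoring"
      create_namespace = true
      version          = "58.0.0" # placeholder; pin to a tested release
    }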

MILESTONE 2 - ELEGANT AUTOMATION AND INTEGRATION

Elegant automation and integration at all layers will become the hallmark of well-architected, well-engineered, cloud-native platforms that enable businesses to accelerate their digital ambitions. This requires IT teams to take existing automation practices to a new level. Traditional IaaS platforms are comparatively static in nature, and there are real benefits in simply doing an excellent job of automating the deployment of virtual machines in a standard way. Virtual machines (IaaS instances) are typically deployed for weeks, months or years, depending on the role they perform. The platforms and technologies that underpin microservice architectures, such as containers, serverless and cloud-native PaaS, will be deployed for minutes, hours and days. While some of this will happen simply as a result of making the right technology choices and creating good enterprise designs, there are a number of additional considerations for IT teams that need to operate these platforms at scale. These include, but are not limited to:

  • Collecting logs in a variety of forms for real-time and historic analysis becomes more important. There are some situations where the presence of services at certain times is only apparent from log data (e.g., where service components come and go intra-day). A minimal sketch of routing platform logs centrally follows this list.

  • Attributing granular costs to specific cloud service components across business services, cost centres, locations and environments is a critical part of this fluid architecture.

  • Mapping service architecture components to business services within service management platforms and being able to understand if/when services enter a degraded or down state is key. Knowing exactly which architectural components may have contributed to the change in service state matters, as does knowing whether the responsibility sits with you, a partner or the cloud provider.

  • Designing and automating architectures that are self-healing is a great principle to start with.

  • Managing a variety of new routine maintenance activities throughout the service life cycle in a non-disruptive way, so the platforms that underpin business services remain up to date and fully supported by the vendors. Some of these maintenance activities happen automatically and are carried out by the cloud service providers (CSPs) with zero disruption. Others, however, require internal IT teams to dovetail their operational processes with the CSPs’ to handle non-disruptive upgrades and maintenance in a variety of shared-management, shared-responsibility scenarios. Many of these can be handled automatically and gracefully to avoid disruption, but they need to be planned for.
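
To ground the logging point above, the hedged Terraform sketch below routes control-plane logs and metrics from an existing AKS cluster into a Log Analytics workspace, assuming an already-configured azurerm provider. The workspace settings, cluster ID and log categories are placeholders, and the diagnostic-setting block names vary slightly between provider versions.

    # Illustrative central log collection for a fluid container platform.
    variable "aks_cluster_id" {
      type        = string
      description = "Resource ID of an existing AKS cluster (placeholder)."
    }

    resource "azurerm_log_analytics_workspace" "platform" {
      name                = "log-platform-demo"
      location            = "uksouth"
      resource_group_name = "rg-observability-demo"
      sku                 = "PerGB2018"
      retention_in_days   = 90
    }

    resource "azurerm_monitor_diagnostic_setting" "aks" {
      name                       = "diag-aks-demo"
      target_resource_id         = var.aks_cluster_id
      log_analytics_workspace_id = azurerm_log_analytics_workspace.platform.id

      enabled_log {
        category = "kube-audit"
      }

      metric {
        category = "AllMetrics"
      }
    }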

MILESTONE 3 - OPERATING AT SCALE

At a high level, the following are common examples of the key operational processes that exist for teams managing large IaaS estates:

  • Managing one or more hypervisor types (e.g., ESX or Hyper-V) configured into a variety of hosting clusters that IaaS VMs will reside upon.

  • Managing a variety of virtual machine images pre-packaged with a standard software bundle that IT supports. These are typically variants of Windows and Linux images.

  • Managing processes that harden VMs to meet specific security and configuration standards and maintain their compliance state.

  • Managing software deployment engines and software packages that deliver COTS and in-house software bundles along with drivers, firmware and the like to target virtual machines.

  • Managing a service catalogue of standard packages that can be offered and deployed automatically to IaaS VMs.

  • Providing security software management engines that keep security patches, service packs, virus engines and the like up to date.

  • Providing monitoring and alerting solutions that trigger events based on conditions that require attention.

  • Providing enterprise logging and analytics engines for real time and historic log data analysis and event management.

  • Then there are all the core supporting technologies that underpin the above, such as directory services, networking, DNS, routing and firewalls.

Container- and serverless-specific operational considerations

When you add a standard set of PaaS, serverless and container technologies, which are far more fluid, into the mix, you then have operational considerations around the following (examples only):

  • Automated secret management processes.

  • Container-specific storage management processes that automatically adjust storage characteristics to suit different service-level requirements.

  • Container-specific networking technology management processes.

  • Container registry management and governance of the associated images to cover a variety of deployment scenarios.

  • Granular tagging of resources for accurate cost allocation (a minimal tagging sketch follows this list).

  • Dashboard, reporting and other management information processes for capturing services as they come and go, scale up, scale down and scale horizontally.

  • A new capacity management approach, as the very nature of capacity as we traditionally understand it has changed. It is no longer as finite, but it nonetheless needs to be managed, arguably more robustly, because of the direct correlation between cloud capacity consumed and the associated cost profile.

  • Improved code management processes for a growing base of infrastructure code (e.g., Terraform, Ansible, PowerShell).

  • Container management and orchestration processes and associated tooling.

  • Site reliability engineering (SRE) capability for larger businesses with sufficient capacity to reduce (ideally remove) the “toil” from routine technology platform operations, administration, and troubleshooting.

  • DevOps rigour and good standards are required to manage a growing “Infrastructure as Code” library if operational efficiencies are to be realised. Infrastructure operations teams can learn some good lessons from mature development teams and can embrace some of the requisite tools and processes (e.g., source control management, continuous integration and continuous deployment) that these teams have mastered over more than a decade.
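
As a hedged sketch of the tagging point above, the Terraform fragment below applies one common tag set to every resource it creates so that cost reports can be sliced by business service, cost centre and environment. It assumes an already-configured azurerm provider, and the tag keys, values and resource names are illustrative only.

    # Illustrative common tags reused across resources for cost allocation.
    locals {
      common_tags = {
        business_service = "customer-portal"
        cost_centre      = "cc-1234"
        environment      = "production"
        owner            = "platform-team"
      }
    }

    resource "azurerm_resource_group" "portal" {
      name     = "rg-portal-prod"
      location = "uksouth"
      tags     = local.common_tags
    }

    resource "azurerm_container_registry" "portal" {
      name                = "acrportalprod" # must be globally unique
      resource_group_name = azurerm_resource_group.portal.name
      location            = azurerm_resource_group.portal.location
      sku                 = "Premium"
      tags                = local.common_tags
    }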


In conclusion

We are in the throes of a major industry platform shift towards more containers, serverless and PaaS, and therefore less IaaS. It's important to set aside ample time to plan well and to invest in the appropriate areas if you want to navigate this change effectively from technology, operations and cultural perspectives. IT teams will need headroom to get to grips with new technologies and new ways of working. Senior leadership will need to get behind the change cycle and support it financially, and as a fundamental change in architecture and in the way the technology estate is operated. Be fully prepared to adjust the operating model in several ways to accommodate a service and platform landscape that is, by its very nature, quite changeable. Learning how to operate at scale, without any additional risk to service, is a critical architectural and operating-model adjustment that technology teams must make if they are to execute a smooth transition successfully and efficiently over time.