Deconstructing Azure Red Hat OpenShift Deployment

A technical blog that discusses moving the deployment of Azure Red Hat OpenShift from imperative to declarative code.

Code

You can find the code referenced in this blog over on our public GitHub repository here.

Introduction

We love a challenge here at BlakYaks.

If you're not familiar with ARO, it's essentially a PaaS release of the Red Hat OpenShift platform that has been tuned to run on Azure; the Red Hat equivalent of AKS if you like (more details here if you want them). ARO administration has been integrated into the Azure CLI application since initial release, and remains Microsoft's preferred approach for deployment of the platform to Azure. Now we're big fans of the Azure CLI, but it's not very declarative, and certainly wouldn't be classified as a pure IaC tool.

We wanted to bring the ARO deployment into our Terraform codebase as a managed resource. Currently, the azurerm provider doesn't support ARO, though there is some work in progress to change this in this fork of the provider. That really only leaves us with the option of wrapping an ARM template deployment in our Terraform code.

Preparing the Subscription

Microsoft do a great job of describing the prerequisite tasks for using ARO here, so I won't cover them again.

One thing worth noting, however, is that upcoming changes in the ARO API (in the 2021-09-01-preview release) also allow us to use host-level encryption features. These features aren't enabled in our subscriptions by default, so it's worth registering them now if you're likely to take advantage of them later (we register the feature here, but keep the option disabled in the ARM template defaults).

Run the following az commands to enable the feature for your current subscription:

az feature register --namespace "Microsoft.Compute" --name "EncryptionAtHost"
az feature list --query "[?contains(name, 'Microsoft.Compute/EncryptionAtHost')].{Name:name,State:properties.state}" # Wait for the State to show as 'Registered'
az provider register -n Microsoft.Compute --wait

ARO uses a fair amount of compute out of the box; even a minimal cluster needs an additional 36 DSv3 cores, so check what you have available and increase your quota accordingly if you think this might be an issue. For example, we're deploying into a lab subscription in the UK South region, so we can check our current usage first:

az vm list-usage --location "UK South" -o table --query "[?contains(localName, 'Total Regional vCPUs') || contains(localName, 'Standard DSv3 Family vCPUs')]"

CurrentValue    Limit    LocalName
--------------  -------  --------------------------
0               20       Total Regional vCPUs
0               20       Standard DSv3 Family vCPUs

In this case we would need to increase our limits to allow the ARO deployment to complete, so we'll need to raise a quota request with Microsoft and wait for it to be actioned; these are normally completed within a few minutes.

Unwrapping the az command

Microsoft don't (directly) provide the details for an ARM-based deployment of ARO, and for good reason; if you were to run the deployment template exported from the ARO cluster resource object, you would have a failing deployment on your hands. The reason for the failure is that the az aro create command is also creating a dedicated service principal and making a few choice RBAC updates to your environment prior to spinning up the cluster.

During execution of the az aro create command, we can observe the ARM template that is generated to deploy the top-level cluster object. Although multiple sub-deployments are triggered to create the underlying infrastructure, we can only control the top-level deployment - again, similar to how we would interact with AKS. The ARM schema documentation provides any other information we may need, allowing us to build a fully functional ARM-based deployment template.
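If you want to poke at the generated template yourself, one option (after an az aro create run, and with placeholder names) is to list and export the deployment from the cluster resource group:

az deployment group list -g <cluster-resource-group> -o table
az deployment group export -g <cluster-resource-group> -n <deployment-name>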

In short, if we want to replicate the behaviour of the az command in our IaC, we need to take care of the extras before we attempt to create the ARO cluster via the ARM template. Let's look at the specific actions we need to complete.

Step 1: Create the virtual networking

ARO expects an existing virtual network to connect to, and it should adhere to the published standards. From an RBAC perspective, it's better to provide a dedicated VNET rather than a shared one, since the ARO service principals are going to need to update it on the fly.

For simplicity, in our example we'll create a single VNET with two subnets - one subnet for the master nodes, the other for the workers.
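A minimal sketch of what that looks like in Terraform, assuming the azurerm 2.x provider (names and address ranges are illustrative; the repository code may differ):

resource "azurerm_virtual_network" "aro" {
  name                = "vnet-aro"
  resource_group_name = azurerm_resource_group.vnet.name
  location            = azurerm_resource_group.vnet.location
  address_space       = ["10.0.0.0/22"]
}

resource "azurerm_subnet" "master" {
  name                 = "snet-aro-master"
  resource_group_name  = azurerm_resource_group.vnet.name
  virtual_network_name = azurerm_virtual_network.aro.name
  address_prefixes     = ["10.0.0.0/23"]

  # Per the published ARO networking requirements, both subnets carry the
  # Container Registry service endpoint, and the master subnet needs private
  # link service network policies disabled (note the azurerm 2.x flag below
  # is inverted: setting it to true disables the policies).
  service_endpoints                             = ["Microsoft.ContainerRegistry"]
  enforce_private_link_service_network_policies = true
}

resource "azurerm_subnet" "worker" {
  name                 = "snet-aro-worker"
  resource_group_name  = azurerm_resource_group.vnet.name
  virtual_network_name = azurerm_virtual_network.aro.name
  address_prefixes     = ["10.0.2.0/23"]
  service_endpoints    = ["Microsoft.ContainerRegistry"]
}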

Step 2: Create the ARO service principal

There are actually two ARO service principals in play: one is created at tenant level (Azure Red Hat OpenShift RP) when the Microsoft.RedHatOpenShift resource provider is first registered, and the other is dedicated to our ARO cluster.

It's worth pointing out that whilst running az aro delete will remove your ARO cluster, it does in fact leave the service principal behind in your tenant if you let the initial az aro create command create it for you. Moving to an IaC deployment model keeps the identity lifecycle coupled to the cluster and avoids security headaches later on.
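A sketch of how that looks with the azuread provider (assuming azuread 2.x; the application display name is illustrative). We create the dedicated cluster principal ourselves, and simply look up the tenant-level RP principal:

# Tenant-level principal, created when Microsoft.RedHatOpenShift is registered
data "azuread_service_principal" "aro_rp" {
  display_name = "Azure Red Hat OpenShift RP"
}

# Dedicated cluster service principal, owned by our Terraform code
resource "azuread_application" "aro" {
  display_name = "sp-aro-cluster"
}

resource "azuread_service_principal" "aro" {
  application_id = azuread_application.aro.application_id
}

resource "azuread_service_principal_password" "aro" {
  service_principal_id = azuread_service_principal.aro.id
}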

Step 3: Create the RBAC assignments

Once we have our service principal created and identified, we can make the changes we need to our Azure IAM. Since the majority of interactions take place within the managed ARO resource group, the service principals need very little outside of it:

Service Principal           Scope            Permission
--------------------------  ---------------  -------------------
Azure Red Hat OpenShift RP  Virtual Network  Network Contributor
ARO Cluster SP              Virtual Network  Network Contributor

We'll assign these permissions prior to commissioning the ARO cluster.
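As a sketch, building on the resources above (the example code wires this up for you):

# Network Contributor on the ARO VNET for the cluster SP and the RP SP
resource "azurerm_role_assignment" "aro_cluster_sp" {
  scope                = azurerm_virtual_network.aro.id
  role_definition_name = "Network Contributor"
  principal_id         = azuread_service_principal.aro.object_id
}

resource "azurerm_role_assignment" "aro_rp" {
  scope                = azurerm_virtual_network.aro.id
  role_definition_name = "Network Contributor"
  principal_id         = data.azuread_service_principal.aro_rp.object_id
}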

Deploying ARO with Terraform

Our Terraform code example is split between our root module and a single child module that wraps the ARO cluster deployment template.

In our code, we'll take care of all of the prerequisites and, finally, the ARO deployment itself:

  • Create our ARO service principal, and reference our tenant-level OpenShift resource provider identity
  • Create our ARO and VNET resource groups
  • Create our VNET and Master/Worker subnets
  • Apply the necessary RBAC permissions to the ARO VNET
  • Apply the ARO deployment via the sub module

In our ARO module, we pass parameters through to our ARM template deployment and let Terraform manage the deployment for us. When it's done, we use the az aro command suite to pull data from the newly built cluster and pass this back as outputs to the calling module.
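An illustrative call from the root module; only cluster_name, aro_resource_group_name and public_cluster appear elsewhere in this post, so treat the remaining variable names as placeholders for whatever the child module actually expects:

module "aro" {
  source = "./modules/aro"

  cluster_name            = "aro-lab"
  aro_resource_group_name = azurerm_resource_group.aro.name
  public_cluster          = true

  # Identity and networking created in the earlier steps
  aro_client_id     = azuread_application.aro.application_id
  aro_client_secret = azuread_service_principal_password.aro.value
  master_subnet_id  = azurerm_subnet.master.id
  worker_subnet_id  = azurerm_subnet.worker.id
}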

We've provided a set of defaults that will get you up and running quickly, so installation is as simple as running:

az login
az account set --subscription <subscription_id or name>

terraform init && terraform apply

We're using an empty azurerm provider block, so logging in with the Azure CLI is enough to authenticate. Once the apply is running, the ARO resource takes around 35 minutes to deploy, so grab yourself a coffee.

Let's look at some of the interesting parts in more detail.

Public and Private Clusters

ARO clusters can be exposed to the internet or not, depending on your requirements; for a lab cluster we're probably fine with a public cluster, in production, not so much. We're simplifying our deployment model in this example by exposing a single public_cluster (bool) variable on the ARO module.

If we select private, we'll assume that both the API and Ingress endpoints are to be kept private; with a public cluster we'll expose both. Since the ARM template supports splitting the two endpoints, you could easily change this behaviour by updating the logic in the deployment resource, as highlighted below.

resource "azurerm_resource_group_template_deployment" "aro_cluster" {
    ...
    api_visibility      = { value = var.public_cluster ? "Public" : "Private" }
    ingress_visibility  = { value = var.public_cluster ? "Public" : "Private" }
    ...
}

If you selected a private cluster, you're going to need to ensure you can resolve the API and Ingress hostnames to the correct IP addresses; in a full deployment we'd set up Private DNS zones and integrate them with the deployment to take care of that. You can read more on the next steps in the ARO documentation if you're interested in seeing how this all works.

Terraform External Data

The Terraform external data source provides a simple method for gathering data outside of Terraform's direct control. It works by parsing the JSON written to stdout by an external program and exposing it as a data source to other resources in the Terraform graph. It's really only to be used as a last resort, but in our use case it works well, since az command output is natively returned in JSON format.

Following deployment of the cluster, we'll use multiple external data sources to parse return data from az; this information can then be passed as output from the module to downstream code or processes.

data "external" "aro_api_details" {
  program    = ["az", "aro", "show", "-n", var.cluster_name, "-g", var.aro_resource_group_name, "--query", "apiserverProfile", "-o", "json"]
}
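The result map can then be surfaced as module outputs; for example (output names are illustrative), apiserverProfile returns ip, url and visibility fields:

output "api_url" {
  value = data.external.aro_api_details.result.url
}

output "api_ip" {
  value = data.external.aro_api_details.result.ip
}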

I'm going to state the obvious here, but make sure the az command-line tool is installed on your workstation before running the Terraform code!

External Data Leakage

There is one fairly major drawback to using external data sources from a security perspective: any data returned by the data source is stored in state and displayed uncensored in the plan output. For example, when we run a plan against an existing deployment, the aro_credentials secret is exposed for all to see:

# module.aro.data.external.aro_credentials will be read during apply
  # (config refers to values not yet known)
 <= data "external" "aro_credentials"  {
      ~ id      = "-" -> (known after apply)
      ~ result  = {
          - "kubeadminPassword" = "my-password-in-plain-sight"
          - "kubeadminUsername" = "kubeadmin"
        } -> (known after apply)
        # (1 unchanged attribute hidden)
    }

Currently there isn't a way around this natively in Terraform (any chance of adding a sensitive attribute to the provider, HashiCorp?), but there are third-party tools that can help, such as CloudPosse's tfmask, which does a good job of obfuscating the output if you're worried about secrets leaking into your CI/CD pipeline logs. In our example, we first configure tfmask to match our sensitive output:

export TFMASK_VALUES_REGEX="(?i)^.*[^a-zA-Z](kubeadminPassword).*$"

And then re-run our plan via tfmask:

terraform plan | tfmask

We can now ensure that any nasty credential leaks are removed from our plan output:

# module.aro.data.external.aro_credentials will be read during apply
  # (config refers to values not yet known)
 <= data "external" "aro_credentials"  {
      ~ id      = "-" -> (known after apply)
      ~ result  = {
          - "kubeadminPassword" = "***********************"
          - "kubeadminUsername" = "kubeadmin"
        } -> (known after apply)
        # (1 unchanged attribute hidden)
    }

One-Shot ARM Template Deployment

Deploying ARO is not a pleasant experience from an idempotency perspective; we've tried, and it doesn't end well if you deploy over the top of an existing cluster. Once a native provider is available you'd hope this behaviour will be addressed, but for now, deploying ARO is a one-time task.

When using the azurerm_resource_group_template_deployment resource we see deltas in state between executions due to the way the template_content and parameters_content arguments are handled, even when the configuration hasn't changed. We tell Terraform to ignore these changes with a lifecycle block:

lifecycle {
  ignore_changes = [parameters_content, template_content]
}

Final Thoughts

Deploying ARM templates in Terraform isn't an ideal solution and should be used sparingly and only when no native provider option exists; we lose a degree of control and observability over the infrastructure deployment using this approach. In the ARO use case, this option is the lesser of two evils given the command line alternative.

Deployment of ARO is, however, relatively simple; provided the prerequisites are met you should have a pain-free experience. Microsoft and Red Hat have done a fairly decent job of hiding the complexity, and the clusters I've worked with seem stable and fully functional.

Having said that, here are a few gotchas to watch out for:

Release Cadence

There is considerable lag between an OCP release and the corresponding ARO release that you should be aware of; if you run OCP on-premises, don't expect to be able to upgrade ARO with the same frequency. Based on past cycles, there can be anything up to four months between an upstream OCP release and the corresponding ARO update hitting GA. Microsoft provide more information in their support lifecycle documentation.

Influence of Tagging Policies

Currently, when the managed resource group is created, it does not inherit the resource tags that are placed on the ARO cluster object itself. The managed resource group is protected by a deny assignment and can only be modified by the ARO service principal, so you're not going to be able to add the tags after the ARO deployment either.

Your ARO cluster deployment may fail if an Azure Policy assignment on your subscription prevents the creation of resource groups without mandatory tags. If your deployment fails, check the Activity log on your ARO cluster resource group for clues; the error messages returned by the deployment itself are, at best, cryptic.
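Something along these lines will surface recent failures (the resource group name is a placeholder):

az monitor activity-log list -g <aro-cluster-resource-group> --status Failed --offset 1h -o table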

Missing RBAC

We've already mentioned it, but you must ensure that the service principals have the correct rights on the VNET before you deploy ARO. The example code ensures that this takes place, but if you're connecting to another existing VNET, double check the IAM first.
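A quick way to sanity-check the VNET IAM before deploying (names are placeholders):

VNET_ID=$(az network vnet show -g <vnet-resource-group> -n <vnet-name> --query id -o tsv)
az role assignment list --scope "$VNET_ID" --query "[].{Principal:principalName, Role:roleDefinitionName}" -o table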

——

Find out more about the work we do here. 

If you would like to discuss how BlakYaks can support your organisation's Azure transformation journey, please get in touch or book an introductory meeting.



Craig Hurt

Cloud and DevOps Lead
