
Test Engineering with Terraform

Terraform has significantly evolved its testing strategies over time. Initially, testing Terraform code involved time-consuming manual checks or third-party tools, often leading to inefficiencies and errors. Recognising the need for better testing methodologies, HashiCorp have now introduced a native framework that streamlines the testing process by allowing test assertions within the Terraform codebase itself, using native HCL. This shift towards more integrated and less cumbersome testing is particularly useful within automated build pipelines and can be employed in both root and child module use cases.

In this post we'll take a look at the full suite of testing features within Terraform today and explain where they can be used in your code lifecycle.

Terraform Testing Components

Terraform includes several components that work together to ensure the reliability and efficiency of your infrastructure code. Whilst the native test framework sits outside of the runtime workflow, a number of other features provide assurance prior to and during deployment of the code itself.

Test Type  | Phase       | Description
Validation | Plan        | Static assessment of input variables to ensure that they meet the module's criteria.
Conditions | Plan, Apply | Pre/post conditions applied to resources perform inline checks to validate that inputs and outputs are configured as expected.
Checks     | Plan, Apply | Non-blocking assertions of conditions, evaluated at the end of the plan and apply phases.
Tests      | N/A         | Executed outside of the plan/apply workflow to perform unit or integration testing of their associated modules.

Test Authoring

We write tests in an identical way, regardless of whether we use them as conditions, validations or assert tests:

  • Tests must always specify a condition and an error_message
  • The condition is an evaluation of one or more tests, which must return either true for a pass, or false for a failure
  • The error_message is a string that will be returned to the user, explaining why the test has failed.

Writing condition code can become challenging depending on the type of data being tested. For example, the validation block below needs to check a variable defined as a map(object) for a specific sku property. To do this, we must loop through the values and wrap the result in an alltrue function:

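A sketch of how this might look (the variable name and permitted sku values here are illustrative rather than the repo's actual code):

```hcl
variable "storage_accounts" {
  type = map(object({
    name = string
    sku  = string
  }))

  validation {
    # Loop over every object in the map and check each sku; alltrue()
    # collapses the resulting list of booleans into a single result.
    condition = alltrue([
      for account in values(var.storage_accounts) :
      contains(["Standard_LRS", "Standard_GRS"], account.sku)
    ])
    error_message = "Each storage account must use a supported sku."
  }
}
```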

There are some projects underway that look to simplify test case production via use of provider-based functions. One such project is the assert provider which wraps a number of test functions that are commonly used in testing.

This provider is in the process of being moved into the HashiCorp namespace so may not be available via the link above at the time you are reading this.

Test Cases

Let's look at some examples of how and why you would use these features in your code. We'll walk through a basic Terraform module demonstrating each type of test as we go.

The code examples shown below are available on our public GitHub repository. Should you wish to follow the examples yourself, you'll need a Linux terminal with Terraform installed - and that's it!

Let's start with the simplest test, input variable validation.

Input Variable Validation

Input variable validation allows us to check that the values provided to our Terraform code are within acceptable limits and fit for purpose. Validation blocks are appended to variable definitions and are evaluated early in the deployment at plan time. Variables can include one or more validation blocks; all validation blocks must return a true condition value to pass the test.

In this example, we want to check that the string provided for the file_name input variable is at least 5 characters long:

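A minimal version of this validation (reconstructed from the description; the repo version may differ slightly):

```hcl
variable "file_name" {
  type = string

  validation {
    condition     = length(var.file_name) >= 5
    error_message = "file_name must be at least 5 characters long."
  }
}
```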

If we run our code, Terraform will fail during the plan if the value does not pass the validation.


It is also possible (since Terraform v1.9) to reference other variables within a validation. To expand the example above, if we had another variable called directory, we could also validate that it has a specific format:

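A sketch of such a cross-variable validation (the exact format rule is an assumption):

```hcl
variable "directory" {
  type = string

  validation {
    # Referencing another variable (var.file_name) inside a validation
    # is supported from Terraform v1.9.
    condition     = can(regex("^/", var.directory)) && var.directory != var.file_name
    error_message = "directory must be an absolute path and must not match file_name."
  }
}
```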

Use Cases

In real-world scenarios, variable validation is widely used where underlying providers expect resources to follow specific conventions; for example, where one parameter must be set to a specific value based on another property of the resource.

At BlakYaks we use a test-driven approach to validation, writing tests for each validation as they are added to the variables. Not only does this ensure good test coverage in our codebase, but it also allows us to test the functionality of the validations themselves during the development cycle. We'll look at this in more detail during the terraform test section.

Checks

Checks were introduced as a precursor to the native test framework as part of the Terraform v1.5 release. A check block performs inline testing and will run at the end of both plan and apply phases. Checks differ from the other inline tests in that they do not influence the success or failure of the deployment; if they fail then warnings will be produced but the overall status of the deployment will not be affected.

Check blocks contain one or more assert blocks; these also form the basis of many native tests and return binary results of a condition, in the same way a validation would. In our example, we'll check that an environment variable USER_VAR has been provided, and trigger a check failure if not:

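A sketch of how this could be written, assuming an external data source is used to surface the environment variable (the shell one-liner and names are illustrative; the repo code may differ):

```hcl
# Returns the value of USER_VAR (or an empty string) as a JSON object.
data "external" "env" {
  program = ["sh", "-c", "printf '{\"USER_VAR\": \"%s\"}' \"$${USER_VAR:-}\""]
}

check "user_var_set" {
  assert {
    condition     = data.external.env.result.USER_VAR != ""
    error_message = "The USER_VAR environment variable has not been set."
  }
}
```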

By convention, we normally write all checks into a separate file, checks.tf.

If we run the code but fail to set the expected environment variable, we will see a warning in the plan output. Note that if the code did not trap this error elsewhere (we'll get to that shortly), it would continue to deploy regardless of the check failure.


Apply-Time Checks

Since checks run at both plan and apply time, there may be cases where the input data for the check is not known at plan time; for example, the check may validate dynamic output from another resource that is created during the apply phase.

We can simulate this behaviour by adding another check such as that shown below:

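One way to express such a check, using a data source scoped to the check block (a reconstruction; the repo code may differ):

```hcl
check "output_file_exists" {
  # A data source scoped to the check block: if the read fails (for
  # example, the file does not yet exist at plan time), the check
  # produces a warning rather than an error.
  data "local_file" "output" {
    filename = local_file.output_file.filename
  }

  assert {
    condition     = data.local_file.output.content == local_file.output_file.content
    error_message = "The output file content does not match what was deployed."
  }
}
```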

Since we are checking the properties of a file created by the local_file.output_file resource, the file will not exist at plan time, so we'll see a check warning when we run a plan (even if we resolve the earlier check by setting USER_VAR in the environment).


If we run terraform apply, the check will fail on the initial plan but pass once the deployment has completed, demonstrating that checks are evaluated twice as part of the deployment process.

Use Cases

Given that checks do not impact the operation of the deployment, they should not be relied on for critical validation of your code. For any mission-critical deployment-time checks, we would recommend use of pre and post condition blocks over checks since they are evaluated during deployment and impact both plan and apply success directly. Checks are of most use within Terraform Cloud (HCP) environments and have specific use cases there including drift detection and continuous validation.

It should also be noted that check blocks directly impact test validation and will cause test failures if they fail. In most cases this is expected; however, the apply-time check example above would fail during plan-only test runs and would therefore need to be added to the test's expect_failures list to avoid this. An example of this is provided in the demo repo.

Pre and Post Conditions

Conditional resource checks were one of the earliest test enhancements, released back in Terraform v1.2. These checks are associated with specific resources (specifically, resources, data objects and outputs) as part of their lifecycle blocks, and can operate in one of two modes:

  • precondition checks take place BEFORE the resource has been evaluated
  • postcondition checks are completed AFTER the resource has been evaluated

In both cases, tests are evaluated as early as possible in the deployment lifecycle; both types of check are evaluated during the plan phase, but where they refer to resource details that are unknown until apply, evaluation is deferred to the apply phase. You can have multiple pre and post conditions on a single resource, depending on your requirements.

Conditional checks can have many complex configurations and are therefore best explained with an example.

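A sketch of the data source with its postcondition (building on the earlier example; the shell one-liner remains illustrative):

```hcl
data "external" "env" {
  program = ["sh", "-c", "printf '{\"USER_VAR\": \"%s\"}' \"$${USER_VAR:-}\""]

  lifecycle {
    # Halt the run immediately if USER_VAR is missing, rather than
    # letting bad data flow into downstream resources.
    postcondition {
      condition     = self.result.USER_VAR != ""
      error_message = "The USER_VAR environment variable must be set."
    }
  }
}
```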

In the snippet above, we are using an external data source to evaluate the USER_VAR environment variable. Previously, we had configured a check block to perform this function; however, even if it failed, the code would continue and subsequently fail later on. By using a postcondition block, we ensure that our code fails at the point of evaluating the data rather than passing bad data downstream to other resources. Let's run the code without the environment variable set and check the output.


Due to the postcondition failure, the plan is halted and no further action is taken - which is exactly the behaviour we wanted. Note that we would also receive this error during a terraform plan, since the data object is evaluated as part of the plan phase.

Let's add a precondition to our output variable:

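A sketch of the output with its two preconditions (the output name file_details and the exact expressions are assumptions):

```hcl
output "file_details" {
  value = local_file.output_file.filename

  # Preconditions sit directly inside the output block; outputs do not
  # use lifecycle blocks and cannot have postconditions.
  precondition {
    condition     = data.external.env.result.USER_VAR == "1"
    error_message = "USER_VAR must be set to 1."
  }

  precondition {
    # content_sha1 is only known once the file has actually been created.
    condition     = local_file.output_file.content_sha1 != ""
    error_message = "The output file has not been created."
  }
}
```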

We're testing for two conditions before we allow the output to be processed:

  • The USER_VAR result must be equal to 1
  • The output file must exist (via a dynamic check of the file SHA1 hash)

Note that we don't use lifecycle blocks in output variables, nor do outputs support postcondition blocks, since they are essentially the last thing that Terraform processes (and therefore have no dependents). We can set the USER_VAR environment variable to 2, causing an apply failure.


We can also demonstrate an important difference between checks and conditions with regard to apply-time testing. If we set USER_VAR to the expected value and re-run as a plan, you will notice that the second (SHA) precondition is not evaluated.


Conditions that cannot be evaluated during plan do not display warnings; they are simply deferred and evaluated during the apply phase. To prove this, let's temporarily break the second condition by reversing the logic (note the ! in front of the condition):

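The temporarily broken precondition might look like this:

```hcl
precondition {
  # Deliberately inverted with ! so that the condition fails at apply time.
  condition     = !(local_file.output_file.content_sha1 != "")
  error_message = "The output file has not been created."
}
```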

If we apply this time, we get an error before the output is evaluated.


The file is created by the local_file.output_file resource, but the output fails due to our faulty precondition statement. Let's fix that by removing the ! and confirming that our code is working.


Use Cases

As demonstrated above, conditional blocks are excellent for preventing bad output data (via use of precondition blocks directly on the outputs) and we use them extensively within child modules for that purpose. In most cases, providers will fail when a resource is not created correctly, but where we use external APIs that return raw, unvalidated results (such as the external data source in our demo), precondition checks on downstream resources are useful for validation.

Another approach we can take is to use precondition blocks within terraform_data resources as internal validation gates. In this case, shown below, the terraform_data.validation resource is made a dependency of other resources, blocking at plan stage if the template_file location cannot be found. Whilst the precondition rule could have been added directly to the downstream resource, the advantage of this approach is that multiple downstream resources can be linked to the same set of condition rules without repetition.

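A sketch of the pattern (the variable and resource names are assumptions based on the prose):

```hcl
resource "terraform_data" "validation" {
  lifecycle {
    precondition {
      condition     = fileexists(var.template_file)
      error_message = "The template_file location cannot be found."
    }
  }
}

resource "local_file" "output_file" {
  filename = var.file_name
  content  = templatefile(var.template_file, {})

  # Any number of downstream resources can share the same validation
  # gate by depending on terraform_data.validation.
  depends_on = [terraform_data.validation]
}
```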

Generally, we use precondition blocks to ensure that input data (either from a variable or another data object or resource) is valid, and postcondition blocks to check that resources have been created as expected with supported configuration. There is some overlap with input validation here, so be careful not to use conflicting checks or duplicate existing ones. Simplicity is key, particularly with larger code bases where a high number of conditional resources can become difficult to maintain and troubleshoot.

Terraform Test

The last, and possibly most useful, tool in the Terraform test suite is terraform test, a native test framework that allows a number of test cases to be executed against our HCL code. Introduced in Terraform v1.6, the native testing capability has been incrementally updated since then, gaining mock providers and resources in v1.7, and continues to be developed as a first-class component of both OSS and HCP versions of Terraform.

Prior to the release of the native test framework, we would need to write tests outside of Terraform using scripts or other third-party tools, such as Terratest. The HashiCorp BSL license change (v1.5.6) had an impact on a number of these external tools, so the native framework was well-timed to replace them (which is exactly what we did internally at BlakYaks).

The native tests are written in HCL and (normally) sit directly alongside the Terraform code in a tests directory. Using this convention-based approach, we can simply run terraform test from our root module directory, and any tests located in this directory will be executed.


Terraform uses an in-memory state for all tests within a single test file, which is not accessible to the end user. State is shared between all run blocks that reference the same module source and has a lifecycle paired to the test file (Terraform tears it down automatically at the end of the test run). The HashiCorp documentation has more details on how this works.

Let's look in more detail at tests, and how they are structured.

Test Syntax

All tests are defined in HCL files within the tests directory of the module to be tested, and each file should use the extension *.tftest.hcl so that Terraform automatically detects them as tests. Each test is defined within a run block, and there can be multiple run blocks per test file. Each run block should be uniquely named within the file and is the equivalent of a single test case.

The Terraform documentation does a good job of describing the test format, so we won't go over that here. Let's look at our example code to demonstrate how the tests work in practice.

The validation.tftest.hcl test file from our example is shown below.

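A reconstruction of the file based on the description below (run names are illustrative):

```hcl
run "invalid_file_name" {
  command = plan

  variables {
    file_name = "fail"
  }

  # Both the variable validation and the data source postcondition are
  # expected to fail in this run.
  expect_failures = [
    var.file_name,
    data.external.env,
  ]
}

run "valid_file_name" {
  command = plan

  variables {
    file_name = "success"
  }

  expect_failures = [
    data.external.env,
  ]
}
```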

Here we have two tests that check our input validation on the var.file_name variable. Our validation on the variable should fail if we provide a value that is less than 5 characters in length, so we are running two checks, one that should fail and one that should pass.

  • Note the command = plan parameter on both tests. By default, run blocks will attempt to apply the code, but in this case we only want to validate the input variables, so we tell Terraform to run a plan only.
  • The variables block defines the variables passed to each test run. In the top example we pass fail (4 characters) because we want the validation to fail; in the lower example we pass success (7 characters) and expect it to pass. If we need to pass additional variables, we can add them to the variables block (most production cases will require multiple input variables).
  • The expect_failures list parameter is used in both cases, for different reasons. In the failing example we list both var.file_name and data.external.env, since we expect both of those elements to trigger a failure: var.file_name will trigger a validation failure because we are passing an invalid value, and data.external.env will fail because we are not setting the USER_VAR environment variable - which is why data.external.env also appears in the second test instance.

In the example code we have removed the data.external.env expected failure. To ensure that the tests pass, you must provide the USER_VAR variable when running the test, e.g. USER_VAR=1 terraform test.

Let's now look at a more complete example, shown in the module.tftest.hcl file. In this test, we will run a plan, followed by an apply, and finally we will check the output variables to validate results are as we expected. The format of the file is similar to the first example, with multiple run blocks that are executed sequentially, top-to-bottom.

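A reconstruction of the shape of this file (the file_details output name and the assertion details are assumptions):

```hcl
variables {
  file_name = "global_file"
}

run "plan" {
  command = plan

  # Overrides the global variables block for this run only.
  variables {
    file_name = "plan_file"
  }
}

run "apply" {
  command = apply

  assert {
    condition     = local_file.output_file.content != ""
    error_message = "The output file should contain content."
  }
}

run "output" {
  # Runs against an (essentially) empty configuration: ./tests holds no
  # Terraform files other than the tests themselves.
  module {
    source = "./tests"
  }

  assert {
    condition     = run.apply.file_details != ""
    error_message = "Expected a file_details output from the apply run."
  }
}
```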
  • In addition to the run blocks, we have defined a global variables block that is used by each run block, unless overridden (as shown with the plan instance)
  • As part of the apply instance, we have defined assert blocks that check the state for expected values. Any failures will fail the test, and subsequent run blocks would be skipped
  • The output instance targets what is essentially an empty Terraform directory (the ./tests directory does not contain any Terraform files other than tests). The assert block within this section checks the output value from the previous run.apply block and demonstrates how outputs can be used between test cases.

We can run this test in isolation as shown below:

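For example, using the test command's -filter option (the file path assumes the conventions above):

```
USER_VAR=1 terraform test -filter=tests/module.tftest.hcl
```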

The code is deployed, tested, and then destroyed by Terraform as part of the test run. In our example this is quick since we are only creating a local file; more realistic tests will usually require cloud infrastructure to be created and destroyed, which takes considerably longer.

There is always the chance that tests may fail in such a way that infrastructure is not entirely cleaned up by the test process. In production scenarios this is usually catered for by targeting dedicated test environments (i.e. accounts or subscriptions) so that maintenance tasks can clean up anything that failed.

Note that we must provide the environment variable to our test; if it is missed, our postcondition will throw an error.


As a workaround for this type of problem, and a shortcut to long deployment and test times, let's introduce mock resources to our tests.

Mocks

Mocks have formed a part of software testing frameworks for some time. Put simply, mocks enable tests to be completed against simulated datasets or environments as an alternative to spinning up real supporting resources for test cases. In the Terraform use case, we interact with providers; these provide data and resources (usually via cloud service provider APIs) that we manipulate to form our IaC deployments.

We can use mock_provider instances to simulate the providers used by our Terraform code within our test files; these providers will be used when the code is executed rather than real providers. However, when starting out with mock providers and resources, we must be aware of how Terraform handles return data when a provider is mocked. The HashiCorp documentation describes this in detail, but essentially we have to remember that mock providers have no knowledge of the underlying "real" APIs, and will therefore always return data in the same way, based on the type of data being produced. For example, any string type data will always be a random 8-character string when it is returned from a mock provider instance, regardless of what that data relates to.

Generic return data can sometimes cause issues when provider code expects a specific format to be passed between resources. One example (that we often come across!) is the azurerm provider, which expects resource IDs to be correctly formed, with segmented data relating to the ARM resource ID. To avoid issues when running this mock provider, we add mock_resource blocks that override the normal random string produced by the mock engine:

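For example (the mock ID value is illustrative):

```hcl
mock_provider "azurerm" {
  mock_resource "azurerm_virtual_network" {
    defaults = {
      # A well-formed (but fake) ARM ID that satisfies the provider's
      # resource ID parsing.
      id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mock-rg/providers/Microsoft.Network/virtualNetworks/mock-vnet"
    }
  }
}
```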

In this example, whenever we create an azurerm_virtual_network resource using the mock provider, the id property will be set to the delimited string we have provided. Although not a real resource ID, it is convincing enough to pass the string formatting tests embedded into the azurerm provider codebase and will allow the code to be deployed in our tests.

If we look at our simple code module, we can make use of mock providers for our tests by adding the following to our test file (the example code is part of the mock_module.tftest.hcl file):

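A sketch of the mock provider blocks (reconstructed from the bullets that follow):

```hcl
# No overrides are needed for the local provider; mocked local_file
# resources are never written to disk.
mock_provider "local" {}

mock_provider "external" {
  mock_data "external" {
    # Override the data source's return value so the code behaves as if
    # USER_VAR=1 were present in the environment.
    defaults = {
      result = {
        USER_VAR = "1"
      }
    }
  }
}
```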
  • The local provider override allows us to test file creation without creating a file on our disk. We don't need to amend the output from the provider, so simply defining the mock_provider block is enough in this case.
  • By using a mock external provider we can override the returned result property so that it appears our environment contains the expected variable of USER_VAR=1. This kind of override is more typical of mocks used in software development and allows us to simplify our test process.

If we run the mock tests, we can now do so without specifying our environment variable, proving that the override is working.


To demonstrate the difference between a mock test and a real one, we can run the two module tests in verbose mode and compare the output (adding the -verbose switch shows the plan and apply output in more detail).

The real provider will return proper file hashes and permissions for the file within the apply output.


The mock provider, however, will show random strings for any values that have not been explicitly provided to it.


Use Cases

Native testing should form a key part of your Terraform SDLC, just as testing does in mainstream software development today. Whilst we may never reach 100% coverage, we should try to embed tests for key code features and write tests that specifically capture functionality within our modules. Testing (especially with mock providers) is a useful tool during development, allowing us to check our work as we add new features or run regression tests when updating existing code bases.

At BlakYaks, we use native tests in two main ways today:

  • As part of our child module development lifecycle, module tests are executed by our CI pipelines, ensuring that any pull requests must pass testing before they are considered for merge. Mock providers are used exclusively for this testing which improves code portability (tests can be executed anywhere) and speeds up the CI process.
  • As part of root module deployment pipelines, tests are executed within the deployment pipeline to smoke test the deployment and to ensure that the configuration and code are valid before deployment. In most cases, we will make use of mock providers for this activity since they are faster to run and do not require dedicated test environments to execute against.

Future Development

Much of the testing framework in Terraform is relatively new and will be subject to updates and improvements over the coming releases. The mock provider addition in v1.7 remains in beta at the time of writing (although it is entirely functional), and we expect it will be developed further as users come on board and request new features.

There are also some limitations in the test framework that we'd expect to be addressed:

  • There is experimental support for exporting test results to JUnit format in the alpha releases of v1.10. This will be of interest to those using the test framework in automation pipelines, since it allows ingestion of test data into build pipeline results (such as those in Azure DevOps).
  • There is no support for idempotency testing within the current framework which would be a welcome addition, particularly when developing child modules.

Summary

When used together, the Terraform test components form a comprehensive testing strategy, ensuring that your Terraform code is not only functional but also adheres to best practices and remains robust throughout its lifecycle. The native test framework really raises Terraform's game when it comes to a code-first infrastructure solution, and we're excited to see where it goes over the next few releases.

Reach out to us here at BlakYaks to see how we can help you on your journey to IaC excellence.