There are several principles for designing Infrastructure as Code (IaC) that DevOps and Cloud Engineering teams should follow from day 0, or consider when refactoring an existing setup. IaC is not just well-structured code in HCL files and a sequence of commands (init, plan, apply, etc.); it also requires a broad design and, like any other system, must have a scalable and extensible architecture.
This blog post distills my experience with Terraform since I started working with it in 2016. Below, thirteen principles and best practices of IaC design are explained briefly, although each one would need a separate post or series of posts to cover in detail.
- GitOps
That’s a “Must-Have”! Transparency between developers for every change through pull requests, releases marked with Git tags, and a pipeline for automated planning and applying via GitOps are essential. Several of the principles below depend on Git, so it’s important to incorporate GitOps into all aspects of IaC design.
- DRY
Don’t Repeat Yourself! In an IaC codebase, there is often shared or common code across modules, outputs, environments, stacks, regions, etc. As the setup grows, that code tends to be repeated, making maintenance increasingly difficult over time. To simplify your IaC structure, consider pairing Terraform or OpenTofu with a thin wrapper like Terragrunt, or include/call reusable code from parent directories in Pulumi. This approach keeps the codebase simpler, shorter, and easier to maintain. So, avoid designing with pure Terraform or OpenTofu.
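As a minimal sketch of the Terragrunt approach, the backend settings can be written once in a root file and inherited by every stack. The file name root.hcl, the bucket name, the region, and the stack path are assumptions made for illustration only.

```hcl
# root.hcl (hypothetical file at the repository root)
# The backend is defined once; every child stack inherits it.
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket  = "example-tf-state" # assumed bucket name
    key     = "${path_relative_to_include()}/terraform.tfstate" # one state file per stack directory
    region  = "eu-central-1" # assumed region
    encrypt = true
  }
}

# prod/network/terragrunt.hcl (hypothetical child stack)
# Two lines are enough to pull in everything defined in root.hcl.
include "root" {
  path = find_in_parent_folders("root.hcl")
}
```

Each new stack then needs only the include block and its own inputs, instead of a copied backend configuration.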
- Isolation
Assign a separate repository to each team. Isolating each team’s resources reduces pull request traffic and simplifies managing change requests. Besides that, issues with the plan/apply pipeline or a broken codebase in one repository won’t affect other teams. Maintaining the CODEOWNERS file and assigning reviewers also becomes much easier.
- Directory Structure
Organizing resource templates simplifies your daily tasks. Much of the day-to-day work with IaC is managing existing code by adding, updating, or deleting resources. It’s important to find them easily, especially when dealing with a large environment. So, it’s more convenient to follow a hierarchy and group template files by provider, environment, stack, or any other structure that meets your needs.
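One possible hierarchy, sketched with Terragrunt, is shown below. The directory names, the env.hcl file, and the values in it are hypothetical; the point is only that a leaf stack can read settings from the level above it.

```hcl
# Hypothetical layout: <provider>/<environment>/<stack>/<resource>
#
# live/
#   root.hcl
#   aws/
#     prod/
#       env.hcl
#       payments/
#         s3/terragrunt.hcl
#         rds/terragrunt.hcl

# live/aws/prod/env.hcl: values shared by every stack in this environment
locals {
  environment = "prod"
  aws_region  = "eu-central-1"
}

# live/aws/prod/payments/s3/terragrunt.hcl: a leaf picks up its environment's values
locals {
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))
}

inputs = {
  environment = local.env.locals.environment
}
```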
- Modularization
Instead of repeating yourself by writing raw Terraform “resource” blocks, create modules based on developers’ use cases, call them with Terragrunt, and reuse them across multiple stacks and environments. This way, a change to a module is reflected in every resource that uses it across stacks and environments, making it easier to enforce DevOps standards and policies consistently through the modules.
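A rough illustration of this pattern follows. The module name s3-bucket, the repository URL, the tag, and the bucket name are all made up for the example; the public access block stands in for whatever standard you want every bucket to follow.

```hcl
# s3-bucket/main.tf (hypothetical module in the modules repository)
variable "bucket_name" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# The standard is enforced once, inside the module.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# prod/payments/invoices/terragrunt.hcl (hypothetical stack calling the module)
terraform {
  source = "git::https://example.com/platform/terraform-modules.git//s3-bucket?ref=v1.0.0"
}

inputs = {
  bucket_name = "payments-invoices-prod"
}
```

Developers only supply inputs; the resource blocks and the standards attached to them stay in one place.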
- Versioning
It’s important to keep the modules in a separate Git repository, apart from the teams’ repositories. This allows you to manage module versioning using Git tags and protects existing resources in the teams’ repositories from new or breaking changes. Resources within stacks can then be migrated to the newer version when they are ready or once they are compatible with the updated parameters.
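As a sketch, two stacks can pin the same module at different Git tags. The repository URL, module path, and tag names below are hypothetical.

```hcl
# staging/payments/invoices/terragrunt.hcl: already migrated to the new major version
terraform {
  source = "git::https://example.com/platform/terraform-modules.git//s3-bucket?ref=v2.0.0"
}

# prod/payments/invoices/terragrunt.hcl: stays on the previous tag until it is ready
terraform {
  source = "git::https://example.com/platform/terraform-modules.git//s3-bucket?ref=v1.4.2"
}
```

Upgrading production is then a one-line pull request that changes the ref, which keeps the migration visible and reviewable.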
- Remote State
Keeping state files in remote storage such as S3 is a common practice. Storing them remotely lets you enable object versioning on the bucket, so you can revert to a previous version if a state file becomes corrupted. This approach is also essential when your entire IaC configuration is managed in Git and you rely on pipelines to plan and apply changes. Remote storage is also much safer than keeping state files locally, providing better security and reliability.
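In plain Terraform or OpenTofu, a minimal version of this setup might look like the following. The bucket name, key, and region are placeholders, and the versioning resource is assumed to live in a separate bootstrap configuration that owns the state bucket.

```hcl
# Backend configuration for one stack (all values are assumptions)
terraform {
  backend "s3" {
    bucket  = "example-tf-state"
    key     = "prod/network/terraform.tfstate"
    region  = "eu-central-1"
    encrypt = true
  }
}

# In the separate bootstrap configuration that manages the state bucket:
# versioning keeps earlier copies of each state file for recovery.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "example-tf-state"

  versioning_configuration {
    status = "Enabled"
  }
}
```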
- Encrypted Secrets
DO NOT store credentials in plain text within templates. Secrets should always be stored in a secure service like AWS Secrets Manager (or similar services from other providers) or HashiCorp Vault. These secrets can then be fetched using “data source” blocks in modules and passed as input variables in your IaC repositories and stacks. This keeps secrets out of the code and out of Git history. Note, however, that values read through data sources are still written to the state file in plain text, so the state backend must be encrypted and access to it tightly restricted. Additionally, this approach supports secret rotation and can be integrated into Kubernetes deployments, enabling secrets to be shared securely between your IaC and containerized workloads.
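A small sketch of the data source approach with AWS Secrets Manager is shown below. The secret name, its JSON shape, and the database settings are assumptions; the secret itself is expected to be created and rotated outside this code.

```hcl
# Hypothetical secret name
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/payments/db"
}

locals {
  # Assumes the secret is stored as JSON with "username" and "password" keys
  db = jsondecode(data.aws_secretsmanager_secret_version.db.secret_string)
}

resource "aws_db_instance" "payments" {
  identifier          = "payments-prod"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = local.db.username
  password            = local.db.password # the value still ends up in state; keep the backend encrypted
  skip_final_snapshot = true
}
```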
- Auto-Tagging
An auto-tagging mechanism is highly beneficial, allowing each resource managed by IaC to be tagged based on the directory structure and hierarchy. Tags can include the Git repository name, environment, stack, resource type, resource name, and an additional tag like “IaC = true”. These tags make it easy to see where the IaC code for a resource resides, where the resource originated, and whether it was created by IaC at all. Moreover, this tagging strategy helps the FinOps team track the costs of specific environments, teams, or stacks using the provider’s cost tooling (for example, AWS Cost Explorer) or third-party tools like Looker.
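One way to sketch this is with Terragrunt and the AWS provider's default_tags block. The example assumes stacks live at environment/stack/resource relative to a hypothetical root.hcl, and the repository name and region are placeholders.

```hcl
# root.hcl (hypothetical). Assumes stacks live at <environment>/<stack>/<resource>/
locals {
  path_parts = split("/", path_relative_to_include())

  tags = {
    IaC         = "true"
    Repository  = "example-org/team-payments-iac" # assumed repository name
    Environment = local.path_parts[0]
    Stack       = local.path_parts[1]
    Path        = path_relative_to_include()
  }
}

# Generates a provider file in every stack, so all taggable AWS resources
# receive the tags without any per-resource code.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = ${jsonencode(local.tags)}
  }
}
EOF
}
```

Because the tags are derived from the path, moving or adding a stack updates them without anyone having to remember a tagging convention.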
- Import-as-Code
The import block was recently introduced by OpenTofu and Terraform. At first glance, it may seem a bit static, but when combined with the locals block in modules and a map of strings as an input variable, it can significantly simplify the process for DevOps teams. This approach enables an import-as-code mechanism, allowing resources to be imported by simply defining the corresponding templates. As a result, there’s no need for the tedious use of terraform import commands, which makes it much easier to import manually created resources, especially when providing DevOps services to developers in a self-service model.
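The idea could be sketched as follows for S3 buckets. The variable names are hypothetical, the example relies on for_each inside import blocks (available from Terraform and OpenTofu 1.7), and it assumes the module is executed as the root module, for instance through Terragrunt, because import blocks are only allowed there.

```hcl
variable "buckets" {
  description = "Buckets to manage: logical name => bucket name"
  type        = map(string)
}

variable "import_ids" {
  description = "Existing buckets to adopt: logical name => bucket name to import"
  type        = map(string)
  default     = {}
}

locals {
  # Only import entries that are also declared as managed buckets
  imports = {
    for name, id in var.import_ids : name => id
    if contains(keys(var.buckets), name)
  }
}

import {
  for_each = local.imports
  to       = aws_s3_bucket.this[each.key]
  id       = each.value
}

resource "aws_s3_bucket" "this" {
  for_each = var.buckets
  bucket   = each.value
}
```

A developer who created a bucket by hand would add it to both maps in the stack's inputs, and the next plan shows the import instead of a create.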
- Policy-as-Code
With Policy-as-Code, DevOps teams can more easily define and enforce policies, standards, and best practices across development teams. Policies such as forbidding security groups that open all protocols and ports to the internet, preventing IAM full-access policies, or restricting the creation of expensive EC2 instance types, among others, can be controlled and managed using tools like Open Policy Agent and HashiCorp Sentinel. These tools can be integrated at the module level, providing significant benefits in terms of security enforcement, performance optimization, and cost control.
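Open Policy Agent policies are written in Rego and Sentinel has its own language, so neither is shown here. As a lighter, Terraform-native complement (not a replacement for those tools), simple guardrails can also be sketched at the module level with variable validation. The variable names and the allow-list below are assumptions.

```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type requested by the developer"

  validation {
    # Hypothetical allow-list; adjust it to your own cost policy
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type is not on the approved list (t3.micro, t3.small, t3.medium)."
  }
}

variable "ingress_cidr_blocks" {
  type        = list(string)
  description = "CIDR ranges allowed to reach the service"

  validation {
    condition     = !contains(var.ingress_cidr_blocks, "0.0.0.0/0")
    error_message = "Security group ingress must not be open to 0.0.0.0/0."
  }
}
```

Validation like this fails at plan time with a readable message, while broader, cross-resource rules are better left to a dedicated policy engine.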
- Pipeline
After implementing all principles, especially the Git-related ones, a pipeline becomes essential to achieve full automation. The process starts when you make a change to your IaC templates, create a pull request, and request a code review from a teammate. As in modern software development workflows using CI/CD, the pipeline should then roll out your infrastructure changes to the target environment. This can be accomplished using various solutions, such as an open-source tool like Atlantis, an existing CI/CD tool like Jenkins, or cloud-based platforms like Terraform Cloud or Pulumi Cloud. The pipeline listens for events from the Git repositories via webhooks, triggers a plan, and waits for your approval before applying the changes.
- Documentation
Last, but not least, it’s crucial to document every aspect of your IaC for the benefit of other teammates in DevOps. As the design becomes complex and requires periodic maintenance, clear documentation will be essential. Additionally, providing detailed documentation for modules, explaining how to onboard new resources into IaC, and outlining the required and optional input parameters and variables is invaluable. This is especially helpful when your DevOps workflows are designed for a self-service model, enabling developers to easily understand and use the infrastructure.
Conclusion:
This summarizes everything I’ve learned about designing an IaC system over the years. Each point mentioned in this blog post deserves a dedicated article for a more detailed explanation, and I plan to write about them in the future.
As new features are introduced by HashiCorp, OpenTofu, Pulumi, and others, IaC may become even more complex. As a result, IaC is evolving into a specialized field within DevOps, making it crucial to continuously learn and stay up to date with market trends.