I've been using CDK for about two years across several projects. Most were greenfield rather than a transition from a prior implementation. I had a brief test of Pulumi, but not an extended run. I used CDK largely with TypeScript. When talking about CDK there are several main variations:

  • cdk8s: CDK for Kubernetes manifests.
  • CDKTF: CDK for Terraform.
  • AWS CDK: CDK for CloudFormation.

Background

I come from a data center background, starting out with Puppet, SaltStack, and Chef. I pivoted in my second year to more of a software focus. Since then I've bounced between Software, DevOps, and SRE. I am against templating and uncompiled infrastructure; I strongly prefer using a programming language for defining infrastructure.

CDK is a means to define cloud infrastructure in several languages, and even to write modules that can be shared across those languages.

Before CDK I had used Fabric8 for defining Kubernetes deployments via JVM languages, largely wrapping JVM microservice deployments in the same language the services were written in.

I had even gone so far as to write my own layer like CDK. The ultimate goal was seeing your infrastructure as a dependency of your software service; I was aiming for infrastructure as a type you could extend. The example below would automatically generate the relevant code for building an S3 bucket.

object CustomerData : InfraAWSS3Bucket {
  override val s3BucketName = "CompanyNameCustomerData"
}

Culture

I have gotten into many wars akin to emacs vs vim in regards to templates vs programming. I hadn't worked with HCL much prior to using CDK. After my experience with legacy configuration management and CI/CD systems, I felt any configuration-language-derived infra was eventually going to become tech debt. Conversely, I was told a number of times that programming creates harder-to-read infrastructure.

This can be highlighted by the fact that the ops teams I had been on had a motto of W.E.T., write everything twice, while the software focus was D.R.Y., do not repeat yourself. To some, copy and pasting the same HCL block, or using a primitive for loop, was a reasonable option.

The case I was given for HCL is that it is akin to a living document, where your Terraform files should reflect the desired state of your infrastructure. CDK is meant to provide building blocks to standardize your infra, using loops and iteration to reduce repetitive tasks.

At a surface level this looping creates harder-to-grok infrastructure, because you need to understand what the loop is doing.

This culture is at the core of the CDK vs HCL debate, and it aligns with your internal company culture. Do you expect Software to interact more with infra, or is it strictly DevOps?

I was working on a project where I was initially hired to implement a greenfield cloud architecture. I was hired to go forward with CDK, and initially slotted under the Software organization. As the software side grew into a larger team, a Director of DevOps was hired. The new director's first decree was no CDK: everything should be template driven. They came from a DevOps templating background and saw CDK as an anti-pattern. We were largely relegated to pager duty after that.

But we saw a drastic shift in engagement from the software side on the change to HCL. With CDK in TypeScript matching the core frontend and backend stack, engineers were regularly making pull requests to our CDK repo to add cloud features, and we would guide them on best practices and how to implement what they wanted. On the change to HCL, the pull requests to our infrastructure repository dropped overnight. Teams were now dependent on a smaller DevOps team, and feature deliverables slipped while waiting for new resources.

I've seen this a number of times: defining infrastructure with more of a software approach garners more interaction from the Software side of the house, while the configuration language approach creates a divide where software contributes less to the DevOps platform.

This is where your culture comes in.

  • Who's setting up new cloud resources?
    • Is it DevOps, or should software be encouraged to make changes?
  • Template driven vs code driven.

Design & Implementation

After my second project I settled on a common pattern for all of my CDK projects. It's abstraction atop abstraction, but it led to an easier way to grok the current state of an environment.

File Tree

├── config
│   ├── Development.ts
│   ├── index.ts
│   ├── Production.ts
│   ├── Stage.ts
│   └── teams
│       └── CustomerExperience.ts
└── src
    ├── components
    │   └── aws
    │       └── s3.ts
    └── index.ts

Components

src is equivalent to a standard library for the terraform / cloud components. Components were defined per cloud resource, i.e. S3, RDS, etc. All components followed an aliased function type and returned the same response interface.

export type TerraformOutputs = {
  [key: string]: string | number | TerraformOutputs | string[] | number[];
};

export type TerraformImportResource = {
  resourceType: string;
  resourceId: string;
  path: string;
};

export interface ITerraformComponentResponse {
  outputs: TerraformOutputs;
  imports: TerraformImportResource[];
}

type CDKComponent<C> = (
  // The CDK stack to add this component to.
  stack: Construct,
  // Core environment configuration, noted below.
  config: IEnvironment,
  // A generic extra configuration data set.
  extraConfig: C
) => ITerraformComponentResponse;

The components returned a list of outputs, which would be written to SSM, kubernetes secrets, or another platform; a list of imports for any resources that needed to be imported, which was used to generate an import script; and the component information, which was used to create an IAM policy to access the resource. Teams and members were automatically managed in the configuration.
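
To make that shape concrete, here is a minimal sketch of what a component could look like under this interface, assuming the prebuilt @cdktf/provider-aws package; IBucketConfig, the naming scheme, and the env field on IEnvironment are illustrative, not the original code.

// CDKComponent, IEnvironment, and the response types come from the stdlib shown above.
import { Construct } from "constructs";
import { S3Bucket } from "@cdktf/provider-aws/lib/s3-bucket";

interface IBucketConfig {
  name: string;
  team: string;
}

export const s3Component: CDKComponent<IBucketConfig> = (
  stack,
  config,
  extraConfig
) => {
  // Prefix with the environment so stage and production buckets never collide.
  const bucket = new S3Bucket(stack, `bucket-${extraConfig.name}`, {
    bucket: `${config.env}-${extraConfig.name}`,
    tags: { team: extraConfig.team, env: config.env },
  });

  return {
    // Written to SSM, kubernetes secrets, or another platform by the caller.
    outputs: { bucketName: bucket.bucket, bucketArn: bucket.arn },
    // Pre-existing buckets would be listed here to generate an import script.
    imports: [],
  };
};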

Configuration

Config is the meat and potatoes of interaction. All cloud resources are defined as simple TypeScript objects for a team to use, and the types were thoroughly documented. With intellisense and docstrings it was easy to navigate how to implement a feature. There were two levels of configuration: teams and root.

The root level was equivalent to a target environment to deploy to. This was, in essence, your core compute where the application lives, and what compute resources are needed by the application. As an example:

const Stage: IEnvironment = {
  env: "stage",
  domain: "eks.stage.customer.domain",
  cluster: {
    version: KubernetesVersion.V1_24,
    amiVersion: "1.24.10-20230217",
    nodes: [
      {
        name: "infra",
        subnetType: "public",
        azs: 2,
        minimumNodes: 2,
        maximumNodes: 4,
        desiredNodes: 2,
        instanceType: "t3a.medium",
        disk: 100,
        labels: {
          network: "public",
        },
        taint: [
          {
            key: "restricted",
            value: "devops_only",
            effect: "NO_SCHEDULE",
          },
        ],
      },
      {
        name: "app",
        minimumNodes: 1,
        maximumNodes: 4,
        desiredNodes: 2,
        instanceType: "t3a.xlarge",
        disk: 100,
        subnetType: "private",
        azs: 3,
        labels: {
          network: "private",
        },
        namedSubnets: "sandbox-eksv2-private-2a",
      },
    ],
  },
};

This is a stub from an EKS cluster approach I did; it deploys an EKS cluster. A couple of key points:

  • version & amiVersion were union typed so only valid entries could be used. We further separated these by having a PreProd union and a Production union, so only AMIs vetted in preprod would compile in a Production configuration. This vetting was manual. (A rough sketch of this union typing follows the list.)
  • env was a union type, shared in a common npm data layer. Most of our connection strings and environment variables were strongly typed through a shared package.
  • nodes
    • The name became a kubernetes nodeType label, allowing for easy isolation of pods to a certain node type.
    • subnetType was strongly typed and allowed for isolation by VPC. In another approach we allowed security groups to be passed here.
    • instanceType was strongly typed, also split by PreProd or Production, ensuring that we used valid instance types for the environment.
    • minimumNodes / maximumNodes / desiredNodes configured the autoscaler for that node group.
    • namedSubnets allowed us to pin the node group to specific subnets, which was useful when we peered with a vendor into a certain data set.
    • azs scaled the node group across availability zones.
      • Our infra node pools, which ran ingress, prometheus, etc., only needed a single failover to a secondary availability zone.
      • Our applications required higher scalability across three availability zones.
    • taint allowed us to set taints on the node group; the above ensured only devops-sanctioned pods ran on the infra nodes.
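
As a rough illustration of that union typing (the type names and values below are made up for the example, not our vetted lists):

// Illustrative only: the real unions were longer and partly generated.
export type PreProdInstanceType = "t3a.medium" | "t3a.xlarge";
export type ProductionInstanceType = "m6a.xlarge" | "m6a.2xlarge";

export type PreProdAmiVersion = "1.24.10-20230217" | "1.24.11-20230301";
export type ProductionAmiVersion = "1.24.10-20230217";

// The node group shape from the Stage stub, parameterized by the vetted
// instance type union for the target environment tier.
export interface INodeGroup<I extends string> {
  name: string;
  subnetType: "public" | "private";
  azs: number;
  minimumNodes: number;
  maximumNodes: number;
  desiredNodes: number;
  instanceType: I;
  disk: number;
  labels?: Record<string, string>;
  taint?: { key: string; value: string; effect: "NO_SCHEDULE" | "NO_EXECUTE" }[];
  namedSubnets?: string;
}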

Once this core foundation was laid out, it was very easy to replicate clusters, add new environments, or test changes. With a matrix CI/CD job, you can run a configuration like Stage across multiple regions.
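
As a sketch of how that can look from the CDK side, the entrypoint can take the region from the matrix job; the AWS_REGION convention, the ../config import, and the stack naming here are assumptions, not the original pipeline.

import { App, TerraformStack } from "cdktf";
import { Construct } from "constructs";
import { AwsProvider } from "@cdktf/provider-aws/lib/provider";
import { Stage } from "../config";

class EnvironmentStack extends TerraformStack {
  constructor(scope: Construct, id: string, region: string) {
    super(scope, id);
    new AwsProvider(this, "aws", { region });
    // Cluster, node group, and team components are built here from Stage.
  }
}

const app = new App();
// The matrix CI/CD job sets AWS_REGION once per region in the matrix.
const region = process.env.AWS_REGION ?? "us-east-1";
new EnvironmentStack(app, `${Stage.env}-${region}`, region);
app.synth();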

Teams defined the resources a team needed.

export const CustomerExperience : ITeamConfig = {
  name: "customer-experience",
  eksRole: "reader",
  members: [john, jane, jim],
  resources: {
    dbs: [
      {
        name: "sales_leads",
        diskSize: 10,
        engine: Engines.POSTGRES_14_3,
        public: false,
        user: "sales",
        instanceType: "db.t3.small",
        team: "customer-experience",
      },
    ],
    aurora: {
      serverless: [
        {
          name: "orders",
          team: "customer-experience",
          user: "orders",
          scalingConfig: {
            minimum: 4,
            maximum: 16,
          },
        }
      ],
      cluster: [
        {
          name: "users",
          user: "customer_users",
          team: "customer-experience",
          writerInstanceType: "db.x2g.xlarge",
          readerInstanceType: "db.r6g.large",
          readerScaleMaximum: 3,
          readerScaleMinimum: 0,
          readers: 1,
          sandboxInstanceType: "db.t3.large"
        },
      ],
    },
    buckets: [
      {
        name: "customer-invoices",
        public: false,
        encrypted: true,
        versioned: false,
        team: "customer-experience",
      },
    ],
  },
  repositories: [],
};

  • name is the team name, a union type that maps to all available teams in the company. This union type fed IAM groups, EKS namespaces, and a lot of other areas.
  • members was a list of team members defined in another configuration file. That file defined GitHub ids (creating GitHub teams), Slack ids (maintaining team Slack and alert channels), PagerDuty groups, etc.
  • eksRole was a union of admin|reader. Clusters were treated as immutable, with changes only made via pull requests.
  • resources mapped to cloud resources.
    • dbs were RDS instances.
      • instanceType was a union of types; using ts-poet and the AWS CLI this union type was automatically generated (a rough sketch of that generation follows this list).
      • engine, similar to above, was generated from the AWS CLI.
      • public determined whether it was accessible outside of the private VPC / VPN.
    • aurora allowed for an Aurora cluster, serverless or provisioned.
    • buckets were for S3.
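
As a rough sketch of that generation step (the original used ts-poet; to keep this example self-contained it just writes a string, and the output path is hypothetical):

import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Ask the AWS CLI for the available postgres engine versions.
const raw = execSync(
  "aws rds describe-db-engine-versions --engine postgres " +
    "--query 'DBEngineVersions[].EngineVersion' --output json"
).toString();

const versions: string[] = JSON.parse(raw);

// Emit a union type so invalid engine versions fail at compile time.
const union = versions.map((v) => `  | "${v}"`).join("\n");
writeFileSync(
  "src/generated/postgres-engine-versions.ts",
  `// Generated file, do not edit by hand.\nexport type PostgresEngineVersion =\n${union};\n`
);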

This was a wide array of options; pretty much any vetted resource was available under a team's ITeamConfig. Each of these resources had a corresponding IAM policy, so each resource was added to the respective team group, broken out by environment. Each member under members had a list of environments they could access, ensuring they were automatically granted the appropriate environment resources. I found this enabled developers: a quick PR allowed them to spin up a new test cluster, and they felt empowered to quickly test new features and spin up new cloud resources. We used Infracost to gate pull requests and ensure that a developer change didn't incur a high bill.
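
For example, a component's output could feed straight into a team-scoped policy. The helper below is a sketch with made-up names, assuming the prebuilt @cdktf/provider-aws package; attaching the policy to the team's IAM group would be a separate resource.

import { Construct } from "constructs";
import { IamPolicy } from "@cdktf/provider-aws/lib/iam-policy";

// Defines a policy granting one team access to one bucket produced by a component.
function bucketTeamPolicy(stack: Construct, team: string, bucketArn: string) {
  return new IamPolicy(stack, `${team}-bucket-policy`, {
    name: `${team}-bucket-access`,
    policy: JSON.stringify({
      Version: "2012-10-17",
      Statement: [
        {
          Effect: "Allow",
          Action: ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
          Resource: [bucketArn, `${bucketArn}/*`],
        },
      ],
    }),
  });
}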

A tech radar was maintained which showed the status of each cloud component, so a team could easily see whether a component was in testing or verified as a stable choice. As an example, a team was researching database options; we had listed peered Mongo and RDS as stable, with Aurora clusters in beta and Aurora Serverless in alpha. The team had a largely dormant service with occasional spikes once or twice a month, so they opted for serverless. We worked with product to allow flow-over on the sprint for bugs found in our wrapper, and the team then helped move Aurora Serverless to beta.

What defined stable?

  • Stable was guaranteed to have:
    • Automatically generated dashboards & alarms.
    • Vetted IAM permissions
    • Runbooks
    • Example code on using a generated resource, and accompanying documentation.
    • Best practices determined by other users.
  • Beta:
    • IAM permissions.
    • Tested deployment.
  • Alpha:
    • Guaranteed to deploy.

Ballooning Costs

With the above team resources there is a concern of exorbitant costs, where a team member may make an expensive infrastructure change. We used Infracost to automatically gate pull requests: if the amount was above a certain threshold, it required either DevOps or manager approval to move forward; if the spend was below the threshold and they were a CODEOWNER, they could merge.
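
A sketch of that gate, assuming infracost diff --format json has already run in CI and that its JSON output exposes a diffTotalMonthlyCost field; the threshold and file name are made up.

import { readFileSync } from "node:fs";

const THRESHOLD_USD = 200; // hypothetical monthly delta requiring extra approval

const report = JSON.parse(readFileSync("infracost.json", "utf8"));
const delta = Number(report.diffTotalMonthlyCost ?? 0);

if (delta > THRESHOLD_USD) {
  console.error(
    `Monthly cost increase of $${delta.toFixed(2)} exceeds $${THRESHOLD_USD}; ` +
      "DevOps or manager approval is required to merge."
  );
  process.exit(1);
}

console.log(`Monthly cost increase of $${delta.toFixed(2)} is within the threshold.`);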

From the team config we mapped team members to GitHub teams, so as members were added they automatically got mapped to GitHub. This allowed us to be largely hands off and non-blocking for developers who needed new resources.
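
One possible shape for that mapping, sketched here with Octokit; it could equally be done with the Terraform GitHub provider through CDKTF, and the githubId field, import path, and org name are assumptions.

import { Octokit } from "@octokit/rest";
import { CustomerExperience } from "../config/teams/CustomerExperience";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Ensure every member in the team config is a member of the matching GitHub team.
async function syncTeam(org: string) {
  for (const member of CustomerExperience.members) {
    await octokit.rest.teams.addOrUpdateMembershipForUserInOrg({
      org,
      team_slug: CustomerExperience.name,
      username: member.githubId,
      role: "member",
    });
  }
}

syncTeam("company-org").catch((err) => {
  console.error(err);
  process.exit(1);
});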

Build & Synth Time

When running cdktf there are several approaches; you can do everything through npm.

Applying Changes

# This deploys the terraform cdktf stack.
-> cdktf apply

But I found this was a bit slower at the time of initial implementation, and more difficult to debug. Using the regular terraform commands was an easier approach.

{
  "scripts": {
    "terraform:synth": "./node_modules/.bin/cdktf synth"
  }
}

# Synthesize, then drive plan and apply with plain terraform.
npm run terraform:synth
cd cdktf.out/stacks/$StackName
terraform init; terraform plan
terraform apply

Synth Time

One of the biggest bottlenecks I had was synth time. On one project with several terraform providers, the synthed and compiled TypeScript output was about 200MB. This was an expensive task to run and re-run. I found it best to layer node modules on top of each other, with several core repositories:

  • cdktf-providers: an npm module that pulls in all of your organization's providers.
    • This compiles the provider bindings into a TypeScript module you can reuse.
  • cdktf-stdlib: an npm module that implements the components noted above.
    • Depends on cdktf-providers.
  • cdktf-$stack: your individual stack.
    • Depends on cdktf-stdlib and cdktf-providers.

When you go to deploy a piece of infra, it then only has to compile and build a small subset of code. The $stack module should only define the configuration for the components and call them in a terraform stack.

When building and synthesizing, the TypeScript compilation was one big time sink. The other was synthesizing large sets of stacks.

Smaller Stacks

With the above, it's best to build single-intent stacks. Utilize cross-stack references or data lookups, but keep a stack to a small intent. Rather than one core stack that provides VPC, EKS, and ECR, they can be split out into several smaller stacks.
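
A rough sketch of that split, using a cross-stack reference; the class names are illustrative and providers are omitted for brevity.

import { App, TerraformStack } from "cdktf";
import { Construct } from "constructs";
import { Vpc } from "@cdktf/provider-aws/lib/vpc";

class NetworkStack extends TerraformStack {
  public readonly vpc: Vpc;
  constructor(scope: Construct, id: string) {
    super(scope, id);
    // AwsProvider omitted for brevity.
    this.vpc = new Vpc(this, "vpc", { cidrBlock: "10.0.0.0/16" });
  }
}

class ClusterStack extends TerraformStack {
  constructor(scope: Construct, id: string, vpc: Vpc) {
    super(scope, id);
    // Referencing vpc.id from another stack makes cdktf generate the
    // cross-stack outputs and remote state lookups; EKS resources would go here.
  }
}

const app = new App();
const network = new NetworkStack(app, "network");
new ClusterStack(app, "cluster", network.vpc);
app.synth();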

Closing

Would I still use CDK? Yes, I can't see going back to a configuration-based operations approach. I've taken this further since CDK, focusing on NixOS, where my operating system is also a compiled set of code.

There are challenges, and the biggest one is culture fit. Moving to CDK shifts your hiring practices, as well as how you interact with the engineering team.

Despite this, I feel that after the initial uptake it can drastically improve developer experience and ease developing cloud resources.