SaFi Bank Space : Terraform Agents

General information:

We use Terraform Cloud (TFC) agents to provision resources in our environments (brave, stage, etc.) and in our CICD workspace.

We use Terraform agents because they run directly in GCP and can modify resources that TFC itself cannot reach. More information on why we chose Terraform agents can be found in Terraform Agents - PoC.

Deployment:

Each agent is deployed as a VM running in a GCP project. In brave, for example, the VMs run in the shared VPC project (safi-env-brave-sharedvpc).

Deployment steps: first, decide whether the environment needs agents and how many. This is set per environment through tfc_agents in our shared_variables.tf:

    brave = {
      domain_name = "smallog.tech"
      features = {
        ably                = true,
        confluent           = true,
        confluent-mp        = false,
        gke_release_channel = "REGULAR"
        environment_type    = "dev"
      }
      tfc_agents = 2
    }

Once this change is made, the dispatcher creates the agent pool (tfe_agent_pool) and agent token (tfe_agent_token) and distributes the token to the necessary workspaces as a variable.

tfc_agents.tf:

resource "tfe_agent_pool" "this" {
  for_each = toset(concat(keys(local.safi_environments), ["cicd"]))

  name         = format("%s-%s-tfe-agent-pool", local.prefix, each.key)
  organization = var.tfe_organization
}

resource "tfe_agent_token" "this" {
  for_each = toset(concat(keys(local.safi_environments), ["cicd"]))

  agent_pool_id = tfe_agent_pool.this[each.key].id
  description   = format("%s-%s-tfe-agent-token", local.prefix, each.key)
}
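
Creating the pool and token does not by itself route runs to the agents; a workspace also has to use agent execution mode. How the dispatcher does this is not shown on this page, so the following is only a minimal sketch. The resource name and workspace name are hypothetical, and it assumes a tfe provider version in which tfe_workspace accepts execution_mode and agent_pool_id (newer provider versions expose the same settings through tfe_workspace_settings).

```hcl
# Hypothetical workspace; not taken from our dispatcher code.
resource "tfe_workspace" "example" {
  name         = format("%s-%s-example", local.prefix, "brave")
  organization = var.tfe_organization

  # Run plans and applies on the agents of the brave pool
  # instead of on TFC's own workers.
  execution_mode = "agent"
  agent_pool_id  = tfe_agent_pool.this["brave"].id
}
```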

20_environments_variables.tf:

        key         = "tfc_agent_token"
        value       = tfe_agent_token.this[each.key].token
        category    = "terraform"
        sensitive   = true
        description = local.managed_by_terraform
    }],
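
On the receiving side, the workspace needs a variable with the same name so that the token is available as var.tfc_agent_token. The declaration is not shown on this page; a minimal sketch, assuming a plain sensitive string variable (the description text is an assumption):

```hcl
variable "tfc_agent_token" {
  type        = string
  sensitive   = true
  description = "Token the tfc-agent container uses to register with the agent pool."
}
```

Marking the variable as sensitive only redacts the value in plan and apply output; it is still stored in the Terraform state.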

These variables are then used to provision the necessary VMs.

tfc_agents.tf:

locals {
  tfc_agent_machine_type = {
    dev   = "n1-standard-2"
    brave = "n1-standard-2"
    stage = "n1-standard-2"
    prod  = "n1-standard-2"
  }
}

# SA
# -----------------------------------------------
resource "google_service_account" "tfc-agent-sa" {
  provider = google.sharedvpc

  account_id   = "${local.prefix}-${var.env_name}-tfc-agent-sa"
  display_name = "${local.prefix}-${var.env_name}-tfc-agent-sa"
}

# Subnetwork networkUser in sharedVPC
resource "google_compute_subnetwork_iam_member" "tfc-agent-private-default-network-user" {
  provider = google-beta.sharedvpc

  subnetwork = google_compute_subnetwork.private-default["tms"].id
  role       = "roles/compute.networkUser"
  region     = var.google_region
  member     = format("serviceAccount:%s", google_service_account.tfc-agent-sa.email)
}

# KMS
# -----------------------------------------------
# Create a KMS key ring to store the key
resource "google_kms_key_ring" "tfc_agent_key_ring" {
  provider = google.sharedvpc

  location = var.google_region
  name     = "${local.prefix}-${var.env_name}-tfc-agent-key-ring"
}

# Create a crypto key in the generated above key ring
resource "google_kms_crypto_key" "tfc_agent_crypto_key" {
  provider = google.sharedvpc

  name            = "${local.prefix}-${var.env_name}-tfc-agent-crypto-key"
  key_ring        = google_kms_key_ring.tfc_agent_key_ring.id
  rotation_period = "100000s"
  lifecycle {
    prevent_destroy = true
  }
}

data "google_project" "sharedvpc_project" {
  provider = google.sharedvpc
}

# Giving permissions to Service account to use the key
resource "google_kms_crypto_key_iam_binding" "tfc_agent_crypto_key_iam_binding" {
  provider = google.sharedvpc

  crypto_key_id = google_kms_crypto_key.tfc_agent_crypto_key.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  members       = [
    format("serviceAccount:service-%s@compute-system.iam.gserviceaccount.com", data.google_project.sharedvpc_project.number),
  ]
}

# Reserve IP 
# -----------------------------------------------
resource "google_compute_address" "tfc-agent-private" {
  provider = google.sharedvpc
  count    = local.safi_environments[var.env_name].tfc_agents

  name         = "${local.prefix}-${var.env_name}-tfc-agent-${count.index}-v2"
  subnetwork   = google_compute_subnetwork.private-default["tms"].id
  address_type = "INTERNAL"
  region       = var.google_region
}

# Instance
# -----------------------------------------------
resource "google_compute_instance" "tfc-agent" {
  provider = google.sharedvpc
  count    = local.safi_environments[var.env_name].tfc_agents

  name         = "${local.prefix}-${var.env_name}-tfc-agent-${count.index}-v2"
  machine_type = local.tfc_agent_machine_type[var.env_name]
  zone         = var.google_zone

  tags = ["tfc-agents"]

  boot_disk {
    initialize_params {
      image = "cos-cloud/cos-97-16919-103-28"
    }
    kms_key_self_link = google_kms_crypto_key.tfc_agent_crypto_key.id
  }

  labels = {
    container-vm = "cos-97-16919-103-28"
  }

  metadata = {
    google-logging-enabled    = "true"
    gce-container-declaration = yamlencode({
      spec = {
        containers = [
          {
            image = "docker.io/hashicorp/tfc-agent:1.3"
            name  = "${local.prefix}-${var.env_name}-tfc-agent-${count.index}"
            env   = [
              {
                name  = "TFC_AGENT_TOKEN"
                value = var.tfc_agent_token
              },
              {
                name  = "TFC_AGENT_NAME"
                value = "${local.prefix}-${var.env_name}-tfc-agent-${count.index}"
              },
              {
                name  = "TFC_AGENT_SINGLE"
                value = true
              }
            ]
          }
        ]
        stdin         = false
        tty           = false
        restartPolicy = "Always"
      }
    })
  }

  network_interface {
    network    = google_compute_network.shared_vpc.id
    subnetwork = google_compute_subnetwork.private-default["tms"].id
    network_ip = google_compute_address.tfc-agent-private[count.index].address
  }

  service_account {
    email  = google_service_account.tfc-agent-sa.email
    scopes = ["cloud-platform", "userinfo-email"]
  }
}
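
Other workspaces read the agents' reserved addresses through a tfc_agents_ip_addresses output (see Integration). Its definition is not shown on this page; a minimal sketch, assuming it sits next to the google_compute_address resources above:

```hcl
output "tfc_agents_ip_addresses" {
  description = "Internal IP addresses reserved for the TFC agents, one per agent."
  value       = google_compute_address.tfc-agent-private[*].address
}
```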

Integration:

The agents are added to firewall rules and authorized networks as needed, so that they can run the Terraform commands that modify the resources. For GKE, for example, we add them to the cluster's authorized networks:

authorized_cidrs = concat(
  [
    {
      cidr         = local.safi_network.cicd.k8s.nodes
      display_name = "Argo CD K8s nodes"
    },
  ],
  [
    for index, address in data.terraform_remote_state.common_workspace.outputs.tfc_agents_ip_addresses : {
      cidr         = "${address}/32"
      display_name = "TFC Agent #${index}"
    }
  ]
)
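
For firewall rules, the agent instances carry the network tag tfc-agents, which a rule can reference as its source. The actual rules are not shown on this page; the sketch below is an assumption, with a hypothetical rule name, target tag, and port:

```hcl
resource "google_compute_firewall" "allow-tfc-agents-example" {
  provider = google.sharedvpc

  name      = "${local.prefix}-${var.env_name}-allow-tfc-agents-example"
  network   = google_compute_network.shared_vpc.id
  direction = "INGRESS"

  # Traffic originating from the agent VMs (tagged in google_compute_instance.tfc-agent).
  source_tags = ["tfc-agents"]
  # Hypothetical tag on the instances the agents need to reach.
  target_tags = ["example-target"]

  allow {
    protocol = "tcp"
    ports    = ["443"] # assumed port
  }
}
```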

Limitations:

With our setup of multiple workspaces, it takes a while for the agents to provision a whole new environment. The more agents an environment has, the faster it goes.

Starting and shutting down the agents takes a long time.

If you provision resources (for example GKE) with one agent and then add another, you might run into issues because of the way we add the agents to authorized_networks.

New agents are really expensive.

Scalability:

The agents scale well: the more you have, the faster the resources are provisioned.

Scalability depends on buying more agent licenses.