
Zero Trust Architecture with Terraform IaC on AWS


This project documents my experience building a production-grade Zero Trust Architecture deployed entirely through Terraform Infrastructure as Code. The goal was to implement NIST SP 800-207 principles — default deny, continuous verification, microsegmentation — in a 13-container stack that provisions with a single terraform apply command and destroys in 30 seconds.

Scenario: FedSecure Financial needed a production-grade Zero Trust stack that could be versioned in Git, reproduced on any machine, and destroyed cleanly. A manual deployment of 30+ commands was replaced by 28 Terraform-managed resources across two modules.

Architecture Overview

The architecture implements five Zero Trust planes, each managed by Terraform:

  1. Identity Plane — Keycloak HA (2 nodes) behind nginx TLS load balancer, backed by PostgreSQL. Issues JWT tokens for every request.
  2. Certificate Plane — step-ca private CA with the ACME protocol. Issues and auto-renews TLS certificates for the service mesh.
  3. Policy Plane — OPA with Rego policies served via bundle server (GitOps pattern). Default deny on every request.
  4. Secrets Plane — AWS Secrets Manager stores all credentials. Zero hardcoded passwords in code or config.
  5. Observability Plane — AWS CloudWatch alarms, logs, and metrics for continuous monitoring.

Keycloak HA Identity Cluster

Two Keycloak nodes behind an nginx TLS load balancer, backed by PostgreSQL. Infinispan TCP clustering enables session replication — if one node dies, the other serves requests without session loss. The nginx load balancer terminates TLS using a Terraform-generated self-signed certificate.

modules/identity/main.tf — Keycloak Node 1
resource "docker_container" "keycloak_1" {
  name    = "keycloak-1"
  image   = docker_image.keycloak.image_id
  command = ["start"]  # NOT --optimized (see warning below)
  env = [
    "KC_DB=postgres",
    "KC_DB_URL=jdbc:postgresql://keycloak-db:5432/keycloak",
    "KC_HOSTNAME_STRICT=false",
    "KC_PROXY_HEADERS=xforwarded",
    "KC_HTTP_ENABLED=true",
    "KC_HEALTH_ENABLED=true",
    "KC_CACHE=ispn",
    "KC_CACHE_STACK=tcp",  # Infinispan clustering for HA
    "KEYCLOAK_ADMIN=admin",
    "KEYCLOAK_ADMIN_PASSWORD=${var.keycloak_admin_password}",
  ]
}
Keycloak --optimized pitfall: The stock Keycloak Docker image is pre-built with H2 drivers. Using command = ["start", "--optimized"] skips the build step, so KC_DB=postgres is silently ignored and Keycloak crash-loops trying to parse a PostgreSQL URL as H2. Always use command = ["start"] with stock images.
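Once the cluster is up, clients obtain JWTs from Keycloak's OpenID Connect token endpoint. Below is a minimal stdlib-only sketch of the resource-owner-password flow; the realm name ("zta") and client ID ("gateway") are hypothetical placeholders, not values from this project.

```python
import json
import urllib.parse
import urllib.request

def build_token_request(base_url, realm, client_id, username, password):
    """Build the OIDC password-grant request for Keycloak's token endpoint."""
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    body = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }).encode()
    return url, body

def fetch_access_token(base_url, realm, client_id, username, password):
    """POST the form and return the short-lived JWT access token."""
    url, body = build_token_request(base_url, realm, client_id, username, password)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

Because the nginx load balancer fronts both nodes, the client never needs to know which Keycloak instance answered — session replication via Infinispan makes them interchangeable.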

Envoy Service Mesh & Sidecar Pattern

Each microservice runs with an Envoy sidecar proxy sharing the same network namespace via Docker’s network_mode = "container:${...}". The sidecar handles TLS termination, authorization, and routing — the application code never touches security. This is the same pattern used by Istio and AWS App Mesh in production.

modules/mesh/main.tf — Sidecar Pattern
# Parent container owns the network namespace and ports
resource "docker_container" "api_gateway" {
  name  = "api-gateway"
  image = docker_image.api_gateway.image_id
  ports {
    internal = 9443
    external = 9443
  }
  networks_advanced {
    name = data.docker_network.mesh.id
  }
}

# Envoy sidecar shares parent's network — no ports here
resource "docker_container" "envoy_gw" {
  name         = "envoy-api-gateway"
  image        = docker_image.envoy.image_id
  network_mode = "container:${docker_container.api_gateway.id}"
  depends_on   = [docker_container.api_gateway]
}
Why ports go on the parent: Docker’s network_mode = "container:..." shares the network namespace. The sidecar has no network of its own — publishing ports on it causes a “conflicting options” error. All ports must be declared on the parent container.

OPA Policy Engine

Open Policy Agent evaluates every request against Rego policies distributed via a bundle server. The bundle is a tarball served by nginx — update the Rego file in Git, rebuild the bundle, and OPA picks up the new policy automatically within 10–30 seconds. This is the GitOps pattern for policy management.

files/bundles/authz.rego — Default Deny Policy
package authz

default allow := false

allow if {
    input.method == "GET"
    input.path == "/health"
}

allow if {
    valid_token
    has_required_role
}
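A useful habit with bundle-distributed policies is keeping a plain-language mirror of the decision logic for unit tests. The sketch below mirrors the Rego above in Python; valid_token and has_required_role are stand-ins for helper rules defined elsewhere in the bundle, and the "analyst" role is a hypothetical example.

```python
def valid_token(inp):
    """Stand-in for the bundle's JWT-validation helper rule."""
    return bool(inp.get("token_valid"))

def has_required_role(inp):
    """Stand-in for the bundle's role-check helper rule ('analyst' is hypothetical)."""
    return "analyst" in inp.get("roles", [])

def allow(inp):
    """Mirror of authz.rego: default deny, two allow rules."""
    # allow if { input.method == "GET"; input.path == "/health" }
    if inp.get("method") == "GET" and inp.get("path") == "/health":
        return True
    # allow if { valid_token; has_required_role }
    if valid_token(inp) and has_required_role(inp):
        return True
    # default allow := false
    return False
```

In the live stack the same decision is reached by POSTing the input document to OPA's REST API; the point of the mirror is to document expected decisions alongside the Rego in Git.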

step-ca Private Certificate Authority

Smallstep step-ca provides automated certificate issuance via the ACME protocol — the same protocol Let’s Encrypt uses. It replaces AWS ACM Private CA ($400/month) with a free, self-hosted alternative. Certificates are short-lived and auto-renewed, eliminating the class of attacks that exploit expired or unrotated certificates.

modules/identity/main.tf — step-ca Container
resource "docker_container" "step_ca" {
  name  = "step-ca"
  image = docker_image.step_ca.image_id
  env = [
    "DOCKER_STEPCA_INIT_NAME=FedSecure ZTA CA",
    "DOCKER_STEPCA_INIT_DNS_NAMES=step-ca,localhost",
    "DOCKER_STEPCA_INIT_REMOTE_MANAGEMENT=true",
    "DOCKER_STEPCA_INIT_ACME=true",
  ]
  ports {
    internal = 9000
    external = 9000
  }
  healthcheck {
    test = ["CMD", "step", "ca", "health",
            "--ca-url", "https://localhost:9000",
            "--root", "/home/step/certs/root_ca.crt"]
  }
}
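Short-lived certificates only help if renewal happens well before expiry. A minimal sketch of that decision, using only the standard library: the notAfter string is the one ssl.SSLSocket.getpeercert() returns, and the half-life renewal threshold is an assumption, not a value from this project.

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after is the 'notAfter' string from ssl.SSLSocket.getpeercert(),
    e.g. 'Jan 5 09:34:43 2030 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400.0

def needs_renewal(not_after, threshold_days=0.5, now=None):
    """With 24h leaf certificates, renewing at roughly half-life keeps
    a comfortable margin against clock skew and restart delays."""
    return days_until_expiry(not_after, now) < threshold_days
```

In practice the step CLI handles this loop itself (step ca renew supports a daemon mode), but the check above shows what "auto-renewed" means operationally.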

AWS Secrets Manager

Four secrets are stored in AWS Secrets Manager — zero hardcoded passwords anywhere in the codebase. Terraform creates both the secret and its initial version, so credentials are provisioned as part of the infrastructure lifecycle.

modules/identity/main.tf — Secrets Manager
resource "aws_secretsmanager_secret" "keycloak_admin" {
  name = "${var.project_name}/keycloak-admin"
  tags = { Project = var.project_name }
}
resource "aws_secretsmanager_secret_version" "keycloak_admin" {
  secret_id     = aws_secretsmanager_secret.keycloak_admin.id
  secret_string = jsonencode({
    username = "admin",
    password = var.keycloak_admin_password
  })
}
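On the consuming side, services fetch credentials at runtime rather than baking them into images. A hedged sketch with boto3 (region and secret path are illustrative assumptions; the JSON shape matches the jsonencode call above):

```python
import json

def parse_admin_secret(secret_string):
    """Parse the JSON document written by aws_secretsmanager_secret_version."""
    doc = json.loads(secret_string)
    return doc["username"], doc["password"]

def fetch_admin_credentials(secret_id, region="us-east-1"):
    """Fetch the secret at runtime. Requires boto3 and AWS credentials
    in the environment; imported lazily so the parser stays dependency-free."""
    import boto3
    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId=secret_id)
    return parse_admin_secret(resp["SecretString"])
```

Keeping the parsing separate from the AWS call makes the credential format testable without touching the cloud.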

Verification Suite

An 11-test automated verification script validates the entire stack — from individual container health to end-to-end authorization flow. Mesh service health checks use docker exec with Python’s urllib (Flask containers don’t include curl).

verify.sh — Default Deny Test
# Test: Default deny (no token = rejected)
DENIED=$(curl -s http://localhost:9443/api/v1/data | \
  python3 -c "import sys,json; print(json.load(sys.stdin).get('error',''))")
if echo "$DENIED" | grep -q "missing token"; then
  echo "PASS: Default deny (no token)"
fi
All 11 tests passing: Container count (13/13), Keycloak health, OPA health, mesh services (api-gateway, retriever, analyzer), end-to-end flow, default deny, Secrets Manager (4 secrets), CloudWatch alarms, step-ca health.

Security Controls

  • Default deny — OPA: default allow := false
  • Microsegmentation — 3 isolated Docker networks: frontend, mesh, data (internal-only)
  • mTLS-ready — step-ca ACME + Envoy sidecars at every service
  • Secrets management — AWS Secrets Manager, no hardcoded passwords
  • HA identity — 2 Keycloak nodes with Infinispan clustering, nginx TLS load balancer
  • Continuous monitoring — CloudWatch alarms for auth failures and cert expiry
  • Infrastructure as Code — Terraform state tracking, drift detection, reproducible deploys

Lessons Learned

  1. Keycloak --optimized flag — causes silent failures with PostgreSQL on stock Docker images. The build step is required to switch from H2 to PostgreSQL drivers.
  2. nginx DNS timing — crashes at startup if upstream hostnames aren’t resolvable yet. Fix: add resolver 127.0.0.11 for Docker DNS and depends_on for both Keycloak nodes.
  3. Sidecar port ownership — Docker’s sidecar pattern requires all ports on the parent container. Publishing ports on the sidecar causes a “conflicting options” error.
  4. Per-service Envoy configs — each sidecar needs its own YAML config with the correct stat_prefix and upstream port. Sharing a single config causes routing failures.
  5. Terraform stale plans — crash-looping containers change state between plan and apply. Use terraform apply -auto-approve so the plan executes immediately instead of going stale during a manual approval pause.
  6. Container health checks without curl — Flask containers don’t include curl. Use docker exec with python3 urllib.request instead.
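The urllib trick from lesson 6 can be sketched as a one-function stand-in for curl -sf, runnable inside minimal Flask images via docker exec with python3 -c (the /health path is an assumption about the services' endpoints):

```python
import urllib.request

def check_health(url, timeout=3.0):
    """Stdlib stand-in for `curl -sf`: True only on a clean HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError (refused, DNS, timeout) subclasses OSError
        return False
```

Catching OSError covers connection-refused and timeout cases, so the verification script gets a clean boolean instead of a traceback when a container is still starting.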

NIST SP 800-207 Zero Trust Mapping

  • Tenet T1 — All data sources and computing services are resources. Each of the 13 containers is treated as an independent resource with its own identity and policy evaluation.
  • Tenet T2 — All communication is secured regardless of network location. TLS on nginx, mTLS-ready Envoy mesh, internal-only data network.
  • Tenet T3 — Access to individual resources is granted on a per-session basis. Keycloak issues short-lived JWT tokens with TTL.
  • Tenet T4 — Access is determined by dynamic policy. OPA evaluates every request against Rego rules in real time.
  • Tenet T6 — Authentication and authorization are strictly enforced before access is granted. Keycloak (authn) + OPA (authz) + Envoy (enforcement) pipeline.
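The per-session tokens behind Tenet T3 carry their own TTL in the exp claim. A stdlib-only sketch of reading it — note this decodes without verifying the signature, which is fine for inspection but never for authorization (real verification checks the signature against Keycloak's published keys):

```python
import base64
import json
import time

def jwt_payload(token):
    """Decode a JWT's payload segment. No signature check: illustration only."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

def is_expired(token, now=None):
    """True once the token's exp (epoch seconds) has passed."""
    exp = jwt_payload(token).get("exp", 0)
    now = time.time() if now is None else now
    return now >= exp
```

Short TTLs mean a stolen token is only useful for minutes, which is the practical payoff of per-session access grants.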

Cost Analysis

step-ca replaces AWS ACM Private CA ($400/month) with a free, self-hosted alternative. All 13 containers run locally on Docker at zero cost. The only AWS charges are Secrets Manager (4 secrets at ~$2.40/month) and CloudWatch (~$4.00/month for logs, alarms, and metrics). Total: ~$6.40/month for a production-grade Zero Trust stack.

References

  1. NIST SP 800-207 — Zero Trust Architecture
  2. HashiCorp — Terraform Documentation
  3. Keycloak — Server Administration Guide
  4. Envoy Proxy — Configuration Reference
  5. Open Policy Agent — Policy Language (Rego)
  6. Smallstep — step-ca Documentation
  7. AWS — Secrets Manager User Guide
  8. AWS — CloudWatch User Guide