
AI/ML SecOps — Production MLOps Pipeline on AWS with Terraform

MLOps Pipeline Architecture: Provision (Terraform · VPC · IAM) → Train & Register (MLflow · scikit-learn) → Serve (SageMaker · ECS Fargate) → Monitor (Lambda · CloudWatch) → Assess (NIST 800-53 · fpdf2)

This project documents my experience deploying a complete fraud detection MLOps pipeline to production on AWS using Terraform Infrastructure as Code. The goal was to take a local lab prototype and build enterprise-grade infrastructure — network isolation, encrypted storage, automated drift detection, and compliance-mapped security assessment — that a real security operations team could rely on.

Scenario: CyberNova Solutions completed a lab-based security assessment of their fraud detection pipeline. The CISO authorized production deployment on the condition that every resource follows least-privilege IAM, encrypts data at rest and in transit, and is auditable via CloudTrail.

Architecture Overview

The system provisions 40+ AWS resources across an integrated pipeline:

  1. VPC — Isolated network with public and private subnets across 2 availability zones
  2. ECS Fargate — Serverless container hosting for MLflow behind an Application Load Balancer
  3. RDS PostgreSQL — MLflow metadata backend replacing SQLite for concurrency and durability
  4. S3 — Encrypted, versioned bucket for model artifact storage (see the Terraform sketch after this list)
  5. ECR — Private container registry with immutable tags and vulnerability scanning
  6. SageMaker — Model serving endpoint for real-time fraud predictions
  7. Lambda + EventBridge — Automated drift detection on a recurring schedule
  8. CloudWatch + SNS — Alarms, dashboards, and email alerting
Tech stack: Terraform · VPC · ECS Fargate · RDS PostgreSQL · S3 · ECR · SageMaker · Lambda · CloudWatch · Secrets Manager · Docker · Python / MLflow · NIST 800-53
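As a representative slice of the Terraform, here is a sketch of the artifact bucket from item 4: versioning, default encryption, and public access blocking. The file and resource names are illustrative, not the project's exact code.

terraform/s3.tf (sketch)
# Encrypted, versioned bucket for MLflow model artifacts
resource "aws_s3_bucket" "artifacts" {
  bucket = "${var.project_name}-mlflow-artifacts"
}

resource "aws_s3_bucket_versioning" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" }
  }
}

# CIS AWS 2.1.1: no public access, ever
resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket                  = aws_s3_bucket.artifacts.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}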

VPC & Defense in Depth

The VPC design places the ALB in public subnets while keeping ECS, RDS, and Lambda in private subnets. An attacker must chain three exploits — compromise the ALB, pivot to ECS, then reach the database — instead of hitting the database directly. This is defense in depth per NIST SP 800-53 SC-7 (Boundary Protection).

terraform/vpc.tf
# Private Subnets — ECS, RDS, Lambda (no internet access)
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "${var.project_name}-private-${count.index + 1}" }
}
Why not put everything in one public subnet? Because then every resource gets a public IP and is reachable from the internet: if MLflow has a vulnerability, the attacker pivots directly to the database. With private subnets, the database has no public IP, and the only path to it is through the ECS containers behind the ALB.
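The security group chain enforces the same idea at the packet level. A minimal sketch, assuming illustrative resource names (egress rules omitted for brevity): each tier accepts traffic only from the tier in front of it.

terraform/security_groups.tf (sketch)
# ALB: accepts HTTPS from the internet
resource "aws_security_group" "alb" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ECS: accepts MLflow traffic (5000) only from the ALB
resource "aws_security_group" "ecs" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port       = 5000
    to_port         = 5000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}

# RDS: accepts PostgreSQL (5432) only from ECS tasks
resource "aws_security_group" "rds" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs.id]
  }
}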

MLflow on ECS Fargate

MLflow runs as a containerized service on ECS Fargate. The Docker image uses a multi-stage build to minimize attack surface, runs as a non-root user (mlflow:1001), and includes health checks for ALB integration. RDS PostgreSQL replaces SQLite for concurrent writes, and S3 replaces the local filesystem for durable, encrypted artifact storage.

docker/Dockerfile
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
# requirements.txt pins mlflow==2.17.0, psycopg2-binary, and boto3
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.11-slim AS runner
WORKDIR /app

COPY --from=builder /usr/local/lib/python3.11/site-packages \
     /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Non-root user for security
RUN useradd -r -u 1001 mlflow
USER mlflow

EXPOSE 5000
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')"

# Backend store and artifact root come from environment variables injected
# by the ECS task definition (variable names assumed; DB URI via Secrets Manager)
CMD mlflow server --host 0.0.0.0 --port 5000 \
    --backend-store-uri "$MLFLOW_BACKEND_STORE_URI" \
    --default-artifact-root "$MLFLOW_ARTIFACT_ROOT"
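For context, a hedged sketch of how that image plugs into Fargate. The family name, sizing, and secret reference below are assumptions, but the pattern matches the pipeline described above: the artifact root points at S3, and the RDS connection string is injected from Secrets Manager rather than baked into the task definition.

terraform/ecs.tf (sketch)
resource "aws_ecs_task_definition" "mlflow" {
  family                   = "${var.project_name}-mlflow"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name         = "mlflow"
    # ECR tags are immutable, so each deploy pins a fresh tag
    image        = "${aws_ecr_repository.mlflow.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 5000 }]
    environment  = [
      { name = "MLFLOW_ARTIFACT_ROOT", value = "s3://${aws_s3_bucket.artifacts.id}" }
    ]
    # Database URI resolved at launch from Secrets Manager, never stored in the task
    secrets = [{
      name      = "MLFLOW_BACKEND_STORE_URI"
      valueFrom = aws_secretsmanager_secret.db_uri.arn
    }]
  }])
}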

Drift Detection with Lambda

A Lambda function triggers on an EventBridge schedule to compare baseline prediction distributions against current SageMaker endpoint output. Two statistical tests determine drift: PSI (Population Stability Index) measures distribution shift, and a KS Test provides a non-parametric significance check. Drift events publish custom CloudWatch metrics and trigger alarms via SNS.

lambda/drift_monitor.py
import numpy as np
from scipy import stats

def compute_psi(expected, actual, buckets=10):
    """Population Stability Index for drift detection."""
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    expected_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pct   = np.histogram(actual, breakpoints)[0] / len(actual)
    # Clamp empty buckets so the log term never sees 0/0 or log(0)
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct   = np.clip(actual_pct, 1e-6, None)
    return float(np.sum(
        (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
    ))

# Thresholds: PSI > 0.2 is the conventional "significant shift" cutoff;
# 0.05 is the significance level for the KS test p-value
DRIFT_THRESHOLD_PSI = 0.2
DRIFT_THRESHOLD_KS  = 0.05

# baseline_scores come from S3; current_scores from the live SageMaker endpoint
ks_stat, ks_pvalue = stats.ks_2samp(baseline_scores, current_scores)
psi = compute_psi(baseline_scores, current_scores)
drift_detected = (ks_pvalue < DRIFT_THRESHOLD_KS) or (psi > DRIFT_THRESHOLD_PSI)
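The scheduling side is three small Terraform resources. A sketch, with the rule name assumed and the hourly rate inferred from the alarm's 3600-second period:

terraform/eventbridge.tf (sketch)
resource "aws_cloudwatch_event_rule" "drift_schedule" {
  name                = "${var.project_name}-drift-schedule"
  schedule_expression = "rate(1 hour)"
}

resource "aws_cloudwatch_event_target" "drift_lambda" {
  rule = aws_cloudwatch_event_rule.drift_schedule.name
  arn  = aws_lambda_function.drift_monitor.arn
}

# EventBridge needs explicit permission to invoke the function
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.drift_monitor.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.drift_schedule.arn
}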

Least-Privilege IAM

Four scoped IAM roles contain the blast radius per NIST AC-6 (Least Privilege). A compromised Lambda can only read S3, invoke the SageMaker endpoint, and publish CloudWatch metrics; it cannot reach the database or modify ECS services. Each role has exactly the permissions it needs and nothing more (a sketch of the Lambda role's policy follows the list).

  • ECS Execution Role — pulls images from ECR and reads Secrets Manager
  • ECS Task Role — accesses S3 artifacts and pushes CloudWatch logs
  • Lambda Execution Role — invokes SageMaker endpoint, publishes CloudWatch metrics
  • SageMaker Execution Role — reads model artifacts from S3
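A sketch of that Lambda policy in Terraform, with the bucket path and endpoint name as illustrative assumptions. Note what is absent: no rds:*, no ecs:*, no iam:*.

terraform/iam.tf (sketch)
resource "aws_iam_role_policy" "drift_lambda" {
  name = "${var.project_name}-drift-lambda-policy"
  role = aws_iam_role.lambda_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "ReadBaseline"
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${aws_s3_bucket.artifacts.arn}/baselines/*"
      },
      {
        Sid      = "InvokeEndpoint"
        Effect   = "Allow"
        Action   = ["sagemaker:InvokeEndpoint"]
        Resource = aws_sagemaker_endpoint.fraud.arn
      },
      {
        Sid      = "PublishMetrics"
        Effect   = "Allow"
        Action   = ["cloudwatch:PutMetricData"]
        Resource = "*" # PutMetricData does not support resource-level scoping
      }
    ]
  })
}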

CloudWatch Monitoring & Alerting

A CloudWatch dashboard provides single-pane observability with four widgets: drift PSI over time, ECS CPU utilization, RDS CPU utilization, and ALB request count. Four alarms with SNS email notifications cover drift detection, ECS high CPU, RDS high CPU, and ALB 5XX errors.

terraform/cloudwatch.tf
resource "aws_cloudwatch_metric_alarm" "drift_detected" {
  alarm_name          = "${var.project_name}-drift-detected"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "DriftDetected"
  namespace           = "CyberNova/MLOps"
  period              = 3600
  statistic           = "Maximum"
  threshold           = 1
  alarm_description   = "Model drift detected — PSI or KS test exceeded threshold"
  alarm_actions       = [aws_sns_topic.mlops_alerts.arn]
}
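The alarm_actions topic fans out to email. A minimal sketch, assuming a var.alert_email variable:

terraform/sns.tf (sketch)
resource "aws_sns_topic" "mlops_alerts" {
  name = "${var.project_name}-mlops-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.mlops_alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # recipient must confirm via the AWS subscription email
}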

Security Assessment

The final deliverable is a PDF security assessment generated with fpdf2, identifying 6 findings mapped to compliance frameworks. Each finding includes severity, evidence, risk analysis, and specific remediation steps.

Assessment results: 2 Critical (no ALB authentication, no TLS), 2 High (plaintext secrets in Terraform state, missing VPC Flow Logs), 2 Medium (no SageMaker auto-scaling, no SIEM integration). All findings are mapped to NIST SP 800-53, the CIS AWS Benchmark, and the OWASP ML Top 10.

Compliance Framework Mapping

  • NIST SC-7 — Boundary Protection: VPC private subnets and security group chain
  • NIST AC-6 — Least Privilege: 4 scoped IAM roles with minimal permissions
  • NIST SC-28 — Protection of Information at Rest: S3 encryption + versioning
  • NIST AU-12 — Audit Record Generation: VPC Flow Logs (remediation finding)
  • NIST IR-6 — Incident Reporting: CloudWatch alarms + SNS notifications
  • NIST CM-2 — Baseline Configuration: All infrastructure managed via Terraform
  • CIS AWS 2.1.1 — S3 public access blocking and encryption
  • CIS AWS 2.3.3 — RDS not publicly accessible
  • OWASP ML05 — Supply Chain Poisoning: ECR immutable tags + vulnerability scanning

Lessons Learned

  1. Secrets Manager deletion protection — When you terraform destroy and re-apply, Secrets Manager does not delete secrets immediately; it schedules them for deletion with a recovery window (7 days at minimum, 30 by default). You must restore and import them before re-creating a secret with the same name. This caught me off guard and is a real operational consideration for teardown/rebuild workflows.
  2. Shell variables do not persist — Commands like ECR_URL=$(terraform output -raw ecr_repository_url) must be re-run in every new terminal session. Terraform state persists on disk, but the shell variable only lives in memory.
  3. WSL native filesystem vs /mnt/c/ — Running Terraform and Docker from ~/ (native Linux filesystem) is significantly faster than /mnt/c/ (Windows mount). File permissions and symlinks also behave more predictably.
  4. Defense in depth is layered, not duplicated — The security group chain ALB(443) → ECS(5000) → RDS(5432) means each hop narrows the attack surface. An attacker must compromise three layers, not just one.
  5. Terraform handles dependency ordering automatically — You do not need to apply after each .tf file. Create all files first, then run a single terraform apply. Terraform builds the dependency graph and provisions resources in the correct order.

References

  1. AWS Documentation — Amazon VPC, ECS Fargate, RDS, S3, ECR, SageMaker, Lambda, CloudWatch
  2. HashiCorp — Terraform AWS Provider Documentation
  3. NIST — SP 800-53 Rev 5 Security and Privacy Controls
  4. CIS — Amazon Web Services Foundations Benchmark
  5. OWASP — Machine Learning Security Top 10
  6. MLflow — Tracking Server Documentation