
AI-Enhanced Zero Trust: Anomaly Detection & Automated Incident Response

AI-Enhanced Zero Trust Architecture
INGEST (OPA Logs) → TRAIN (Isolation Forest) → DETECT (Anomaly Scores) → RESPOND (Bedrock Claude) → OBSERVE (CloudWatch)

This module adds an AI/ML detection layer on top of a production Zero Trust Architecture. The core problem: CloudWatch alarms fire on static thresholds, which means a misconfigured service and a real credential-theft attack look identical. SOC analysts waste hours manually correlating OPA decision logs to distinguish the two. This project automates that triage with an Isolation Forest model that learns what "normal" looks like and flags behavioral deviations, then hands anomalies to Amazon Bedrock Claude for structured incident analysis.

Real-world context: The 2024 Snowflake breach used stolen credentials to perform bulk data exports at unusual hours from unknown IPs. Threshold-based monitoring missed it entirely because individual request volumes stayed below alarm limits. Behavioral anomaly detection would have flagged the access pattern immediately.

Architecture Overview

The AI layer reads from the existing ZTA stack but never modifies it. OPA decision logs feed into the Isolation Forest for scoring. When anomalies exceed a threshold, Bedrock Claude generates a SOC-ready incident report. CloudWatch receives custom metrics for dashboarding and alarming. This "observe-only" design means the AI layer adds detection capability without introducing new attack surface.

Feature Engineering

The detector extracts 7 features from each OPA decision log entry. The key insight is that each feature captures a different dimension of "normal" behavior, so deviations along any axis signal a potential threat.

anomaly_detector.py — Feature Extraction
def _extract_features(self, entry):
    # 7 features per OPA decision log entry
    path = entry["input"]["path"]
    return [
        entry["timestamp"].hour,            # hour_of_day   (0-23)
        self.method_enc.transform([         # method        (GET/POST/PUT/DELETE)
            entry["input"]["method"]
        ])[0],
        self.role_enc.transform([           # role          (analyst/admin/service)
            entry["input"]["role"]
        ])[0],
        1 if entry["result"] else 0,        # result        (allowed=1, denied=0)
        int(hashlib.md5(                    # ip_hash       (normalized MD5)
            entry["input"]["source_ip"].encode()
        ).hexdigest()[:8], 16) / 0xFFFFFFFF,
        len(path),                          # path_length   (URL length)
        1 if path in KNOWN_PATHS else 0,    # is_known_path (whitelist check)
    ]
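
The `method_enc` and `role_enc` encoders must be fit on the categorical vocabularies from the training logs before `transform` can be called on new entries. A minimal sketch of that fitting step, assuming scikit-learn's `LabelEncoder` (the actual encoder class isn't shown in the listing above):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical fitting step: vocabularies come from the training logs.
method_enc = LabelEncoder().fit(["GET", "POST", "PUT", "DELETE"])
role_enc = LabelEncoder().fit(["analyst", "admin", "service"])

# LabelEncoder sorts classes alphabetically, so each categorical value
# maps to a stable integer the model can consume.
method_code = int(method_enc.transform(["DELETE"])[0])  # DELETE -> 0
role_code = int(role_enc.transform(["analyst"])[0])     # analyst -> 1
```

An unseen value (e.g. a brand-new role) raises a `ValueError` at transform time, so production code would need a fallback bucket for unknown categories.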

Anomaly Detection: Isolation Forest

Isolation Forest is an unsupervised algorithm that isolates anomalies by randomly partitioning feature space. Normal points require many splits to isolate; anomalies require few. The model trains on 3,500 "normal" OPA log entries and assigns each new entry an anomaly score. Entries scoring below the contamination threshold are flagged.

anomaly_detector.py — Model Training
from sklearn.ensemble import IsolationForest

self.model = IsolationForest(
    n_estimators=200,      # 200 isolation trees
    contamination=0.07,    # expect 7% anomalous data
    random_state=42,       # reproducible results
    n_jobs=-1,             # use all CPU cores
)

# Train on historical normal traffic
self.model.fit(feature_matrix)
# Baseline scores: min=-0.03, max=0.29, mean=0.18

Tuning the contamination parameter: starting at 0.005 gave only 7% detection. Incrementally raising it to 0.07 achieved 80%+ detection across all five anomaly categories. In production, you would tune this against labeled incident data to balance false positives against missed detections.
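
Scoring then uses the trained model's `decision_function` (higher means more normal) and `predict` (1 for normal, -1 for anomalous). A minimal sketch with a synthetic stand-in for the 7-feature matrix:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Stand-in for the real feature matrix: a tight cluster of "normal" rows.
normal = rng.normal(loc=0.5, scale=0.05, size=(500, 7))

model = IsolationForest(n_estimators=200, contamination=0.07,
                        random_state=42, n_jobs=-1).fit(normal)

# An entry far outside the training cluster isolates in very few splits,
# so it scores low and predict() flags it as -1.
outlier = np.full((1, 7), 5.0)
is_anomaly = model.predict(outlier)[0] == -1
score_gap = (model.decision_function(normal).mean()
             - model.decision_function(outlier)[0])  # positive gap
```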

Anomaly Categories Detected

  • Privilege escalation — analyst issuing DELETE requests outside their role
  • Credential theft — valid user from an unknown IP address
  • Unusual hours — 3 AM access from a 9-to-5 user
  • Burst denials — rapid-fire denied requests (brute-force scanning)
  • Suspicious admin — admin from a new IP at an unusual hour
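
For illustration, a hypothetical log entry matching the credential-theft pattern above (all field values invented; the unknown IP uses a TEST-NET-3 address):

```python
from datetime import datetime

suspicious_entry = {
    "timestamp": datetime(2024, 6, 1, 3, 12),  # 03:12 -- unusual hour
    "input": {
        "method": "GET",
        "role": "analyst",                     # valid user...
        "source_ip": "203.0.113.77",           # ...from an unknown IP
        "path": "/api/exports/bulk",           # bulk-export style path
    },
    "result": True,                            # request was allowed by OPA
}
```

No single field here trips a static threshold; it is the combination of hour, IP hash, and path that pushes the anomaly score over the line.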

Bedrock Claude Incident Response

When the pipeline detects 3+ anomalies in a scoring window, it packages the flagged entries and sends them to Claude 3 Haiku via Amazon Bedrock with a structured SOC analyst system prompt. Claude returns a formatted incident report covering severity, root cause, affected assets, recommended actions, and indicators of compromise.

incident_responder.py — Bedrock Claude Integration
self.bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = self.bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "system": "You are a Senior SOC Analyst specializing in Zero Trust...",
        "messages": [{"role": "user", "content": anomaly_summary}],
        "max_tokens": 1500,
        "temperature": 0.1   # low temperature for consistent analysis
    })
)

Cost efficiency: Each incident analysis costs $0.01–$0.05 using Claude 3 Haiku. At even 10 incidents per day, the monthly Bedrock cost stays under $15 — far less than the analyst hours saved by automated triage.
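
The `invoke_model` response body is a streaming payload; in the Anthropic messages format, the report text sits under `content[0]["text"]`. A sketch of the parsing step, using a simulated response in place of a live call (`parse_incident_report` is an illustrative name):

```python
import io
import json

def parse_incident_report(response):
    """Pull the report text out of a Bedrock invoke_model response."""
    payload = json.loads(response["body"].read())
    # Anthropic messages responses carry text in content[0]["text"].
    return payload["content"][0]["text"]

# Simulated response; a real call returns a botocore StreamingBody.
fake = {"body": io.BytesIO(json.dumps(
    {"content": [{"type": "text", "text": "SEVERITY: HIGH"}]}).encode())}
report = parse_incident_report(fake)
```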

CloudWatch Integration

Two custom metrics are pushed to the FedSecure/ZeroTrust namespace: AnomalyScore (count of anomalies) and AnomalyRate (percentage flagged). A CloudWatch alarm fires when AnomalyScore exceeds 3 in a 5-minute window, which can trigger SNS notifications to the on-call team.
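
The metric push can be sketched as below; `build_anomaly_metrics` is an illustrative helper, not the actual module API, and the payload shape follows CloudWatch's `put_metric_data`:

```python
def build_anomaly_metrics(anomaly_count, total_scored):
    """MetricData payload for the FedSecure/ZeroTrust namespace."""
    rate = 100.0 * anomaly_count / max(total_scored, 1)
    return [
        {"MetricName": "AnomalyScore", "Value": float(anomaly_count),
         "Unit": "Count"},
        {"MetricName": "AnomalyRate", "Value": rate, "Unit": "Percent"},
    ]

metrics = build_anomaly_metrics(anomaly_count=4, total_scored=50)
# With AWS credentials configured, the push is a single boto3 call:
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="FedSecure/ZeroTrust", MetricData=metrics)
```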

Pipeline Orchestration

The zt_ai_pipeline.py script chains all stages: load the trained model, score recent OPA logs, push metrics to CloudWatch, and (if anomalies cross the threshold) trigger Bedrock Claude analysis and generate a timestamped PDF incident report. A single command runs the full detect-to-report workflow.
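
The orchestration reduces to a short chain of stages. The sketch below uses illustrative function names passed in as callables, not the actual zt_ai_pipeline.py API:

```python
ANOMALY_THRESHOLD = 3  # matches the CloudWatch alarm threshold

def run_pipeline(load_model, score_logs, push_metrics, analyze, write_pdf):
    model = load_model()                     # load trained Isolation Forest
    anomalies, total = score_logs(model)     # score recent OPA logs
    push_metrics(len(anomalies), total)      # CloudWatch custom metrics
    if len(anomalies) >= ANOMALY_THRESHOLD:
        report = analyze(anomalies)          # Bedrock Claude triage
        return write_pdf(report)             # timestamped incident PDF
    return None                              # below threshold: no report

# Wiring with stubs to show the control flow end to end:
result = run_pipeline(
    load_model=lambda: "model",
    score_logs=lambda m: (["e1", "e2", "e3"], 50),
    push_metrics=lambda n, t: None,
    analyze=lambda entries: "incident analysis",
    write_pdf=lambda text: "incident_report.pdf",
)
```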

Security Assessment

A self-assessment of the AI layer identifies 5 findings (2 Medium, 3 Low) covering synthetic training data risk, missing drift detection, lack of human-in-the-loop review, limited feature set, and on-demand (vs. event-driven) pipeline execution. Each finding includes severity, impact, and recommended remediation.

Technologies

scikit-learn · pandas · Amazon Bedrock · CloudWatch · Python 3.12 · boto3 · fpdf2 · OPA / Rego · Zero Trust · Docker

Lessons Learned

  1. Contamination tuning is everything. An initial setting of 0.005 detected only 7% of anomalies. Incrementally testing 0.01, 0.02, 0.05, 0.06, and 0.07 found the sweet spot at 80%+ detection without flooding false positives.
  2. Feature engineering beats model complexity. Seven well-chosen features from raw JSON logs outperformed more complex approaches. The IP hash and is_known_path features alone caught credential theft and reconnaissance.
  3. Bedrock Claude needs low temperature for SOC work. Temperature 0.1 produces consistent, structured reports. Higher values introduced creative but unreliable severity assessments.
  4. fpdf2 chokes on Unicode. Em dashes and smart quotes in Claude's output caused latin-1 encoding errors. Sanitizing Unicode to ASCII before PDF rendering fixed it.
  5. WSL pip is externally managed. Python packages must be installed in a venv or with --break-system-packages. PowerShell pip worked without this restriction.
  6. The AI layer must be read-only. It reads OPA logs and writes to CloudWatch/PDF, but never modifies the ZTA pipeline. This "observe-only" constraint prevents the AI from becoming a new attack surface.
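
The fix from lesson 4 can be sketched as a small sanitizer (`sanitize_for_pdf` is an illustrative name) that maps common Unicode punctuation to ASCII before fpdf2's latin-1 rendering sees it:

```python
REPLACEMENTS = {
    "\u2014": "-", "\u2013": "-",    # em dash, en dash
    "\u201c": '"', "\u201d": '"',    # smart double quotes
    "\u2018": "'", "\u2019": "'",    # smart single quotes
    "\u2026": "...",                 # ellipsis
}

def sanitize_for_pdf(text):
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Anything still outside latin-1 degrades to '?' instead of crashing.
    return text.encode("latin-1", "replace").decode("latin-1")
```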

References

  1. scikit-learn Isolation Forest documentation
  2. Amazon Bedrock Claude API reference
  3. NIST SP 800-207 Zero Trust Architecture
  4. 2024 Snowflake breach analysis — credential theft via anomalous bulk exports
  5. CloudWatch custom metrics and alarms documentation