Introduction
I’ve been exploring how far we can push fully autonomous, multi-agent investigations in real SRE environments — not as a theoretical exercise, but using actual Kubernetes clusters and real tooling. Each agent in this experiment operated inside a sandboxed environment with access to Kubernetes MCP for live cluster inspection and GitHub MCP to analyze code changes and even create remediation pull requests.
The goal wasn’t simply to generate summaries, but to observe whether agents can behave like an actual on-call team: investigate an alert, validate each other’s findings, propose mitigation steps, and — when appropriate — submit a PR that teammates can review and approve.
What emerged is a surprisingly capable prototype of an agent-driven incident response workflow, where one agent investigates, another validates the findings, and a third reviews the mitigation plan and decides whether to escalate or authorize automated actions.
The Experiment
The setup is simple in concept. I split the work across three agents:
- Receiver → On-call engineer who receives the alert and executes the first investigation steps
- Reviewer 1 → First on-call teammate reviewing the investigation
- Reviewer 2 → Second on-call teammate providing the final assessment and making the decision
This mimics the dynamics of a real SRE team handling an alert: investigation, peer review, and final decision.

Experiment Goals
The key questions I want to answer are:
- Can autonomous agents meaningfully triage incidents end-to-end?
- Do they catch each other's mistakes, improving reliability compared to a single agent?
- Can they reduce alert fatigue by escalating only when it matters?
I expect that they can go surprisingly far — but the gaps will also be obvious once real-world complexity comes into play.
Technical Setup
To ensure complete isolation and safety, I run the agents inside Docker containers. Each agent gets full tool access within the container using --dangerously-skip-permissions, which is safe since the container is isolated from the host system. The agents run Claude in non-interactive mode with different CLAUDE.md prompt files that define their roles.
Agent Communication via Local Files
The agents communicate through a sequential file chain:
- Agent 1 (Receiver):
  - Role: On-call engineer conducting the initial investigation using the Kubernetes MCP server
  - Reads: alert.json (the original alert)
  - Writes: investigation-1.md with findings, root cause analysis, and mitigation recommendations
- Agent 2 (Reviewer 1):
  - Role: First teammate reviewing the investigation
  - Reads: alert.json and investigation-1.md
  - Writes: investigation-2.md with validation, corrections, and additional insights
- Agent 3 (Reviewer 2):
  - Role: Second teammate providing the final assessment
  - Reads: alert.json, investigation-1.md, and investigation-2.md
  - Writes: investigation-3.md with a consolidated mitigation plan and escalation decision
Each agent runs with its own CLAUDE.md file that defines their specific role:
- CLAUDE-receiver.md — defines the on-call engineer role
- CLAUDE-reviewer1.md — defines the first reviewer role
- CLAUDE-reviewer2.md — defines the second reviewer role
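To make this concrete, here is a minimal, illustrative sketch of what a receiver prompt could contain; it is not the exact prompt used in these experiments:
CLAUDE-receiver.md (illustrative sketch):
Role: On-Call Receiver
You are the on-call SRE who has just received the alert stored in alert.json.
1. Read alert.json and summarize the alert.
2. Investigate the affected workload with the Kubernetes MCP server
   (pod status, events, logs, recent deployments).
3. Write your findings, root cause analysis, and mitigation recommendations
   as Markdown. Your stdout will be saved as investigation-1.md.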
Container Setup
The containerized environment includes all necessary tools and authentication:
Dockerfile:
FROM node:20-slim
# Install dependencies
RUN apt-get update && apt-get install -y \
curl \
git \
uuid-runtime \
&& rm -rf /var/lib/apt/lists/*
# Install kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
&& chmod +x kubectl \
&& mv kubectl /usr/local/bin/
# Install gh CLI
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
&& chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
&& apt-get update \
&& apt-get install -y gh
# Install Claude Code
RUN npm install -g @anthropic-ai/claude-code
WORKDIR /workspace
CMD ["/bin/bash"]
docker-compose.yml:
version: '3.8'
services:
  sre-agent:
    build: .
    volumes:
      - ./:/workspace
      - ~/.kube:/root/.kube:ro                    # Kubernetes config
      - ~/.config/gh:/root/.config/gh:ro          # GitHub CLI auth
      - ~/.config/claude:/root/.config/claude:ro  # Claude auth
    network_mode: host  # Access to local k8s cluster
run-investigation.sh:
#!/bin/bash
# Run agent 1 with isolated session
cat CLAUDE-receiver.md | claude -p --session-id $(uuidgen) --dangerously-skip-permissions > investigation-1.md
# Run agent 2 with isolated session
cat CLAUDE-reviewer1.md | claude -p --session-id $(uuidgen) --dangerously-skip-permissions > investigation-2.md
# Run agent 3 with isolated session
cat CLAUDE-reviewer2.md | claude -p --session-id $(uuidgen) --dangerously-skip-permissions > investigation-3.md
Usage:
docker-compose build
docker-compose run sre-agent bash run-investigation.sh
This pattern ensures each agent:
- Has the full context of previous work through the sequential file chain
- Runs in complete isolation with unique session IDs
- Has safe, unrestricted tool access within the container
- Can access Kubernetes and GitHub with your local credentials
- Maintains a clear audit trail of the investigation process
The containerized approach provides flexibility to:
- Control information flow between agents through file-based communication
- Expand easily with new tools or checks without affecting the host system
- Reuse the structure for other workflows like CI/CD validation or automated rollbacks
- Run experiments safely without risking production systems
Experimental Results
⚠️ CRITICAL DISCLAIMER: This is a completely sandboxed experiment. All incidents, investigations, and remediation actions occurred in an isolated test environment with no production impact.
I conducted 7 documented experiments across 3 distinct incident types. All runs were successful in identifying root causes and proposing remediation actions.
Experiment 1: Init Container CrashLoopBackOff
Incident: Pod stuck in perpetual crash state for 8 days
Alert: KubePodNotReady for failing-init-demo-5999d7545c-sh4zk
Investigation:
- Agents identified a hardcoded exit 1 in the init container
- Verified 4,512 restarts over 16 days
- Proposed deletion as the primary mitigation
Unexpected Discovery: The final reviewer discovered this was an intentional test deployment - the very test case for the investigation system itself! The agents were investigating their own test data without realizing it.
Key Learning: Agents lack meta-awareness. They didn't question why a deployment named "failing-init-demo" existed or check project documentation. This revealed a critical gap: investigators need an explicit "context gathering" phase before diving into technical diagnosis.
Experiment 2: Ingress Misconfiguration (HTTP 404)
Incident: Application returning 404 errors for 11 hours
Root Cause: AWS ALB configured with HTTPS on port 444 instead of the standard 443
# Incorrect
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 444}]'
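The remediation, presumably, is the matching one-line change back to the standard port:
# Corrected
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'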
Investigation Quality:
- Agent 1: Correctly identified the root cause with definitive evidence
- Agent 2: Enhanced safety analysis, identified shared ALB risk affecting 5 other services
- Agent 3: Authorized auto-remediation with comprehensive monitoring requirements
Output: The final investigation (Agent 3) ran 871 lines of comprehensive incident analysis, including a mitigation plan, rollback procedures, monitoring requirements, and long-term prevention strategies (Kyverno policy recommendations).
Experiments 3-7: Database Configuration Error Series
Incident Type: Application crash due to a non-existent database
Root Cause: Database name changed to evershop-update-2025, which doesn't exist
These experiments tested different system capabilities:
Experiment 3: Investigation without GitHub MCP access
- Agents identified the crash but couldn't trace it to the source code changes
- Demonstrated the critical importance of repository access
Experiment 4: JSONL output format testing
- Revealed excessive verbosity (large, hard-to-parse files)
- Led to format simplification
Experiment 5: Full investigation with all access
- Successfully traced the issue to commit 39920dac, made 11 minutes before the alert
- Created a remediation PR
- 99% confidence in root cause
Timeline correlation:
09:14:34 UTC - Commit merged
09:24:26 UTC - Deployment with bad config
09:25:25 UTC - Alert fires
6+ hours - Service down, 75 pod restarts
Experiment 6: Action logging improvement
- The investigation was correct, but lacked documentation of HOW agents investigated
- Led to prompt changes requiring explicit "Investigation Actions" sections (sketched after this list)
- Now, agents document: "Ran PostgreSQL pod via Kubernetes MCP to verify databases" instead of just "Database verified."
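A rough sketch of the kind of prompt requirement this introduced (illustrative wording, not the exact text in the repository):
Investigation Actions (required)
List every command, MCP call, or query you ran, in order, for example:
- "Listed databases on the PostgreSQL pod via the Kubernetes MCP server"
Do not state a conclusion without the action that produced its evidence.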
Experiment 7: Silent failure case
- Critical reliability issue: Agent 2 (Reviewer 1) produced only a newline, no investigation output
- Agent 1 successfully created PR #6 and wrote a complete investigation (23 lines)
- Agent 2 failed silently - output file contained only 1 byte (newline character)
- Agent 3 ran automatically, but only had Agent 1's output available (Agent 2's was empty)
- Agent 3 still correctly reviewed PR #6 and provided a valid assessment
- A 4th agent was run manually to verify the investigation chain
- Root cause: Agent 2 executed but produced no text output (silent failure)
- Impact: Subsequent agents in the pipeline lacked intermediate review, though the final result was still valid
- Lessons: Identified need for output validation, real-time visibility, and fail-fast behavior
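A minimal fail-fast guard for the orchestration script could look like the sketch below; it is not part of the current run-investigation.sh, and the 100-byte threshold is an arbitrary proxy for "the agent actually wrote something":
run_agent() {
  local prompt_file="$1"
  local output_file="$2"
  cat "$prompt_file" | claude -p --session-id "$(uuidgen)" \
    --dangerously-skip-permissions > "$output_file"
  # Abort the chain if the agent produced (almost) nothing, as in Experiment 7
  if [ "$(wc -c < "$output_file")" -lt 100 ]; then
    echo "ERROR: $output_file is empty or near-empty; stopping the chain" >&2
    exit 1
  fi
}

run_agent CLAUDE-receiver.md  investigation-1.md
run_agent CLAUDE-reviewer1.md investigation-2.md
run_agent CLAUDE-reviewer2.md investigation-3.md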
System Evolution
Through these experiments, the system evolved significantly:
Output Format Evolution
Initial (Experiments 1-4): 7-section verbose format
1. Incident Summary
2. Investigation Findings
3. Root Cause
4. Immediate Mitigation
5. Long-term Prevention
6. Monitoring Plan
7. Pull Requests Links
Current (Experiments 6-7): Concise, action-focused format
1. Investigation Actions (tools/methods used)
2. Root Cause (with file:line references)
3. Immediate Actions (critical steps only)
4. PR Links
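For illustration, a final report in the current format might look like this skeleton (hypothetical content, loosely modeled on Experiment 5; the file reference is made up):
1. Investigation Actions
   - Queried pod status and logs via the Kubernetes MCP server
   - Listed recent commits and diffs via the GitHub MCP server
2. Root Cause
   - config/default.json:42 (hypothetical reference) points at a database that does not exist
3. Immediate Actions
   - Revert the database name change and redeploy
4. PR Links
   - Link to the remediation PR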
Current State: The sandbox script remains deliberately simple for experimentation; a production implementation would need the safety features discussed in the next section.
Key Findings
Investigation Strengths ✅
- Root Cause Identification: 100% success rate across all experiments
- Tool Proficiency: Excellent use of Kubernetes and GitHub APIs
- Multi-Agent Value: Peer review consistently improved quality and safety
- Autonomous Remediation: Successfully created appropriate PRs with fixes
- Evidence-Based Analysis: Strong technical reasoning with specific code references
Critical Issues ⚠️
- Silent Failures: Agents can complete work without producing the required outputs
- Meta-Context Blindness: Don't question the investigation's purpose or check documentation
- Verbosity Control: Initial formats consumed excessive tokens
- Process Transparency: Agents didn't document their investigation methods
Production Readiness Gap
This is a sandbox experiment. Significant engineering work remains for production deployment:
Missing Components
Security:
- No authentication/authorization layer
- No audit logging
- Kubernetes access is cluster-admin (overly permissive)
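One way to narrow this is a dedicated read-only identity for the agents; a rough kubectl sketch (the namespace, account, and role names are made up for illustration):
kubectl create namespace sre-agents
kubectl create serviceaccount sre-agent-readonly -n sre-agents
kubectl create clusterrole sre-agent-readonly \
  --verb=get,list,watch \
  --resource=pods,pods/log,events,deployments,replicasets,services,ingresses
kubectl create clusterrolebinding sre-agent-readonly \
  --clusterrole=sre-agent-readonly \
  --serviceaccount=sre-agents:sre-agent-readonly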
Safety:
- No dry-run mode (a minimal gate is sketched after this list)
- No blast radius analysis
- No automated rollback
- No rate limiting or circuit breakers
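For the dry-run gap, even a simple environment-variable gate in front of any mutating step would help; AUTO_REMEDIATE below is a made-up variable name, not something the current scripts implement:
# Skip PR creation and any kubectl mutations unless explicitly enabled
if [ "${AUTO_REMEDIATE:-false}" != "true" ]; then
  echo "Dry-run mode: investigation only, no PRs or cluster changes"
  exit 0
fi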
Integration:
- No PagerDuty/incident management integration
- No Slack notifications
- No metrics/observability for agent performance
- No SLA tracking
Reliability:
- No retry logic for transient failures (see the sketch after this list)
- No concurrency control (parallel alerts)
- No queue management
- No cost controls (unlimited API calls)
- No timeout handling
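A minimal wrapper covering the retry and timeout gaps might look like this; the 15-minute budget and three attempts are arbitrary choices, not measured values:
run_with_retry() {
  local prompt_file="$1" output_file="$2" attempt
  for attempt in 1 2 3; do
    # Bound each agent run to 15 minutes; retry on failure or timeout
    if timeout 15m claude -p --session-id "$(uuidgen)" \
         --dangerously-skip-permissions < "$prompt_file" > "$output_file"; then
      return 0
    fi
    echo "Attempt $attempt for $prompt_file failed; retrying in 30s" >&2
    sleep 30
  done
  echo "ERROR: $prompt_file failed after 3 attempts" >&2
  return 1
}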
Operations:
- No runbooks for agent failures
- No monitoring of agent health
- No disaster recovery plan
- No multi-tenancy support
Cost Analysis (Estimated)
Per-Investigation Estimate (using Claude Sonnet-class pricing):
- Simple incidents: ~50-80K tokens total across 3 agents
- Complex incidents: ~100-150K tokens total across 3 agents
- Actual cost varies by model (Haiku: $0.10-0.30, Sonnet: $0.50-2.00, Opus: $5-15 per investigation)
Sandbox Experience:
- This experiment used Claude Sonnet (mid-tier pricing)
- Token usage averaged 40-100K per full investigation cycle
- Cost per investigation: Estimated $1-5 (sandbox scale)
At Scale Estimate:
- 100 high-priority alerts/day with automated investigation
- Using hybrid model (Haiku for Agent 1, Sonnet for Agent 3)
- Estimated: $10-50K/month depending on alert volume and model selection
- Critical: Aggressive filtering needed to investigate only high-value alerts
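A crude version of that filtering can sit in front of the orchestration script; the sketch below assumes jq is available in the container and that the Prometheus alert payload carries a severity label:
# Only trigger automated investigation for critical alerts
severity=$(jq -r '.labels.severity // "none"' alert.json)
if [ "$severity" != "critical" ]; then
  echo "Skipping automated investigation for severity=$severity"
  exit 0
fi
bash run-investigation.sh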
Timeline to Production
Estimated effort: 6-12 months with a dedicated team
- Months 1-2: Security hardening, authentication, authorization
- Months 3-4: Production integrations (PagerDuty, Slack, metrics)
- Months 5-6: Reliability engineering (retries, timeouts, queuing)
- Months 7-8: Cost controls and optimization
- Months 9-10: Multi-tenancy and scale testing
- Months 11-12: Beta program with read-only investigations
Statistical Summary
| Metric | Value |
|---|---|
| Total Experiments | 7 |
| Unique Incident Types | 3 |
| Root Cause Success Rate | 100% |
| PRs Created by Agents | 3+ (verified: #2, #5, #6) |
| Total Investigation Output | 4,333 lines |
| Avg Investigation Time | 2-6 hours |
| Silent Failures Discovered | 1 |
| Format Iterations | 3 |
Conclusion
This experiment successfully demonstrates that multi-agent AI systems can autonomously:
- ✅ Investigate Kubernetes incidents with real cluster access
- ✅ Identify root causes with 99-100% confidence
- ✅ Create pull requests with appropriate fixes
- ✅ Provide valuable peer review and safety validation
- ✅ Make escalation decisions
However, this is nowhere near production-ready.
The experiment validates technical feasibility but reveals the productionization gap is substantial. The agents are excellent technical investigators but lack the safety controls, reliability engineering, and operational maturity required for production incident response.
Appendix: Experiment Files
All 7 experiments are documented in snapshots/:
snapshots/
├── 1-init-container-crashloop/ # Intentional test case discovery
├── 2-wrong-port-ingress-alb/ # HTTP 404 due to port 444
├── 3-wrong-db-missing-mcp/ # Investigation without GitHub access
├── 4-wrong-db-jsonl-missing-mcp/ # Format testing
├── 5-wrong-db-JSONL/ # Full successful investigation
├── 6-good-but-not-enough-actions-logs/ # Action logging improvement
└── 7-missing-review1-text-outputs/ # Silent failure case
Each directory contains:
- alert.json - Original Prometheus alert
- investigation-{1,2,3}.md - Agent outputs
- Supporting files (backups, PR references)
GitHub Repository URL: github.com/opsworker-ai/agentic-sre-investigations-poc