Last week, we shared a walkthrough of the OpsWorker.ai onboarding experience — connecting clusters, configuring alerting, and integrating Slack so teams can see investigation results right inside their natural workflow.

This week, we’re shifting from setup to real incident analysis.
The goal: show how OpsWorker acts as an AI SRE Agent, diagnosing real-world failures that engineers deal with every day. No synthetic examples, no artificial logs — just two classic failure modes that happen constantly in cloud-native systems.

These simulations highlight how OpsWorker reduces MTTR, cuts down on on-call fatigue, and minimizes the cross-team back-and-forth that usually slows incident response.


Alert Rules & Setup Before the Simulation

Before diving into the incidents, we configured OpsWorker to automatically investigate only alerts coming from the evershop namespace — a Node.js–based shop application using AWS RDS.
This ensures the system stays focused on what matters instead of burning cycles on noisy development alerts.
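
If you want to confirm which alerts are actually scoped to that namespace before OpsWorker picks them up, a quick query against Alertmanager's v2 API works as a sanity check. The service name and monitoring namespace below are assumptions based on a typical kube-prometheus-stack install.

# Port-forward Alertmanager (service name assumes a kube-prometheus-stack setup)
kubectl -n monitoring port-forward svc/alertmanager-operated 9093:9093 &

# List only the alerts labeled with the evershop namespace (needs jq)
curl -s 'http://localhost:9093/api/v2/alerts?filter=namespace%3D%22evershop%22' | jq -r '.[].labels.alertname'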

Once the alert rules were in place, we moved to the first simulation.


Incident #1 — CPU Saturation & HPA Maxing Out

To recreate a real production outage, we artificially pushed CPU usage high enough to hit the HPA ceiling.
We intentionally set the max replicas to 2, making it easier to reach resource exhaustion quickly. Running a CPU-filling request (fill-cpu) pushed the service over the edge — and the evershop application dropped.

Alerts:

  • KubeHpaMaxedOut
  • CPUThrottlingHigh (may also fire if we run /api/fill-cpu?c=100)


helm upgrade --install evershop chart/ \
  --set dns.prefix="evershop" \
  --set global.resources.limits.cpu="200m" \
  --namespace evershop --create-namespace


# wait ~2 minutes for CPU usage to normalize before creating the HPA
kubectl autoscale deployment evershop --cpu-percent=5 --min=1 --max=2 -n evershop

Since the container is limited to 200m (millicores), setting the threshold at 5% (10m) ensures the HPA reacts as soon as usage hits ~10m (usual usage is closer to 30m).
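
To sanity-check that math against live usage, compare actual pod consumption with the 200m limit (metrics-server is required, but the HPA already depends on it):

# Current CPU usage of the evershop pods
kubectl top pod -n evershop

# Confirm the 200m limit actually landed on the deployment
kubectl get deploy evershop -n evershop -o jsonpath='{.spec.template.spec.containers[0].resources.limits.cpu}'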

Step 2: If you still don't see the KubeHpaMaxedOut alert

# Utilizes approximately 100 millicores (0.1 CPU) using a timed busy loop.
curl https://evershop.tk8s.opsworker.ai/api/fill-cpu?c=100

The /api/fill-cpu endpoint in evershop is custom code added by the OpsWorker team.
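
If a single request isn't enough to keep utilization above the threshold, a small loop can hold the load until the HPA reacts. A rough sketch using the same endpoint:

# Keep roughly 100 millicores of load on the service until interrupted (Ctrl+C)
while true; do
  curl -s "https://evershop.tk8s.opsworker.ai/api/fill-cpu?c=100" > /dev/null
  sleep 5
done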

How to check manually

  kubectl get hpa -n evershop
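
For more detail on why scaling stopped, kubectl describe shows the HPA's conditions; once the 2-replica ceiling is hit you should see ScalingLimited reported:

# Shows current/target utilization plus conditions such as ScalingLimited
kubectl describe hpa evershop -n evershop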

A few minutes later, Alertmanager fired.


OpsWorker received the alert and, because it matched our rules, automatically launched an investigation.

What OpsWorker Found

The AI SRE Agent traced the alert back to the deployment and highlighted the critical signals:

Error: HPA maxed out for 15+ minutes, indicating sustained high load or incorrect scaling metrics.

Root Cause: Application pods are likely experiencing resource exhaustion (OOMKilled/CPU throttling) or critical runtime errors causing them to fail at serving requests, triggering continuous HPA scale-up without resolving the underlying performance bottleneck.

Recommendation: Immediately check evershop pod resource usage, restart counts, and exit codes to identify CPU throttling or crashes preventing pods from effectively handling traffic. Increase the HPA max replicas from 2 to 10.

Reasoning: HPA sustained at max replicas indicates pods cannot handle the load despite scaling. Most probable causes, in priority order: pods hitting memory limits (exit 137), CPU throttling, or crashing with high restart counts. This creates a cycle where new pods also fail, keeping the HPA maxed out. The log fetch targeted the wrong pod (metrics exporter vs. application), but the pattern strongly suggests resource exhaustion preventing the application from serving traffic effectively even as replicas increase.

Root Cause Identified

The workload hit CPU saturation, and HPA reached the configured max replica limit.
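
Acting on the recommendation is then a one-liner, for example raising the replica ceiling with a patch (adjusting the Helm values would work just as well):

# Raise the HPA ceiling from 2 to 10 replicas, as recommended
kubectl patch hpa evershop -n evershop --type merge -p '{"spec":{"maxReplicas":10}}'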

Incident #2 — Memory Leak, OOM Crash & Liveness Probe Failures

Next, we simulated a memory leak that forces the pod into OOMKill, followed by liveness probe failures and cascading 503 errors — a problem many teams have had to chase manually.

After triggering the scenario, evershop went down again.
Alertmanager fired almost immediately with 503 failures, and OpsWorker started its investigation.

Alerts:

  • CrashLoopBackOff
  • Container OOMKilled (if you have this alert configured, it may fire as well)
  • Blackbox Exporter: endpoint unreachable, 502 and 503 error codes

Steps to simulate:


#!/bin/bash
# Trigger the memory-leak endpoint, wait for the pod to recover,
# then trigger it again so the container keeps getting OOM-killed.
DNS_PREFIX="evershop"
ITERATIONS=5
URL="https://${DNS_PREFIX}.tk8s.opsworker.ai"
for ((i=1;i<=ITERATIONS;i++)); do
  curl -s "$URL/api/oom-memory-kill"
  sleep 10
  # Re-trigger the leak only if the app is responding again (HTTP 200)
  [ "$(curl -s -o /dev/null -w "%{http_code}" "$URL")" = "200" ] && curl -s "$URL/api/oom-memory-kill"
done

helm upgrade --install evershop chart/ \
  --set dns.prefix="evershop" \
  --namespace evershop --create-namespace
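
While the script runs, the OOM kills are easy to confirm by hand; the jsonpath below assumes a single container per pod:

# Watch restarts accumulate
kubectl get pods -n evershop -w

# Last termination reason and exit code per pod (expect OOMKilled / 137)
kubectl get pods -n evershop -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{" ("}{.status.containerStatuses[0].lastState.terminated.exitCode}{")"}{"\n"}{end}'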

What OpsWorker Detected

The AI SRE Agent pulled together signals from multiple layers:

Error: Container repeatedly killed with exit code 137 after ~50 seconds of runtime (12 restarts)

Root Cause: Out of Memory (OOM) - container exceeds the 512Mi memory limit. The /api/ready endpoint (configured with a 2-second timeout) consistently fails, and the application cannot respond within the timeout window

Recommendation: Increase the memory limit to at least 1Gi and the request to 512Mi, then monitor actual usage. Improve the liveness probe thresholds

Reasoning: Exit code 137 (SIGKILL) with a high restart count and short runtime (~50s) is the classic OOM pattern; the container likely spikes memory during startup before being killed by the kernel's OOM killer

Root Cause Identified

A memory leak triggered OOMKill, causing pod restarts and repeated readiness probe failures.
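
If the chart exposes memory the same way it exposes the CPU limit used earlier, applying the recommendation could look like the following; the global.resources.* value names are an assumption based on the earlier --set flag.

# Bump the memory limit to 1Gi and the request to 512Mi, as recommended
helm upgrade --install evershop chart/ \
  --set dns.prefix="evershop" \
  --set global.resources.limits.memory="1Gi" \
  --set global.resources.requests.memory="512Mi" \
  --namespace evershop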


Why This Matters for MTTR, On-Call Load & Cross-Team Handoffs

These two examples highlight the real value of an AI SRE Agent:

  • MTTR drops dramatically, because the system correlates signals instantly
  • SREs get answers, not dashboards
  • Developers ("you build it, you run it") get clarity, not guesswork
  • Cross-team communication shrinks, since OpsWorker brings everything into one place
  • On-call becomes more predictable, because the agent handles early-stage triage

This is what it looks like when AI starts taking over the ops work nobody wants to do in the first place.


What’s Next

Next week, we’ll go even deeper.

We’re publishing an article exploring Agent-Driven SRE Investigations — where instead of using a single assistant, we simulate a coordinated team of AI agents working together like a real SRE team.

This is built on top of:

  • Kubernetes MCP
  • GitHub MCP

We’ll share what works, what breaks, and what it takes to design multi-agent workflows that go beyond reaction and move toward true collaborative root cause resolution.

Stay tuned — and as always, we welcome your feedback.
OpsWorker is improving every week thanks to it.
