# Investigate Production Incidents
## Scenario
It's 3 AM. A critical alert fires — CrashLoopBackOff on the payment service pod. The on-call engineer's phone buzzes.
Without OpsWorker: The engineer wakes up, opens their laptop, connects to VPN, runs kubectl to check pod status, reads logs, checks events, inspects the deployment config, traces service dependencies, and finally understands the issue. 45 minutes later, they know the root cause and can start fixing it.
With OpsWorker: The alert fires, OpsWorker investigates automatically, and a Slack message arrives within 2 minutes with the root cause and fix commands. The engineer reviews the summary and applies the fix.
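The manual triage described above looks roughly like the following shell session. The `payment-api` deployment name, the `production` namespace, and the `app=payment-api` label are this scenario's illustrative names, and `<payment-pod>` is a placeholder for the crashing pod:

```shell
# Manual 3 AM triage sketch; names are illustrative, not prescriptive
kubectl get pods -n production -l app=payment-api           # spot the CrashLoopBackOff
kubectl logs <payment-pod> -n production --previous --tail=50   # logs from the last crashed container
kubectl describe pod <payment-pod> -n production            # events: OOMKilled, restart count
kubectl get deployment payment-api -n production -o yaml    # inspect resource limits
kubectl get events -n production --sort-by=.lastTimestamp   # recent cluster events
```

Each command is a separate round trip over VPN at 3 AM, which is where most of the 45 minutes goes.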
## How OpsWorker Helps
- Alert arrives from Prometheus AlertManager
- OpsWorker investigates automatically — no human trigger needed
- Topology discovery finds the payment pod, its deployment, the related service, and the ingress
- Data collection gathers logs (OOM pattern in the last 50 lines), events (OOMKilled restart), and deployment config (256Mi memory limit)
- AI analysis identifies: "OOMKilled — memory limit of 256Mi exceeded. Logs show connection pool growing unbounded starting at 02:15 UTC."
- Recommendations delivered to Slack:
  - Immediate: `kubectl set resources deployment/payment-api -n production --limits=memory=512Mi`
  - Preventive: Investigate connection pool leak in the application config
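Before applying the suggested limit bump, the engineer can verify OpsWorker's finding with two quick checks. The names again follow this scenario, and the `app=payment-api` label selector is an assumption:

```shell
# Confirm the termination reason OpsWorker reported (expected: OOMKilled)
kubectl get pods -n production -l app=payment-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# Check the current memory limit (expected: 256Mi) before raising it to 512Mi
kubectl get deployment payment-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```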
## Outcome
- MTTR reduced from 45 minutes to under 5 minutes (2 min investigation + engineer review)
- Root cause identified — not just "pod crashed" but "OOM from connection pool leak"
- Actionable fix — specific kubectl command, not generic advice
- Engineer stays informed — they can chat with the investigation to pull in more detail