Root Cause Analysis
Overview
OpsWorker identifies the underlying cause of an alert — not just the symptom. By correlating data across multiple Kubernetes resources, the AI distinguishes between surface-level indicators and the actual root cause.
How Root Cause Analysis Works
Beyond the Alerting Resource
When a pod crashes, the problem may not be in the pod itself. OpsWorker's topology discovery traces dependencies to find where the issue originates:
| Alert | Surface Symptom | Root Cause Found By OpsWorker |
|---|---|---|
| Pod CrashLoopBackOff | Pod keeps restarting | Container OOMKilled: memory limit is 256Mi but the app needs 512Mi |
| Service 503 errors | Service returning errors | Selector mismatch: pod labels no longer match the Service selector after a label change |
| Ingress timeout | Ingress health check failing | Backend Service has 0 healthy endpoints: all pods are in ImagePullBackOff |
| High CPU alert | CPU utilization at 95% | HPA has reached maxReplicas: manual scaling or a limit increase is needed |
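The Service 503 row can be made concrete with a small check. The following is a minimal Python sketch, not OpsWorker's implementation: it assumes the Service selector and the pod labels have already been fetched as plain dictionaries, and the function names `selector_matches` and `matching_pods` are hypothetical.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True when every key/value pair of the Service selector appears in the pod labels."""
    # A Service without a selector gets no automatically managed endpoints,
    # so this sketch treats an empty selector as matching nothing.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: dict, pods: dict) -> list:
    """Return the names of the pods whose labels satisfy the selector.

    `pods` maps a pod name to its label dictionary (an assumed input shape).
    """
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Example: one pod was relabelled from app=web to app=web-v2.
pods = {
    "web-1": {"app": "web", "tier": "frontend"},
    "web-2": {"app": "web-v2", "tier": "frontend"},
}
healthy = matching_pods({"app": "web"}, pods)
```

An empty result for a Service that is returning 503s points at the selector or the labels rather than at the application inside the pods.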
Correlation Across Resources
OpsWorker correlates signals from multiple sources:
- Logs — Application errors, stack traces, connection failures
- Events — Kubernetes events (scheduling, OOM kills, image pulls, probe failures)
- Configuration — Resource limits, selectors, environment variables
- Topology — How resources are connected (pod → service → ingress chain)
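One way to picture this correlation is a walk along the topology chain that gathers whatever signals exist for each resource on the way. The sketch below is illustrative only and makes several assumptions: the topology is given as a `depends_on` mapping, signals are pre-collected strings keyed by resource name, and `collect_signals` is a hypothetical helper rather than an OpsWorker API.

```python
from collections import deque


def collect_signals(start: str, depends_on: dict, signals: dict) -> list:
    """Walk breadth-first from the alerting resource through its dependencies
    and return (resource, signal) pairs in the order they are visited."""
    seen = {start}
    queue = deque([start])
    collected = []
    while queue:
        resource = queue.popleft()
        for signal in signals.get(resource, []):
            collected.append((resource, signal))
        for dependency in depends_on.get(resource, []):
            if dependency not in seen:  # guard against cycles in the graph
                seen.add(dependency)
                queue.append(dependency)
    return collected


# Hypothetical chain: ingress -> service -> pod.
depends_on = {
    "ingress/web": ["service/web"],
    "service/web": ["pod/web-1"],
}
signals = {
    "service/web": ["config: 0 ready endpoints"],
    "pod/web-1": ["event: ImagePullBackOff"],
}
trail = collect_signals("ingress/web", depends_on, signals)
```

Here the alert fires on the ingress, but the collected trail ends at the pod's image pull failure, which is the kind of cross-resource link described above.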
Issue Classification
Each investigation classifies the issue as one of two types:
- Configuration issue — Misaligned selectors, incorrect resource limits, missing environment variables
- Runtime issue — Application crash, memory leak, external dependency failure
This classification guides the type of recommendations generated.
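To illustrate the two categories, here is a deliberately simple Python sketch. It is not how OpsWorker classifies issues (that is done by the AI during the investigation); the `IssueClass` enum, the keyword lists, and the `classify` function are all hypothetical and only echo the examples listed above.

```python
from enum import Enum
from typing import Optional


class IssueClass(Enum):
    CONFIGURATION = "configuration"
    RUNTIME = "runtime"


# Hypothetical keyword lists taken from the example causes in this section.
CONFIG_KEYWORDS = ("selector", "resource limit", "environment variable")
RUNTIME_KEYWORDS = ("crash", "memory leak", "dependency failure")


def classify(finding: str) -> Optional[IssueClass]:
    """Map a free-text finding to an issue class, or None when no keyword matches."""
    text = finding.lower()
    if any(keyword in text for keyword in CONFIG_KEYWORDS):
        return IssueClass.CONFIGURATION
    if any(keyword in text for keyword in RUNTIME_KEYWORDS):
        return IssueClass.RUNTIME
    return None


config_result = classify("Service selector does not match pod labels")
runtime_result = classify("Application crash on startup")
```

A real classifier would weigh evidence instead of matching keywords; the point here is only that the category, once known, can drive which kind of recommendation is produced.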
Confidence Levels
OpsWorker provides a confidence level with each root cause analysis, indicating how certain the AI is about its findings. Higher confidence comes from:
- Clear error signals in logs or events
- Direct correlation between alert and observed behavior
- Configuration issues that can be definitively identified
What You Get
Each root cause analysis includes:
- Root cause statement — A clear explanation of what went wrong and why
- Evidence — The specific data points that support the conclusion (log lines, events, config values)
- Affected resources — All resources involved in the issue, not just the alerting one
- Confidence level — How certain the AI is about the diagnosis
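The four parts of an analysis map naturally onto a small record type. The following Python sketch shows one possible shape; the class names, field names, and the string confidence values are assumptions made for illustration and do not describe OpsWorker's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str    # assumed values: "log", "event", or "config"
    resource: str  # the resource the data point was taken from
    detail: str    # the log line, event message, or config value


@dataclass
class RootCauseAnalysis:
    root_cause: str
    evidence: list = field(default_factory=list)
    affected_resources: list = field(default_factory=list)
    confidence: str = "unknown"  # the scale is an assumption, not documented here

    def summary(self) -> str:
        """Return a one-line, human-readable summary of the analysis."""
        return (
            f"{self.root_cause} (confidence: {self.confidence}, "
            f"{len(self.evidence)} evidence items, "
            f"{len(self.affected_resources)} affected resources)"
        )


# Example based on the CrashLoopBackOff row of the table earlier in this page.
analysis = RootCauseAnalysis(
    root_cause="Container OOMKilled: memory limit is 256Mi but the app needs 512Mi",
    evidence=[
        Evidence("event", "pod/web-1", "OOMKilled"),
        Evidence("config", "deployment/web", "resources.limits.memory=256Mi"),
    ],
    affected_resources=["pod/web-1", "deployment/web"],
    confidence="high",
)
line = analysis.summary()
```

Keeping evidence as structured entries rather than free text makes it easy to show which log line, event, or config value supports the stated root cause.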
Next Steps
- Suggested Fixes — Remediation steps generated from root cause
- Command Generation — kubectl commands for fixes