
Root Cause Analysis

Overview

OpsWorker identifies the underlying cause of an alert — not just the symptom. By correlating data across multiple Kubernetes resources, the AI distinguishes between surface-level indicators and the actual root cause.

How Root Cause Analysis Works

Beyond the Alerting Resource

When a pod crashes, the problem may not be in the pod itself. OpsWorker's topology discovery traces dependencies to find where the issue originates:

| Alert | Surface Symptom | Root Cause Found By OpsWorker |
|---|---|---|
| Pod CrashLoopBackOff | Pod keeps restarting | OOM killed: memory limit 256Mi but app needs 512Mi |
| Service 503 errors | Service returning errors | Deployment selector mismatch: pods not matching service selector after label change |
| Ingress timeout | Ingress health check failing | Backend service has 0 healthy endpoints: all pods in ImagePullBackOff |
| High CPU alert | CPU utilization 95% | HPA maxReplicas reached: manual scaling or limit increase needed |
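
The idea of tracing dependencies from the alerting resource can be illustrated with a minimal sketch. This is not OpsWorker's implementation. The function name `root_cause_candidates`, the resource names, and the `depends_on` / `unhealthy` inputs are all hypothetical, chosen to mirror the "Ingress timeout" case: starting at the alerting resource, walk downstream and keep only unhealthy resources that have no unhealthy dependency of their own.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    start: str,
    depends_on: Dict[str, List[str]],
    unhealthy: Set[str],
) -> List[str]:
    """Walk downstream from the alerting resource and return the unhealthy
    resources that have no unhealthy dependency of their own."""
    seen: Set[str] = set()
    stack = [start]
    candidates = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deps = depends_on.get(node, [])
        stack.extend(deps)
        # An unhealthy node whose dependencies are all healthy is where the
        # failure chain bottoms out.
        if node in unhealthy and not any(d in unhealthy for d in deps):
            candidates.append(node)
    return sorted(candidates)


# Hypothetical topology: ingress -> service -> pods (all names are made up).
depends_on = {
    "ingress/web": ["service/web"],
    "service/web": ["pod/web-1", "pod/web-2"],
}
# Assumed health state: the alert fired on the ingress, but everything
# downstream is unhealthy too.
unhealthy = {"ingress/web", "service/web", "pod/web-1", "pod/web-2"}

candidates = root_cause_candidates("ingress/web", depends_on, unhealthy)
print(candidates)  # ['pod/web-1', 'pod/web-2']
```

In this sketch the ingress and the service are reported only as symptoms, because each has an unhealthy dependency beneath it. The pods are the candidates worth inspecting first.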

Correlation Across Resources

OpsWorker correlates signals from multiple sources:

  • Logs — Application errors, stack traces, connection failures
  • Events — Kubernetes events (scheduling, OOM kills, image pulls, probe failures)
  • Configuration — Resource limits, selectors, environment variables
  • Topology — How resources are connected (pod → service → ingress chain)
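
As a rough illustration of correlating these four sources, the sketch below groups signals by the resource they refer to and ranks resources by how many distinct sources mention them. The `Signal` type, the `correlate` function, and the ranking rule are assumptions made for this example, not a description of how OpsWorker weighs evidence.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Signal:
    source: str    # "logs", "events", "configuration", or "topology"
    resource: str  # hypothetical identifier, e.g. "pod/api-1"
    detail: str


def correlate(signals: List[Signal]) -> List[Tuple[str, List[Signal]]]:
    """Group signals by resource, then order resources so that those
    mentioned by the most distinct sources come first."""
    by_resource: Dict[str, List[Signal]] = defaultdict(list)
    for signal in signals:
        by_resource[signal.resource].append(signal)
    return sorted(
        by_resource.items(),
        key=lambda item: (-len({s.source for s in item[1]}), item[0]),
    )


# Made-up signals resembling the CrashLoopBackOff example.
signals = [
    Signal("events", "pod/api-1", "container terminated: OOMKilled"),
    Signal("configuration", "pod/api-1", "memory limit 256Mi"),
    Signal("logs", "pod/api-1", "java.lang.OutOfMemoryError"),
    Signal("topology", "service/api", "routes traffic to pod/api-1"),
]

ranked = correlate(signals)
top_resource, top_signals = ranked[0]
print(top_resource, len(top_signals))  # pod/api-1 3
```

Counting distinct sources is only one possible heuristic. It is used here because agreement between logs, events, and configuration about the same resource is easy to show in a few lines.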

Issue Classification

Each investigation classifies the issue as one of the following:

  • Configuration issue — Misaligned selectors, incorrect resource limits, missing environment variables
  • Runtime issue — Application crash, memory leak, external dependency failure

This classification guides the type of recommendations generated.
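
One configuration issue named above, a misaligned selector, can be checked mechanically. The sketch below applies the Kubernetes rule for equality-based Service selectors: a pod is selected only if its labels contain every key/value pair in the selector. The function names and sample labels are hypothetical, and the code works on plain dictionaries rather than calling the Kubernetes API.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(selector: Labels, labels: Labels) -> bool:
    """True when every selector key/value pair is present in the labels.

    An empty selector is treated as matching nothing, since a Service
    without a selector does not get endpoints created for it automatically.
    """
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def pods_selected(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    """Names of the pods whose labels satisfy the selector."""
    return sorted(
        name for name, labels in pods.items() if selector_matches(selector, labels)
    )


# Made-up state after a label change: the Service still selects the old value.
service_selector = {"app": "checkout"}
pods = {
    "checkout-6c9f-x2k": {"app": "checkout-v2", "tier": "web"},
    "checkout-6c9f-p8q": {"app": "checkout-v2", "tier": "web"},
}

selected = pods_selected(service_selector, pods)
if not selected:
    print("configuration issue: Service selector matches no pods")
```

A Service whose selector matches no pods has no endpoints, which is consistent with the "Service 503 errors" example earlier on this page.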

Confidence Levels

OpsWorker provides a confidence level with each root cause analysis, indicating how certain the AI is about its findings. Higher confidence comes from:

  • Clear error signals in logs or events
  • Direct correlation between alert and observed behavior
  • Configuration issues that can be definitively identified

What You Get

Each root cause analysis includes:

  1. Root cause statement — A clear explanation of what went wrong and why
  2. Evidence — The specific data points that support the conclusion (log lines, events, config values)
  3. Affected resources — All resources involved in the issue, not just the alerting one
  4. Confidence level — How certain the AI is about the diagnosis
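
For readers who want to carry these four parts through their own tooling, they could be modeled as a small record. This is a sketch under stated assumptions: the `RootCauseReport` class, its field names, and the `"high"` confidence value are illustrative and do not reflect OpsWorker's actual output schema or confidence scale.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RootCauseReport:
    # 1. Root cause statement: what went wrong and why
    root_cause: str
    # 2. Evidence: log lines, events, config values
    evidence: List[str] = field(default_factory=list)
    # 3. Affected resources: everything involved, not just the alerting one
    affected_resources: List[str] = field(default_factory=list)
    # 4. Confidence level (assumption: the source does not specify the scale)
    confidence: str = "unknown"


# Made-up report based on the CrashLoopBackOff example from this page.
report = RootCauseReport(
    root_cause="OOM killed: memory limit 256Mi but app needs 512Mi",
    evidence=[
        "event: container terminated with reason OOMKilled",
        "config: memory limit 256Mi",
    ],
    affected_resources=["pod/api-1", "deployment/api"],
    confidence="high",
)

print(report.root_cause)
```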

Next Steps