Root Cause Analysis

Overview

OpsWorker identifies the underlying cause of an alert — not just the symptom. By correlating data across multiple Kubernetes resources, the AI distinguishes between surface-level indicators and the actual root cause.

How Root Cause Analysis Works

Beyond the Alerting Resource

When a pod crashes, the problem may not be in the pod itself. OpsWorker's topology discovery traces dependencies to find where the issue originates:

Alert	Surface Symptom	Root Cause Found By OpsWorker
Pod CrashLoopBackOff	Pod keeps restarting	OOM killed — memory limit 256Mi but app needs 512Mi
Service 503 errors	Service returning errors	Deployment selector mismatch — pods not matching service selector after label change
Ingress timeout	Ingress health check failing	Backend service has 0 healthy endpoints — all pods in ImagePullBackOff
High CPU alert	CPU utilization 95%	HPA maxReplicas reached — manual scaling or limit increase needed
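The selector-mismatch and zero-healthy-endpoints rows above can be illustrated with a minimal sketch of walking from a service to its pods. This is not OpsWorker's implementation: the `Pod` and `Service` classes, `selects`, and `trace_service` are hypothetical names, and the sketch assumes simple equality-based label selectors.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict
    status: str = "Running"  # e.g. "Running", "ImagePullBackOff"


@dataclass
class Service:
    name: str
    selector: dict


def selects(selector: dict, labels: dict) -> bool:
    """Equality-based match: every selector key/value must appear in the labels.

    An empty selector is treated as matching nothing (assumption for this sketch).
    """
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def trace_service(service: Service, pods: list) -> str:
    """Walk from a service to its pods and report where the chain breaks."""
    matched = [p for p in pods if selects(service.selector, p.labels)]
    if not matched:
        return f"selector mismatch: no pod labels match {service.selector}"
    unhealthy = [p for p in matched if p.status != "Running"]
    if len(unhealthy) == len(matched):
        statuses = sorted({p.status for p in unhealthy})
        return f"0 healthy endpoints: all pods in {', '.join(statuses)}"
    return "service has healthy endpoints"
```

In both failure cases the alert would fire on the service or ingress, while the finding points at the pods or their labels.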

Correlation Across Resources

OpsWorker correlates signals from multiple sources:

Logs — Application errors, stack traces, connection failures
Events — Kubernetes events (scheduling, OOM kills, image pulls, probe failures)
Configuration — Resource limits, selectors, environment variables
Topology — How resources are connected (pod → service → ingress chain)
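One way to picture this correlation is to follow the topology from the alerting resource and gather every signal attached to a resource on that path. The sketch below is illustrative only: `Signal`, `correlate`, and the `edges` mapping are assumed names, not part of OpsWorker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str    # "logs", "events", "configuration", or "topology"
    resource: str  # e.g. "pod/web-1" (hypothetical identifier format)
    detail: str


def correlate(alert_resource: str, edges: dict, signals: list) -> dict:
    """Group signals by resource for everything reachable from the alert.

    `edges` maps a resource to the resources it depends on.
    """
    # Depth-first walk over the dependency graph, starting at the alert.
    seen, stack = set(), [alert_resource]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))

    # Keep only signals that belong to a resource on the dependency path.
    grouped = defaultdict(list)
    for signal in signals:
        if signal.resource in seen:
            grouped[signal.resource].append(signal)
    return dict(grouped)
```

Signals from unrelated resources are dropped, so an ingress alert ends up associated with events from the pods behind it rather than with noise elsewhere in the cluster.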

Issue Classification

Each investigation classifies the issue:

Configuration issue — Misaligned selectors, incorrect resource limits, missing environment variables
Runtime issue — Application crash, memory leak, external dependency failure

This classification guides the type of recommendations generated.
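As a rough sketch, the two classes can be modelled as an enum with a lookup over finding types. The finding tags below are hypothetical labels derived from the examples above; how OpsWorker actually classifies issues is not described here.

```python
from enum import Enum


class IssueClass(Enum):
    CONFIGURATION = "configuration"
    RUNTIME = "runtime"


# Hypothetical finding tags, one per example in the lists above.
CONFIGURATION_FINDINGS = {
    "selector_mismatch",
    "incorrect_resource_limits",
    "missing_env_var",
}
RUNTIME_FINDINGS = {
    "application_crash",
    "memory_leak",
    "external_dependency_failure",
}


def classify(finding: str) -> IssueClass:
    """Map a finding tag to its issue class; unknown tags are rejected."""
    if finding in CONFIGURATION_FINDINGS:
        return IssueClass.CONFIGURATION
    if finding in RUNTIME_FINDINGS:
        return IssueClass.RUNTIME
    raise ValueError(f"unknown finding: {finding}")
```

A configuration result would typically lead to a manifest change, while a runtime result points toward the application or its dependencies.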

Confidence Levels

OpsWorker provides a confidence level with each root cause analysis, indicating how certain the AI is about its findings. Higher confidence comes from:

Clear error signals in logs or events
Direct correlation between alert and observed behavior
Configuration issues that can be definitively identified
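To make the idea concrete, the sketch below counts how many of the three factors are present and maps that count to a level. The level names ("low", "medium", "high") and the thresholds are assumptions chosen for illustration; the actual scale OpsWorker uses is not specified here.

```python
def confidence_level(
    clear_error_signal: bool,
    direct_correlation: bool,
    definitive_config_issue: bool,
) -> str:
    """Return an illustrative confidence level from the three factors above."""
    # Booleans sum as 0/1, so this counts the supporting factors.
    supporting = sum([clear_error_signal, direct_correlation, definitive_config_issue])
    if supporting >= 2:
        return "high"
    if supporting == 1:
        return "medium"
    return "low"
```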

What You Get

Each root cause analysis includes:

Root cause statement — A clear explanation of what went wrong and why
Evidence — The specific data points that support the conclusion (log lines, events, config values)
Affected resources — All resources involved in the issue, not just the alerting one
Confidence level — How certain the AI is about the diagnosis
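The four parts listed above map naturally onto a small record type. The following is a hypothetical shape, not OpsWorker's output format; the field names, the `summary` method, and the example values (based on the CrashLoopBackOff case earlier on this page) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RootCauseAnalysis:
    root_cause: str            # what went wrong and why
    evidence: list             # log lines, events, config values
    affected_resources: list   # every resource involved, not only the alerting one
    confidence: str            # how certain the diagnosis is

    def summary(self) -> str:
        """Render the analysis as a short plain-text report."""
        lines = [f"Root cause ({self.confidence} confidence): {self.root_cause}"]
        lines += [f"  evidence: {item}" for item in self.evidence]
        lines.append("  affected: " + ", ".join(self.affected_resources))
        return "\n".join(lines)


# Illustrative values only; resource names are made up.
example = RootCauseAnalysis(
    root_cause="Container OOM killed: memory limit is 256Mi but the app needs 512Mi",
    evidence=["config: memory limit 256Mi", "event: OOM kill"],
    affected_resources=["pod/api", "deployment/api"],
    confidence="high",
)
```

Calling `example.summary()` returns the root cause statement followed by one line per evidence item and a final line listing the affected resources.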

Next Steps

Suggested Fixes — Remediation steps generated from the root cause
Command Generation — kubectl commands for applying the fixes
