# Alert Correlation

## Overview
When a single issue causes multiple alerts, OpsWorker correlates them to avoid redundant investigations and provide a unified view of the problem.
## How Correlation Works

### Topology-Based Correlation
During investigation, OpsWorker discovers the dependency chain for the alerting resource. If multiple alerts fire for resources in the same chain, they're connected:
```mermaid
graph TD
    A["Alert: Pod CrashLoopBackOff"] --> RC
    B["Alert: Service 503 errors"] --> RC
    C["Alert: Ingress timeout"] --> RC
    RC["Deployment misconfiguration<br/>(single root cause)"]
```
A pod crash causes service endpoint loss, which causes ingress errors. Three alerts, one root cause.
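Conceptually, topology-based correlation collapses alerts whose resources sit on the same dependency chain. The sketch below is a minimal illustration, not OpsWorker's actual implementation: the alert records, resource names, and the `child -> parent` topology map are all hypothetical.

```python
from collections import defaultdict

# Hypothetical alerts; the resource names are illustrative only.
alerts = [
    {"id": "A1", "resource": "pod/checkout-7f9", "summary": "CrashLoopBackOff"},
    {"id": "A2", "resource": "svc/checkout", "summary": "503 errors"},
    {"id": "A3", "resource": "ingress/shop", "summary": "upstream timeout"},
    {"id": "A4", "resource": "pod/billing-2c1", "summary": "OOMKilled"},
]

# Dependency edges discovered during investigation (child -> parent).
topology = {
    "pod/checkout-7f9": "svc/checkout",
    "svc/checkout": "ingress/shop",
}

def root_of(resource: str) -> str:
    """Walk the dependency chain to its topmost resource."""
    while resource in topology:
        resource = topology[resource]
    return resource

# Alerts whose resources share a chain collapse into one group.
groups = defaultdict(list)
for alert in alerts:
    groups[root_of(alert["resource"])].append(alert["id"])

print(dict(groups))
# {'ingress/shop': ['A1', 'A2', 'A3'], 'pod/billing-2c1': ['A4']}
```

The three checkout-related alerts map to one group because their resources share a chain, while the unrelated billing alert stays separate.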
### Timeline Correlation
Alerts that fire within a similar timeframe for related resources are identified as potentially linked. This helps the AI focus on a common root cause rather than investigating each symptom independently.
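A simple way to picture time-based grouping is gap-based clustering: sort alerts by fire time and start a new group whenever the gap to the previous alert exceeds a window. This is a hedged sketch only; the 5-minute window and the alert timestamps are assumptions, not OpsWorker's documented behavior.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed window size, for illustration

# Hypothetical (alert_id, fired_at) pairs.
alerts = [
    ("A1", datetime(2024, 1, 1, 12, 0)),
    ("A2", datetime(2024, 1, 1, 12, 2)),
    ("A3", datetime(2024, 1, 1, 12, 4)),
    ("A4", datetime(2024, 1, 1, 13, 30)),
]

# Sort by fire time; open a new group when the gap exceeds WINDOW.
alerts.sort(key=lambda a: a[1])
groups = [[alerts[0]]]
for alert in alerts[1:]:
    if alert[1] - groups[-1][-1][1] <= WINDOW:
        groups[-1].append(alert)
    else:
        groups.append([alert])

print([[a[0] for a in g] for g in groups])
# [['A1', 'A2', 'A3'], ['A4']]
```

In practice a time window alone over-groups unrelated alerts, which is why the text pairs it with the topology check: alerts must be close in time *and* on related resources.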
## Benefits
| Without Correlation | With Correlation |
|---|---|
| 3 separate alerts → 3 investigations | 3 alerts → 1 investigation covering all resources |
| 3 Slack notifications | 1 comprehensive notification |
| Engineer investigates each separately | Single root cause identified with full context |
## Correlation in Practice

### During Automatic Investigation
When an alert triggers an investigation, the topology discovery step finds related resources. If other alerts have fired for those resources, the investigation covers them all.
### In the Portal
The investigation detail page shows all affected resources in the topology view, making it clear how alerts are connected.
### In the Daily Digest
The daily summary groups related alerts to show actual incident count rather than raw alert count.
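The incident count falls out of the grouping directly: count distinct correlation groups instead of raw alerts. A tiny illustration, with a hypothetical alert-to-group mapping:

```python
# Hypothetical mapping from alert ID to its correlation group.
correlated = {
    "A1": "incident-1",
    "A2": "incident-1",
    "A3": "incident-1",
    "A4": "incident-2",
}

raw_alert_count = len(correlated)               # every alert counted
incident_count = len(set(correlated.values()))  # one per correlation group

print(raw_alert_count, incident_count)
# 4 2
```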
## Next Steps
- How Investigations Work — Topology discovery in detail
- Noise Reduction — Reduce duplicate alerts