How Investigations Work
Overview
An OpsWorker investigation is an automated, AI-driven analysis of a Kubernetes alert. When an alert arrives, multiple AI agents work together to discover affected resources, gather data, identify the root cause, and generate actionable remediation steps.
Investigation Flow
```mermaid
flowchart TD
A[Alert arrives via webhook] --> B[Normalize alert format]
B --> C[Evaluate alert rules]
C --> D[Extract fields: namespace, pod, severity]
D --> E[Discover topology: pod → service → ingress]
E --> F[Collect data: logs, events, configs]
F --> G[AI analysis: root cause identification]
G --> H[Generate recommendations]
H --> I[Deliver to Slack + Portal]
```
Components
1. Alert Normalization
Incoming alerts from Prometheus, Grafana, Datadog, or CloudWatch are converted into a common format. OpsWorker handles the differences between alert formats automatically.
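As a sketch, a Prometheus Alertmanager webhook body (which carries a list of alerts under `alerts`, each with `labels` and `annotations`) might be normalized like this. The `NormalizedAlert` shape and function name are illustrative, not OpsWorker's actual internal format:

```python
from dataclasses import dataclass

@dataclass
class NormalizedAlert:
    # Illustrative common format; OpsWorker's real schema may differ.
    source: str
    name: str
    severity: str
    description: str
    labels: dict

def from_prometheus(payload: dict) -> list[NormalizedAlert]:
    # An Alertmanager webhook body carries a list of alerts.
    return [
        NormalizedAlert(
            source="prometheus",
            name=a.get("labels", {}).get("alertname", "unknown"),
            severity=a.get("labels", {}).get("severity", "none"),
            description=a.get("annotations", {}).get("description", ""),
            labels=a.get("labels", {}),
        )
        for a in payload.get("alerts", [])
    ]
```

A per-source adapter like this is what lets the rest of the pipeline ignore where an alert came from.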
2. Field Extraction
OpsWorker identifies key fields from the alert:
- Namespace — Which Kubernetes namespace is affected
- Pod/Resource — The specific resource that triggered the alert
- Severity — Alert severity level
- Description — Human-readable context
Fast regex-based extraction runs first. If confidence is low (e.g., non-standard alert format), AI-based extraction provides a fallback.
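The regex-first, AI-fallback flow can be sketched as follows. The patterns, the confidence scoring, and the injected `ai_fallback` hook are illustrative assumptions, not OpsWorker's actual extractors:

```python
import re

# Toy patterns for two of the fields; real extraction covers more.
NS_RE = re.compile(r'namespace[=:\s"]+(?P<ns>[a-z0-9][a-z0-9-]*)')
POD_RE = re.compile(r'\bpod[=:\s"]+(?P<pod>[a-z0-9][a-z0-9.-]*)')

def extract_fields(text: str) -> tuple[dict, float]:
    """Regex pass: return extracted fields plus a crude confidence score."""
    fields, hits = {}, 0
    if m := NS_RE.search(text):
        fields["namespace"], hits = m.group("ns"), hits + 1
    if m := POD_RE.search(text):
        fields["pod"], hits = m.group("pod"), hits + 1
    return fields, hits / 2

def extract(text: str, ai_fallback) -> dict:
    """Fast path first; defer to the AI extractor when confidence is low."""
    fields, confidence = extract_fields(text)
    return fields if confidence >= 1.0 else ai_fallback(text)
```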
3. Topology Discovery
Starting from the alerting resource, OpsWorker maps the dependency chain:
```mermaid
graph LR
Pod --> Service
Service --> Ingress
Pod --> ConfigMap
Pod --> Secret[Secret metadata]
Deployment --> Pod
ReplicaSet --> Pod
```
This is critical — the root cause often isn't in the alerting resource itself, but in an upstream or downstream dependency. For example, a pod crash might be caused by a misconfigured service selector or a missing configmap.
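The Pod → Service edge above comes from Kubernetes label selection: a Service targets every Pod whose labels satisfy its selector. A minimal sketch of that matching (input shapes are simplified stand-ins for real specs):

```python
def selects(selector: dict, labels: dict) -> bool:
    """A Service targets a Pod when every selector key/value is present
    in the Pod's labels (standard equality-based selection)."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

def services_for_pod(pod_labels: dict, services: list[dict]) -> list[str]:
    """Walk Pod -> Service edges by matching label selectors."""
    return [s["name"] for s in services if selects(s.get("selector", {}), pod_labels)]
```

A misconfigured selector shows up here immediately: no service matches the pod, which is exactly the kind of upstream cause the example describes.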
4. Data Collection
For each discovered resource, OpsWorker gathers:
| Data Type | Source | Example |
|---|---|---|
| Logs | Container stdout/stderr | Application errors, stack traces |
| Events | Kubernetes events | Pod scheduling, OOM kills, image pulls |
| Configuration | Resource specs | Deployment config, resource limits, env vars |
| Status | Resource status | Pod phase, container states, restart counts |
| Endpoints | Service endpoints | Healthy/unhealthy backends |
When additional integrations are configured, data collection expands to include:
| Data Type | Integration | Example |
|---|---|---|
| Prometheus metrics | Grafana MCP | CPU/memory trends, request rates, latency via PromQL |
| Application logs | Grafana MCP (Loki) | Log patterns, error rates via LogQL |
| APM data | Datadog | Traces, error rates, latency distributions |
| Code changes | GitHub / GitLab | Recent commits, PRs correlated with incident timeline |
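One way to picture the collection step: each discovered resource kind determines which of the data types above to fetch, and configured integrations append extra collectors. The mapping below is illustrative, not OpsWorker's actual collector list:

```python
# Illustrative kind -> data-type mapping, mirroring the tables above.
CORE_COLLECTORS = {
    "Pod": ["logs", "events", "config", "status"],
    "Service": ["endpoints", "events", "config"],
    "Deployment": ["status", "events", "config"],
    "Ingress": ["config", "events"],
}

def plan_collection(resources: list[tuple[str, str]],
                    integrations: list[str] = ()) -> dict:
    """Return, per resource, which collectors to run; optional
    integrations (e.g. "grafana", "datadog") add metric/log/trace data."""
    extra = []
    if "grafana" in integrations:
        extra += ["prometheus-metrics", "loki-logs"]
    if "datadog" in integrations:
        extra += ["apm-traces"]
    return {name: CORE_COLLECTORS.get(kind, ["events"]) + extra
            for name, kind in resources}
```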
5. AI Analysis
A multi-model AI strategy analyzes all collected data:
- Configuration validation — Checks for reference integrity (do selectors match?), contract matching (do ports align?), and fitness (are resource limits reasonable?)
- Issue classification — Determines if the problem is configuration-related or runtime-related
- Root cause correlation — Connects signals across multiple resources to identify the underlying cause
- Confidence assessment — Rates how confident the analysis is in its findings
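The configuration-validation bullet can be made concrete with toy versions of two checks, reference integrity and contract matching. The input shapes are simplified stand-ins for real Kubernetes specs:

```python
def validate_service(service: dict, pods: list[dict]) -> list[str]:
    """Toy validation: selector integrity and port contract checks."""
    findings = []
    # Reference integrity: does the selector match any pod at all?
    matched = [p for p in pods
               if all(p["labels"].get(k) == v
                      for k, v in service["selector"].items())]
    if not matched:
        findings.append("selector matches no pods")
    # Contract matching: do the matched pods expose the targetPort?
    for p in matched:
        if service["targetPort"] not in p["containerPorts"]:
            findings.append(
                f"pod {p['name']} does not expose port {service['targetPort']}")
    return findings
```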
6. Recommendation Generation
Based on the analysis, OpsWorker generates:
- Root cause statement — What went wrong and why
- Immediate actions — Steps to fix the current issue, with specific kubectl commands
- Preventive measures — Longer-term changes to prevent recurrence
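A recommendation might be represented roughly like this; the structure and the example kubectl command are illustrative, not guaranteed OpsWorker output:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    # Illustrative output shape mirroring the three bullets above.
    root_cause: str
    immediate_actions: list       # (description, command) pairs
    preventive_measures: list

rec = Recommendation(
    root_cause="Container exceeds its memory limit and is OOMKilled",
    immediate_actions=[
        ("Raise the memory limit",
         "kubectl -n payments set resources deployment/api --limits=memory=512Mi"),
    ],
    preventive_measures=["Alert on memory usage approaching the limit"],
)
```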
Supported Alert Types
| Alert Type | Examples | Investigation Approach |
|---|---|---|
| Pod failures | CrashLoopBackOff, ImagePullBackOff, OOMKilled | Logs, events, resource limits, exit codes |
| Service issues | No endpoints, connection refused | Service selectors, pod readiness, endpoint health |
| Ingress problems | 5xx errors, TLS failures | Ingress config, backend service, certificate status |
| Resource exhaustion | CPU throttling, memory pressure | Resource limits vs usage, HPA config |
| Deployment issues | Failed rollout, stuck rollout | Deployment strategy, pod scheduling, image availability |
Time to Complete
Most investigations complete in under 2 minutes from alert arrival to Slack notification. Investigation time depends on:
- Number of resources discovered in the topology
- Volume of logs to analyze
- Cluster agent response time
Next Steps
- Automatic Investigations — How auto-investigation works
- Investigation Lifecycle — Stages of an investigation
- Root Cause Analysis — How root causes are identified