How Investigations Work

Overview

An OpsWorker investigation is an automated, AI-driven analysis of a Kubernetes alert. When an alert arrives, multiple AI agents work together to discover affected resources, gather data, identify the root cause, and generate actionable remediation steps.

Investigation Flow

```mermaid
flowchart TD
    A[Alert arrives via webhook] --> B[Normalize alert format]
    B --> C[Evaluate alert rules]
    C --> D[Extract fields: namespace, pod, severity]
    D --> E[Discover topology: pod → service → ingress]
    E --> F[Collect data: logs, events, configs]
    F --> G[AI analysis: root cause identification]
    G --> H[Generate recommendations]
    H --> I[Deliver to Slack + Portal]
```

Components

1. Alert Normalization

Incoming alerts from Prometheus, Grafana, Datadog, or CloudWatch are converted into a common format. OpsWorker handles the differences between alert formats automatically.
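To make the idea concrete, here is a minimal sketch of normalization, assuming a hypothetical common format (the field names and payload shapes are illustrative, not OpsWorker internals):

```python
# Illustrative sketch: mapping two vendor payload shapes into one
# hypothetical common alert format. Field names are assumptions.

def normalize_prometheus(payload: dict) -> dict:
    """Map a Prometheus Alertmanager webhook alert to the common format."""
    alert = payload["alerts"][0]
    return {
        "name": alert["labels"].get("alertname", "unknown"),
        "severity": alert["labels"].get("severity", "unknown"),
        "namespace": alert["labels"].get("namespace"),
        "resource": alert["labels"].get("pod"),
        "description": alert["annotations"].get("description", ""),
    }

def normalize_datadog(payload: dict) -> dict:
    """Map a Datadog-style monitor event to the same common format."""
    tags = dict(t.split(":", 1) for t in payload.get("tags", []) if ":" in t)
    return {
        "name": payload.get("title", "unknown"),
        "severity": payload.get("priority", "unknown"),
        "namespace": tags.get("kube_namespace"),
        "resource": tags.get("pod_name"),
        "description": payload.get("body", ""),
    }
```

Once both sources produce the same shape, every later stage (extraction, topology, analysis) can be written once against the common format.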

2. Field Extraction

OpsWorker identifies key fields from the alert:

  • Namespace — Which Kubernetes namespace is affected
  • Pod/Resource — The specific resource that triggered the alert
  • Severity — Alert severity level
  • Description — Human-readable context

Fast regex-based extraction runs first. If confidence is low (e.g., non-standard alert format), AI-based extraction provides a fallback.
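The fast-path/fallback pattern above can be sketched as follows; the regex patterns, confidence metric, and threshold are illustrative assumptions, not OpsWorker's actual values:

```python
import re

# Sketch of regex-first extraction with an AI fallback. The patterns
# and the 0.7 threshold are illustrative assumptions.

PATTERNS = {
    "namespace": re.compile(r'namespace[=:\s]+"?([a-z0-9-]+)'),
    "pod": re.compile(r'pod[=:\s]+"?([a-z0-9.-]+)'),
    "severity": re.compile(r'severity[=:\s]+"?(critical|warning|info)'),
}

def extract_fields(alert_text: str) -> tuple[dict, float]:
    """Regex pass; confidence = fraction of expected fields found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(alert_text)
        if match:
            fields[name] = match.group(1)
    return fields, len(fields) / len(PATTERNS)

def extract_with_fallback(alert_text: str, threshold: float = 0.7) -> dict:
    fields, confidence = extract_fields(alert_text)
    if confidence < threshold:
        # Low confidence (e.g., non-standard format): hand the raw text
        # to a model-backed extractor instead. Stubbed here.
        fields = ai_extract(alert_text)
    return fields

def ai_extract(alert_text: str) -> dict:
    raise NotImplementedError("call the model-backed extractor here")
```

The design rationale is cost: the regex pass is effectively free and handles well-formed alerts, so the slower AI call only runs for the minority of alerts it is actually needed for.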

3. Topology Discovery

Starting from the alerting resource, OpsWorker maps the dependency chain:

```mermaid
graph LR
    Pod --> Service
    Service --> Ingress
    Pod --> ConfigMap
    Pod --> Secret[Secret metadata]
    Deployment --> Pod
    ReplicaSet --> Pod
```

This step is critical: the root cause often isn't in the alerting resource itself, but in an upstream or downstream dependency. For example, a pod crash might be caused by a misconfigured Service selector or a missing ConfigMap.
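One edge in the graph above, Pod → Service, is resolved by evaluating label selectors. A minimal sketch (the trimmed resource dicts mimic Kubernetes objects; this is not OpsWorker's discovery code):

```python
# Sketch of the Pod -> Service topology edge: a Service selects Pods
# whose labels contain every key/value pair in the Service's selector.

def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True when every selector key/value appears in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

def services_for_pod(pod: dict, services: list[dict]) -> list[str]:
    """Walk the Pod -> Service edge by evaluating label selectors."""
    labels = pod["metadata"].get("labels", {})
    return [
        svc["metadata"]["name"]
        for svc in services
        # Skip selector-less Services (they don't select pods by label).
        if svc["spec"].get("selector")
        and selector_matches(svc["spec"]["selector"], labels)
    ]
```

Repeating this kind of lookup for ConfigMap/Secret references, owner references (ReplicaSet, Deployment), and Ingress backends yields the dependency graph shown above.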

4. Data Collection

For each discovered resource, OpsWorker gathers:

| Data Type | Source | Example |
| --- | --- | --- |
| Logs | Container stdout/stderr | Application errors, stack traces |
| Events | Kubernetes events | Pod scheduling, OOM kills, image pulls |
| Configuration | Resource specs | Deployment config, resource limits, env vars |
| Status | Resource status | Pod phase, container states, restart counts |
| Endpoints | Service endpoints | Healthy/unhealthy backends |

When additional integrations are configured, the data collection expands:

| Data Type | Integration | Example |
| --- | --- | --- |
| Prometheus metrics | Grafana MCP | CPU/memory trends, request rates, latency via PromQL |
| Application logs | Grafana MCP (Loki) | Log patterns, error rates via LogQL |
| APM data | Datadog | Traces, error rates, latency distributions |
| Code changes | GitHub / GitLab | Recent commits, PRs correlated with incident timeline |
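The tables above can be read as a kind-to-data-type mapping. A sketch of expanding discovered resources into a collection plan (the mapping and resource names are illustrative):

```python
# Sketch: expand each discovered resource into the data types to fetch,
# mirroring the table above. The mapping is an illustrative assumption.

COLLECTORS = {
    "Pod": ["logs", "events", "status", "configuration"],
    "Service": ["endpoints", "configuration", "events"],
    "Deployment": ["configuration", "status", "events"],
    "ConfigMap": ["configuration"],
}

def collection_plan(resources: list[dict]) -> list[dict]:
    """Build one collection task per (resource, data type) pair."""
    plan = []
    for res in resources:
        for data_type in COLLECTORS.get(res["kind"], ["configuration"]):
            plan.append(
                {"kind": res["kind"], "name": res["name"], "collect": data_type}
            )
    return plan
```

Each task in the plan can then be dispatched to the cluster agent (or to an integration such as Grafana MCP or Datadog) independently and in parallel.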

5. AI Analysis

A multi-model AI strategy analyzes all collected data:

  • Configuration validation — Checks for reference integrity (do selectors match?), contract matching (do ports align?), and fitness (are resource limits reasonable?)
  • Issue classification — Determines if the problem is configuration-related or runtime-related
  • Root cause correlation — Connects signals across multiple resources to identify the underlying cause
  • Confidence assessment — Rates how confident the analysis is in its findings
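Two of the configuration-validation checks listed above, reference integrity (selectors) and contract matching (ports), can be sketched like this; the finding strings and function shape are illustrative, not OpsWorker's actual output:

```python
# Sketch of two configuration-validation checks on a Service/Pod pair:
# reference integrity (selector match) and contract matching (ports).

def validate(service: dict, pod: dict) -> list[str]:
    findings = []
    selector = service["spec"].get("selector", {})
    labels = pod["metadata"].get("labels", {})
    # Reference integrity: does the Service selector match the Pod's labels?
    if not all(labels.get(k) == v for k, v in selector.items()):
        findings.append("Service selector does not match pod labels")
    # Contract matching: does any container expose the Service targetPort?
    target_ports = {p.get("targetPort") for p in service["spec"].get("ports", [])}
    container_ports = {
        p["containerPort"]
        for c in pod["spec"].get("containers", [])
        for p in c.get("ports", [])
    }
    if target_ports and not target_ports & container_ports:
        findings.append("No container exposes the Service targetPort")
    return findings
```

A mismatch from either check is a strong configuration-related signal; if both pass, the classifier leans toward a runtime-related cause instead.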

6. Recommendation Generation

Based on the analysis, OpsWorker generates:

  • Root cause statement — What went wrong and why
  • Immediate actions — Steps to fix the current issue, with specific kubectl commands
  • Preventive measures — Longer-term changes to prevent recurrence
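The three parts above suggest a simple output shape. A hypothetical sketch (the field names and Slack rendering are assumptions, not OpsWorker's actual schema):

```python
from dataclasses import dataclass, field

# Illustrative shape for a generated recommendation; field names are
# assumptions based on the three parts described above.

@dataclass
class Recommendation:
    root_cause: str
    immediate_actions: list[str] = field(default_factory=list)
    preventive_measures: list[str] = field(default_factory=list)

    def to_slack_text(self) -> str:
        """Render as a plain-text body suitable for a Slack message."""
        lines = [f"*Root cause:* {self.root_cause}", "*Immediate actions:*"]
        lines += [f"  {i}. {a}" for i, a in enumerate(self.immediate_actions, 1)]
        lines.append("*Preventive measures:*")
        lines += [f"  - {m}" for m in self.preventive_measures]
        return "\n".join(lines)
```

Keeping the structured object separate from its rendering lets the same recommendation feed both the Slack notification and the Portal view.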

Supported Alert Types

| Alert Type | Examples | Investigation Approach |
| --- | --- | --- |
| Pod failures | CrashLoopBackOff, ImagePullBackOff, OOMKilled | Logs, events, resource limits, exit codes |
| Service issues | No endpoints, connection refused | Service selectors, pod readiness, endpoint health |
| Ingress problems | 5xx errors, TLS failures | Ingress config, backend service, certificate status |
| Resource exhaustion | CPU throttling, memory pressure | Resource limits vs usage, HPA config |
| Deployment issues | Failed rollout, stuck rollout | Deployment strategy, pod scheduling, image availability |

Time to Complete

Most investigations complete in under 2 minutes from alert arrival to Slack notification. Investigation time depends on:

  • Number of resources discovered in the topology
  • Volume of logs to analyze
  • Cluster agent response time

Next Steps