How Investigations Work

Overview

An OpsWorker investigation is an automated, AI-driven analysis of a Kubernetes alert. When an alert arrives, multiple AI agents work together to discover affected resources, gather data, identify the root cause, and generate actionable remediation steps.

Investigation Flow

```mermaid
flowchart TD
    A[Alert arrives via webhook] --> B[Normalize alert format]
    B --> C[Evaluate alert rules]
    C --> D[Extract fields: namespace, pod, severity]
    D --> E[Discover topology: pod → service → ingress]
    E --> F[Collect data: logs, events, configs]
    F --> G[AI analysis: root cause identification]
    G --> H[Generate recommendations]
    H --> I[Deliver to Slack + Portal]
```

Components

1. Alert Normalization

Incoming alerts from Prometheus, Grafana, Datadog, or CloudWatch are converted into a common format. OpsWorker handles the differences between alert formats automatically.
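To make the idea concrete, here is a minimal sketch of normalization, assuming a hypothetical common format (the field names and payload shapes are illustrative, not OpsWorker internals):

```python
# Illustrative sketch: mapping two vendor payload shapes into one
# hypothetical common alert format. Field names are assumptions.

def normalize_prometheus(payload: dict) -> dict:
    """Map a Prometheus Alertmanager webhook alert to the common format."""
    alert = payload["alerts"][0]
    return {
        "name": alert["labels"].get("alertname", "unknown"),
        "severity": alert["labels"].get("severity", "unknown"),
        "namespace": alert["labels"].get("namespace"),
        "resource": alert["labels"].get("pod"),
        "description": alert["annotations"].get("description", ""),
    }

def normalize_datadog(payload: dict) -> dict:
    """Map a Datadog-style monitor event to the same common format."""
    tags = dict(t.split(":", 1) for t in payload.get("tags", []) if ":" in t)
    return {
        "name": payload.get("title", "unknown"),
        "severity": payload.get("priority", "unknown"),
        "namespace": tags.get("kube_namespace"),
        "resource": tags.get("pod_name"),
        "description": payload.get("body", ""),
    }
```

Once both sources produce the same shape, every later stage (extraction, topology, analysis) can be written once against the common format.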

2. Field Extraction

OpsWorker identifies key fields from the alert:

  • Namespace — Which Kubernetes namespace is affected
  • Pod/Resource — The specific resource that triggered the alert
  • Severity — Alert severity level
  • Description — Human-readable context

Fast regex-based extraction runs first. If confidence is low (e.g., non-standard alert format), AI-based extraction provides a fallback.
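The fast-path/fallback pattern above can be sketched as follows; the regex patterns, confidence metric, and threshold are illustrative assumptions, not OpsWorker's actual values:

```python
import re

# Sketch of regex-first extraction with an AI fallback. The patterns
# and the 0.7 threshold are illustrative assumptions.

PATTERNS = {
    "namespace": re.compile(r'namespace[=:\s]+"?([a-z0-9-]+)'),
    "pod": re.compile(r'pod[=:\s]+"?([a-z0-9.-]+)'),
    "severity": re.compile(r'severity[=:\s]+"?(critical|warning|info)'),
}

def extract_fields(alert_text: str) -> tuple[dict, float]:
    """Regex pass; confidence = fraction of expected fields found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(alert_text)
        if match:
            fields[name] = match.group(1)
    return fields, len(fields) / len(PATTERNS)

def extract_with_fallback(alert_text: str, threshold: float = 0.7) -> dict:
    fields, confidence = extract_fields(alert_text)
    if confidence < threshold:
        # Low confidence (e.g., non-standard format): hand the raw text
        # to a model-backed extractor instead. Stubbed here.
        fields = ai_extract(alert_text)
    return fields

def ai_extract(alert_text: str) -> dict:
    raise NotImplementedError("call the model-backed extractor here")
```

The design rationale is cost: the regex pass is effectively free and handles well-formed alerts, so the slower AI call only runs for the minority of alerts it is actually needed for.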

3. Topology Discovery

Starting from the alerting resource, OpsWorker maps the dependency chain:

```mermaid
graph LR
    Pod --> Service
    Service --> Ingress
    Pod --> ConfigMap
    Pod --> Secret[Secret metadata]
    Deployment --> Pod
    ReplicaSet --> Pod
```

This step is critical: the root cause often isn't in the alerting resource itself, but in an upstream or downstream dependency. For example, a pod crash might be caused by a misconfigured Service selector or a missing ConfigMap.
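One edge in the graph above, Pod → Service, is resolved by evaluating label selectors. A minimal sketch (the trimmed resource dicts mimic Kubernetes objects; this is not OpsWorker's discovery code):

```python
# Sketch of the Pod -> Service topology edge: a Service selects Pods
# whose labels contain every key/value pair in the Service's selector.

def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True when every selector key/value appears in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

def services_for_pod(pod: dict, services: list[dict]) -> list[str]:
    """Walk the Pod -> Service edge by evaluating label selectors."""
    labels = pod["metadata"].get("labels", {})
    return [
        svc["metadata"]["name"]
        for svc in services
        # Skip selector-less Services (they don't select pods by label).
        if svc["spec"].get("selector")
        and selector_matches(svc["spec"]["selector"], labels)
    ]
```

Repeating this kind of lookup for ConfigMap/Secret references, owner references (ReplicaSet, Deployment), and Ingress backends yields the dependency graph shown above.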

4. Data Collection

For each discovered resource, OpsWorker gathers:

| Data Type | Source | Example |
| --- | --- | --- |
| Logs | Container stdout/stderr | Application errors, stack traces |
| Events | Kubernetes events | Pod scheduling, OOM kills, image pulls |
| Configuration | Resource specs | Deployment config, resource limits, env vars |
| Status | Resource status | Pod phase, container states, restart counts |
| Endpoints | Service endpoints | Healthy/unhealthy backends |

When additional integrations are configured, the data collection expands:

| Data Type | Integration | Example |
| --- | --- | --- |
| Prometheus metrics | Grafana MCP | CPU/memory trends, request rates, latency via PromQL |
| Application logs | Grafana MCP (Loki) | Log patterns, error rates via LogQL |
| APM data | Datadog | Traces, error rates, latency distributions |
| Code changes | GitHub / GitLab | Recent commits, PRs correlated with incident timeline |
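The tables above can be read as a kind-to-data-type mapping. A sketch of expanding discovered resources into a collection plan (the mapping and resource names are illustrative):

```python
# Sketch: expand each discovered resource into the data types to fetch,
# mirroring the table above. The mapping is an illustrative assumption.

COLLECTORS = {
    "Pod": ["logs", "events", "status", "configuration"],
    "Service": ["endpoints", "configuration", "events"],
    "Deployment": ["configuration", "status", "events"],
    "ConfigMap": ["configuration"],
}

def collection_plan(resources: list[dict]) -> list[dict]:
    """Build one collection task per (resource, data type) pair."""
    plan = []
    for res in resources:
        for data_type in COLLECTORS.get(res["kind"], ["configuration"]):
            plan.append(
                {"kind": res["kind"], "name": res["name"], "collect": data_type}
            )
    return plan
```

Each task in the plan can then be dispatched to the cluster agent (or to an integration such as Grafana MCP or Datadog) independently and in parallel.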

5. AI Analysis

A multi-model AI strategy analyzes all collected data:

  • Configuration validation — Checks for reference integrity (do selectors match?), contract matching (do ports align?), and fitness (are resource limits reasonable?)
  • Issue classification — Determines if the problem is configuration-related or runtime-related
  • Root cause correlation — Connects signals across multiple resources to identify the underlying cause
  • Confidence assessment — Rates how confident the analysis is in its findings
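Two of the configuration-validation checks listed above, reference integrity (selectors) and contract matching (ports), can be sketched like this; the finding strings and function shape are illustrative, not OpsWorker's actual output:

```python
# Sketch of two configuration-validation checks on a Service/Pod pair:
# reference integrity (selector match) and contract matching (ports).

def validate(service: dict, pod: dict) -> list[str]:
    findings = []
    selector = service["spec"].get("selector", {})
    labels = pod["metadata"].get("labels", {})
    # Reference integrity: does the Service selector match the Pod's labels?
    if not all(labels.get(k) == v for k, v in selector.items()):
        findings.append("Service selector does not match pod labels")
    # Contract matching: does any container expose the Service targetPort?
    target_ports = {p.get("targetPort") for p in service["spec"].get("ports", [])}
    container_ports = {
        p["containerPort"]
        for c in pod["spec"].get("containers", [])
        for p in c.get("ports", [])
    }
    if target_ports and not target_ports & container_ports:
        findings.append("No container exposes the Service targetPort")
    return findings
```

A mismatch from either check is a strong configuration-related signal; if both pass, the classifier leans toward a runtime-related cause instead.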

6. Recommendation Generation

Based on the analysis, OpsWorker generates:

  • Root cause statement — What went wrong and why
  • Immediate actions — Steps to fix the current issue, with specific kubectl commands
  • Preventive measures — Longer-term changes to prevent recurrence
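The three parts above suggest a simple output shape. A hypothetical sketch (the field names and Slack rendering are assumptions, not OpsWorker's actual schema):

```python
from dataclasses import dataclass, field

# Illustrative shape for a generated recommendation; field names are
# assumptions based on the three parts described above.

@dataclass
class Recommendation:
    root_cause: str
    immediate_actions: list[str] = field(default_factory=list)
    preventive_measures: list[str] = field(default_factory=list)

    def to_slack_text(self) -> str:
        """Render as a plain-text body suitable for a Slack message."""
        lines = [f"*Root cause:* {self.root_cause}", "*Immediate actions:*"]
        lines += [f"  {i}. {a}" for i, a in enumerate(self.immediate_actions, 1)]
        lines.append("*Preventive measures:*")
        lines += [f"  - {m}" for m in self.preventive_measures]
        return "\n".join(lines)
```

Keeping the structured object separate from its rendering lets the same recommendation feed both the Slack notification and the Portal view.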

Supported Alert Types

| Alert Type | Examples | Investigation Approach |
| --- | --- | --- |
| Pod failures | CrashLoopBackOff, ImagePullBackOff, OOMKilled | Logs, events, resource limits, exit codes |
| Service issues | No endpoints, connection refused | Service selectors, pod readiness, endpoint health |
| Ingress problems | 5xx errors, TLS failures | Ingress config, backend service, certificate status |
| Resource exhaustion | CPU throttling, memory pressure | Resource limits vs usage, HPA config |
| Deployment issues | Failed rollout, stuck rollout | Deployment strategy, pod scheduling, image availability |

Time to Complete

Most investigations complete in under 2 minutes from alert arrival to Slack notification. Investigation time depends on:

  • Number of resources discovered in the topology
  • Volume of logs to analyze
  • Cluster agent response time

Next Steps