Investigation Lifecycle

Overview

Every investigation moves through a defined set of stages from alert arrival to result delivery. Understanding the lifecycle helps you interpret investigation status and troubleshoot issues.

Stages

```mermaid
stateDiagram-v2
    [*] --> Pending: Alert matches rule
    Pending --> InProgress: Investigation starts
    InProgress --> Extracting: Field extraction
    Extracting --> Discovering: Topology discovery
    Discovering --> Collecting: Data collection
    Collecting --> Analyzing: AI analysis
    Analyzing --> Completed: Results ready
    InProgress --> Failed: Error occurred
    Completed --> [*]
    Failed --> [*]
```
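The transitions in the diagram can be sketched as a simple lookup table; this is an illustrative model of the lifecycle, not the product's actual implementation, and the state names are taken directly from the diagram:

```python
# Sketch of the lifecycle as an allowed-transition table (hypothetical model).
TRANSITIONS = {
    "Pending": {"InProgress"},
    "InProgress": {"Extracting", "Failed"},
    "Extracting": {"Discovering"},
    "Discovering": {"Collecting"},
    "Collecting": {"Analyzing"},
    "Analyzing": {"Completed"},
    "Completed": set(),   # terminal
    "Failed": set(),      # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle permits moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```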

Pending

The alert has been received and matched an alert rule. The investigation is queued for processing.

  • Duration: Seconds
  • Visible in: Portal investigation list (status: Pending)

In Progress

The investigation is actively running. AI agents are working through the investigation steps:

Field Extraction

Identifies the affected namespace, pod, severity, and other key fields from the alert payload. Uses fast regex-based extraction first, with AI-based fallback for non-standard formats.
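The regex-first approach can be sketched as below. The field names and label patterns are assumptions (a Prometheus-style `key="value"` payload is shown); the actual patterns depend on your alerting format, and the AI fallback is not shown:

```python
import re

# Hypothetical fast-path patterns for a Prometheus-style alert payload.
PATTERNS = {
    "namespace": re.compile(r'namespace="([^"]+)"'),
    "pod": re.compile(r'pod="([^"]+)"'),
    "severity": re.compile(r'severity="([^"]+)"'),
}

def extract_fields(alert_text: str) -> dict:
    """Regex fast path: return whatever key fields match. Fields that
    fail to match would be handed to the AI-based fallback (not shown)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(alert_text)
        if match:
            fields[name] = match.group(1)
    return fields
```

For a standard payload such as `namespace="shop" pod="checkout-5d9c6" severity="critical"`, the regex path resolves all three fields without any model call, which is why it runs first.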

Topology Discovery

Starting from the identified resource, maps the dependency chain (pod → service → ingress). Discovers related resources that may contain the root cause.
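The pod-to-service hop of that chain rests on standard Kubernetes label-selector semantics, sketched below; the function names and sample data are illustrative, not part of the product's API:

```python
def selects(service_selector: dict, pod_labels: dict) -> bool:
    """A Service selects a Pod when every key/value in its selector
    appears in the pod's labels (standard Kubernetes semantics for
    equality-based, non-empty selectors)."""
    return all(pod_labels.get(k) == v for k, v in service_selector.items())

def discover_services(pod_labels: dict, services: dict) -> list:
    """Hypothetical discovery step: given the affected pod's labels and a
    map of service name -> selector, return the services that select it."""
    return [name for name, sel in services.items() if selects(sel, pod_labels)]
```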

Data Collection

Gathers data from all discovered resources:

  • Pod logs (recent container output)
  • Kubernetes events (scheduling, state changes, errors)
  • Resource configurations (specs, limits, selectors)
  • Service endpoint health
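The equivalent data could be gathered manually with standard kubectl commands; the sketch below builds those command strings for a given pod (how the product actually collects data internally is not specified here):

```python
# Illustrative kubectl equivalents of the four collection steps above.
def collection_commands(namespace: str, pod: str) -> list:
    return [
        # Recent container output
        f"kubectl logs {pod} -n {namespace} --tail=100",
        # Events scoped to the affected pod
        f"kubectl get events -n {namespace} --field-selector involvedObject.name={pod}",
        # Full resource configuration
        f"kubectl get pod {pod} -n {namespace} -o yaml",
        # Endpoint health for services in the namespace
        f"kubectl get endpoints -n {namespace}",
    ]
```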

AI Analysis

Analyzes all collected data to:

  • Validate configurations (reference integrity, port alignment, selector matching)
  • Classify the issue (configuration vs. runtime)
  • Identify the root cause with a confidence level
  • Generate specific remediation recommendations
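One of those configuration checks, port alignment, can be illustrated in isolation. This is a minimal sketch of the idea, not the product's validator:

```python
def validate_port_alignment(service_target_port: int, container_ports: list) -> list:
    """Flag a Service whose targetPort matches no port exposed by the
    pod's containers -- a classic configuration-class root cause."""
    findings = []
    if service_target_port not in container_ports:
        findings.append(
            f"targetPort {service_target_port} matches no container port in {container_ports}"
        )
    return findings
```

A finding here would push the classification toward "configuration" rather than "runtime", since the mismatch exists before any traffic flows.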

Completed

The investigation has finished. Results include:

  • Root cause analysis with confidence level
  • Affected resources and their topology
  • Immediate action recommendations with kubectl commands
  • Preventive measure recommendations
  • Complete conversation log (AI's reasoning process)

Notifications are sent to configured Slack channels and the investigation is available in the portal.

Failed

The investigation could not complete. Common reasons:

  Reason                      Resolution
  Cluster agent disconnected  Check agent pod status and connectivity
  Timeout                     Agent may be overloaded; check resource limits
  Insufficient permissions    Agent RBAC may not cover the affected namespace

Failed investigations are visible in the portal with error details.

Viewing Investigation Status

  • Portal: Navigate to Investigations — each investigation shows its current status
  • Slack: Notifications are sent only when investigations complete
  • Investigation detail page: Shows the full timeline of each stage

Data Retention

Completed investigations and their collected data are retained in the portal for historical review and trend analysis.

Next Steps