Investigation Lifecycle
Overview
Every investigation moves through a defined set of stages from alert arrival to result delivery. Understanding the lifecycle helps you interpret investigation status and troubleshoot issues.
Stages
```mermaid
stateDiagram-v2
    [*] --> Pending: Alert matches rule
    Pending --> InProgress: Investigation starts
    InProgress --> Extracting: Field extraction
    Extracting --> Discovering: Topology discovery
    Discovering --> Collecting: Data collection
    Collecting --> Analyzing: AI analysis
    Analyzing --> Completed: Results ready
    InProgress --> Failed: Error occurred
    Completed --> [*]
    Failed --> [*]
```
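The stage transitions can be sketched as a small state machine. This is an illustrative model only: the stage names mirror the diagram, but the assumption that any active sub-stage (not just `InProgress`) can move to `Failed` is mine, not a documented API.

```python
from enum import Enum

class Stage(Enum):
    PENDING = "Pending"
    IN_PROGRESS = "InProgress"
    EXTRACTING = "Extracting"
    DISCOVERING = "Discovering"
    COLLECTING = "Collecting"
    ANALYZING = "Analyzing"
    COMPLETED = "Completed"
    FAILED = "Failed"

# Legal transitions from the diagram; Failed from every active
# sub-stage is an assumption for illustration.
TRANSITIONS = {
    Stage.PENDING: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.EXTRACTING, Stage.FAILED},
    Stage.EXTRACTING: {Stage.DISCOVERING, Stage.FAILED},
    Stage.DISCOVERING: {Stage.COLLECTING, Stage.FAILED},
    Stage.COLLECTING: {Stage.ANALYZING, Stage.FAILED},
    Stage.ANALYZING: {Stage.COMPLETED, Stage.FAILED},
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move to `target` if the diagram allows it, else raise."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```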
Pending
The alert has been received and matched an alert rule. The investigation is queued for processing.
- Duration: Seconds
- Visible in: Portal investigation list (status: Pending)
In Progress
The investigation is actively running. AI agents are working through the investigation steps:
Field Extraction
Identifies the affected namespace, pod, severity, and other key fields from the alert payload. Uses fast regex-based extraction first, with AI-based fallback for non-standard formats.
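A minimal sketch of the two-tier strategy, assuming a JSON-style alert payload and hypothetical field names; `ai_extract` is a placeholder for the slower model-backed fallback, not a real function of the platform:

```python
import re

# Hypothetical patterns for a JSON-style alert payload; real
# field names and formats vary by alerting stack.
PATTERNS = {
    "namespace": re.compile(r'"namespace"\s*:\s*"([^"]+)"'),
    "pod": re.compile(r'"pod"\s*:\s*"([^"]+)"'),
    "severity": re.compile(r'"severity"\s*:\s*"([^"]+)"'),
}

def extract_fields(payload: str) -> dict:
    """Fast regex pass; unmatched fields fall through to the AI extractor."""
    fields, missing = {}, []
    for name, pattern in PATTERNS.items():
        m = pattern.search(payload)
        if m:
            fields[name] = m.group(1)
        else:
            missing.append(name)
    if missing:
        fields.update(ai_extract(payload, missing))  # slower fallback
    return fields

def ai_extract(payload: str, fields: list) -> dict:
    """Placeholder for the model-backed fallback (non-standard formats)."""
    return {}
```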
Topology Discovery
Starting from the identified resource, the agent maps the dependency chain (pod → service → ingress) and discovers related resources that may contain the root cause.
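Discovery hinges on standard Kubernetes linkage: a Service selects Pods whose labels match its selector, and an Ingress points at a backend Service by name. A simplified sketch using plain dicts (not the agent's real data model):

```python
def selects(selector: dict, labels: dict) -> bool:
    """A Service selects a Pod when every selector key/value matches."""
    return all(labels.get(k) == v for k, v in selector.items())

def discover_chain(pod: dict, services: list, ingresses: list) -> dict:
    """Walk pod -> service -> ingress via selectors and backend references."""
    matched = [s for s in services if selects(s["selector"], pod["labels"])]
    names = {s["name"] for s in matched}
    related = [i for i in ingresses if i["backend_service"] in names]
    return {
        "pod": pod["name"],
        "services": [s["name"] for s in matched],
        "ingresses": [i["name"] for i in related],
    }
```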
Data Collection
Gathers data from all discovered resources:
- Pod logs (recent container output)
- Kubernetes events (scheduling, state changes, errors)
- Resource configurations (specs, limits, selectors)
- Service endpoint health
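One reasonable way to structure collection is to run each data source independently and record failures rather than aborting the whole investigation; this sketch uses hypothetical collector callables and is not the agent's actual implementation:

```python
def collect(resource: str, collectors: dict) -> dict:
    """Run each collector against a resource, capturing per-source failures."""
    results = {}
    for name, fn in collectors.items():
        try:
            results[name] = fn(resource)
        except Exception as exc:
            # A single failing source (e.g. RBAC denial) should not
            # sink the rest of the collection.
            results[name] = {"error": str(exc)}
    return results
```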
AI Analysis
Analyzes all collected data to:
- Validate configurations (reference integrity, port alignment, selector matching)
- Classify the issue (configuration vs. runtime)
- Identify the root cause with a confidence level
- Generate specific remediation recommendations
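The configuration-validation step can be illustrated with a selector/port cross-check between a Service and its Pods; the dict shapes and finding messages here are illustrative only, not the product's actual output:

```python
def validate_service(service: dict, pods: list) -> list:
    """Cross-check a Service against Pods; returns human-readable findings."""
    findings = []
    selected = [
        p for p in pods
        if all(p["labels"].get(k) == v for k, v in service["selector"].items())
    ]
    if not selected:
        findings.append("selector matches no pods")
    for p in selected:
        # Port alignment: the Service targetPort must be exposed by the Pod.
        if service["target_port"] not in p["container_ports"]:
            findings.append(
                f"targetPort {service['target_port']} not exposed by {p['name']}"
            )
    return findings
```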
Completed
The investigation has finished. Results include:
- Root cause analysis with confidence level
- Affected resources and their topology
- Immediate action recommendations with kubectl commands
- Preventive measure recommendations
- Complete conversation log (AI's reasoning process)
Notifications are sent to configured Slack channels and the investigation is available in the portal.
Failed
The investigation could not complete. Common reasons:
| Reason | Resolution |
|---|---|
| Cluster agent disconnected | Check agent pod status and connectivity |
| Timeout | Agent may be overloaded — check resource limits |
| Insufficient permissions | Agent RBAC may not cover the affected namespace |
Failed investigations are visible in the portal with error details.
Viewing Investigation Status
- Portal: Navigate to Investigations — each investigation shows its current status
- Slack: Notifications are sent only when investigations complete
- Investigation detail page: Shows the full timeline of each stage
Data Retention
Completed investigations and their collected data are retained in the portal for historical review and trend analysis.
Next Steps
- Investigation Chat — Ask follow-up questions
- How Investigations Work — Technical deep dive