Investigation Flow

End-to-End Data Flow

sequenceDiagram
participant Mon as Monitoring System
participant API as API Gateway
participant Norm as Normalizer
participant Rules as Rule Engine
participant AI as Investigation Engine
participant SQS as SQS
participant Agent as K8s Agent
participant K8s as Kubernetes API
participant DB as Data Store
participant Slack as Slack

Mon->>API: Alert webhook (POST)
API->>Norm: Raw alert payload
Norm->>DB: Store normalized signal
Norm->>Rules: Evaluate against alert rules
Rules->>AI: Match found → start investigation

Note over AI: Field Extraction
AI->>AI: Extract namespace, pod, severity

Note over AI: Topology Discovery
AI->>SQS: "Get pod details"
SQS->>Agent: Command
Agent->>K8s: kubectl get pod
K8s-->>Agent: Pod spec + status
Agent->>SQS: Results
SQS-->>AI: Pod data
AI->>SQS: "Get related service"
Agent->>K8s: kubectl get svc
AI->>AI: Build dependency graph

Note over AI: Data Collection
AI->>SQS: "Get logs, events, configs"
Agent->>K8s: Multiple queries
SQS-->>AI: Collected data

Note over AI: AI Analysis
AI->>AI: Root cause analysis
AI->>AI: Generate recommendations

AI->>DB: Store investigation results
DB->>Slack: Notification with summary

Steps in Detail

1. Alert Reception

The monitoring system sends an alert via webhook (HTTP POST) to OpsWorker's API Gateway. Supported formats: Prometheus Alertmanager, Grafana, Datadog, CloudWatch.

2. Normalization

The alert is converted into OpsWorker's common format regardless of source. Key fields extracted: alert name, severity, labels, annotations.
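As an illustration of normalization, here is a minimal sketch for one source, the Prometheus Alertmanager webhook payload, which carries an `alerts` list whose entries have `labels` and `annotations`. The `NormalizedSignal` class and its field names are assumptions made for this example; OpsWorker's actual common format is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizedSignal:
    """Hypothetical common format; field names are assumptions."""
    source: str
    alert_name: str
    severity: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def normalize_alertmanager(payload: dict) -> list:
    """Convert an Alertmanager webhook payload into normalized signals."""
    signals = []
    for alert in payload.get("alerts", []):
        labels = dict(alert.get("labels", {}))
        signals.append(
            NormalizedSignal(
                source="alertmanager",
                # Alertmanager stores the alert name in the "alertname" label.
                alert_name=labels.get("alertname", "unknown"),
                # Severity is a labeling convention, so it may be missing.
                severity=labels.get("severity", "unknown"),
                labels=labels,
                annotations=dict(alert.get("annotations", {})),
            )
        )
    return signals
```

Other sources would each get their own small adapter that produces the same structure, so that everything downstream of the Normalizer handles one shape only.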

3. Rule Evaluation

The normalized signal is evaluated against configured alert rules. If a rule matches and auto-investigation is enabled, an investigation starts.
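The matching logic can be sketched as follows. This assumes rules match on exact label values and carry an auto-investigation flag; the `AlertRule` structure and the first-match-wins behavior are hypothetical, not OpsWorker's documented rule schema.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """Hypothetical rule shape: exact label matchers plus a flag."""
    name: str
    match_labels: dict = field(default_factory=dict)
    auto_investigate: bool = False


def first_matching_rule(labels: dict, rules: list):
    """Return the first rule whose matchers all equal the signal's labels."""
    for rule in rules:
        # A rule with no matchers matches every signal.
        if all(labels.get(k) == v for k, v in rule.match_labels.items()):
            return rule
    return None


def should_investigate(labels: dict, rules: list) -> bool:
    """An investigation starts only if a rule matches and allows it."""
    rule = first_matching_rule(labels, rules)
    return rule is not None and rule.auto_investigate
```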

4. Field Extraction

The Investigation Engine identifies the affected namespace, pod name, severity, and other context from the alert. Fast regex-based extraction runs first; AI-based extraction is used as a fallback for complex alert formats.
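A minimal sketch of the regex-first, AI-fallback pattern is shown below. The patterns assume `key=value` or `key: value` text, and `ai_fallback` is a hypothetical callable standing in for the model-based extractor; neither reflects OpsWorker's actual implementation.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only; real alert formats vary widely.
PATTERNS = {
    "namespace": re.compile(r'\bnamespace\s*[=:]\s*"?([a-z0-9](?:[a-z0-9-]*[a-z0-9])?)'),
    "pod": re.compile(r'\bpod\s*[=:]\s*"?([a-z0-9](?:[a-z0-9.-]*[a-z0-9])?)'),
    "severity": re.compile(r'\bseverity\s*[=:]\s*"?(critical|warning|info)\b'),
}


def extract_fields(text: str, ai_fallback: Optional[Callable] = None) -> dict:
    """Run cheap regex extraction first; ask the fallback only for gaps."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)

    missing = [name for name in PATTERNS if name not in fields]
    if missing and ai_fallback is not None:
        # The fallback receives the raw text and the names still missing.
        for key, value in ai_fallback(text, missing).items():
            fields.setdefault(key, value)  # never overwrite regex results
    return fields
```

Calling the fallback only when something is missing keeps the common case fast and avoids a model call for well-structured alerts.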

5. Topology Discovery

Starting from the identified resource, OpsWorker maps the dependency chain using breadth-first search. A pod crash investigation might discover: Deployment → ReplicaSet → Pod → Service → Ingress.
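The breadth-first traversal can be sketched as below. The `neighbors` callable is a hypothetical stand-in for whatever returns a resource's related resources (owner references, selectors, and so on), and the `max_depth` limit is an assumption added to keep the example bounded.

```python
from collections import deque
from typing import Callable, Iterable


def discover_topology(
    start: str,
    neighbors: Callable[[str], Iterable[str]],
    max_depth: int = 3,
) -> list:
    """Return resources reachable from `start`, in breadth-first order."""
    order = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


# Example with a hand-written graph (resource names are made up).
GRAPH = {
    "Pod/api-7d9f": ["ReplicaSet/api-7d9", "Service/api"],
    "ReplicaSet/api-7d9": ["Deployment/api", "Pod/api-7d9f"],
    "Service/api": ["Ingress/api", "Pod/api-7d9f"],
    "Deployment/api": ["ReplicaSet/api-7d9"],
    "Ingress/api": ["Service/api"],
}
topology = discover_topology("Pod/api-7d9f", lambda n: GRAPH.get(n, []))
```

Breadth-first order visits the resources closest to the alerting pod first, so a depth limit trims the least relevant resources rather than arbitrary ones.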

6. Data Collection

For each resource in the topology, the agent gathers logs, events, configurations, and status information. All queries are read-only.
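One way to enforce the read-only guarantee is an allowlist of kubectl verbs, sketched below. The command set, the `tail` default, and the function names are assumptions for illustration; the agent's real query mechanism is not described here.

```python
# Verbs that only read cluster state.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs"})


def ensure_read_only(argv: list) -> list:
    """Reject any command whose kubectl verb is not on the allowlist."""
    if len(argv) < 2 or argv[0] != "kubectl" or argv[1] not in READ_ONLY_VERBS:
        raise ValueError(f"refusing non-read-only command: {argv}")
    return argv


def collection_commands(kind: str, name: str, namespace: str, tail: int = 200) -> list:
    """Build the read-only queries for one resource in the topology."""
    kind = kind.lower()
    commands = [
        # Spec and status of the resource itself.
        ["kubectl", "get", kind, name, "-n", namespace, "-o", "json"],
        # Events that reference the resource by name.
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={name}"],
    ]
    if kind == "pod":
        # Only pods have container logs.
        commands.append(["kubectl", "logs", name, "-n", namespace, f"--tail={tail}"])
    return [ensure_read_only(cmd) for cmd in commands]
```

Commands are built as argument lists rather than shell strings, so resource names taken from an alert cannot inject extra shell syntax.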

7. AI Analysis

Multi-model AI analysis covers all collected data: it validates configurations, classifies the issue type, identifies the root cause, and assesses confidence in the result.

8. Recommendation Generation

Based on the analysis, specific remediation steps are generated with kubectl commands tailored to the actual resources and namespaces.
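To illustrate what "tailored to the actual resources and namespaces" can mean, here is a small sketch that fills command templates with the discovered names. The issue types and the template table are hypothetical; only the kubectl subcommands themselves (`rollout undo`, `rollout restart`) are standard.

```python
import shlex

# Hypothetical mapping from issue type to a remediation command template.
TEMPLATES = {
    "bad_rollout": "kubectl rollout undo deployment/{name} -n {namespace}",
    "stuck_pods": "kubectl rollout restart deployment/{name} -n {namespace}",
}


def render_recommendation(issue_type: str, name: str, namespace: str) -> str:
    """Fill a template with shell-quoted resource names."""
    template = TEMPLATES[issue_type]
    return template.format(
        name=shlex.quote(name),
        namespace=shlex.quote(namespace),
    )


command = render_recommendation("bad_rollout", "api", "prod")
```

Because a person may paste these commands into a shell, the values are quoted with `shlex.quote` before being substituted.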

9. Result Delivery

Results are stored in the database, a notification is sent to Slack, and the investigation is viewable in the portal.

Next Steps