Investigation Flow
End-to-End Data Flow
sequenceDiagram
participant Mon as Monitoring System
participant API as API Gateway
participant Norm as Normalizer
participant Rules as Rule Engine
participant AI as Investigation Engine
participant SQS as SQS
participant Agent as K8s Agent
participant K8s as Kubernetes API
participant DB as Data Store
participant Slack as Slack
Mon->>API: Alert webhook (POST)
API->>Norm: Raw alert payload
Norm->>DB: Store normalized signal
Norm->>Rules: Evaluate against alert rules
Rules->>AI: Match found → start investigation
Note over AI: Field Extraction
AI->>AI: Extract namespace, pod, severity
Note over AI: Topology Discovery
AI->>SQS: "Get pod details"
SQS->>Agent: Command
Agent->>K8s: kubectl get pod
K8s-->>Agent: Pod spec + status
Agent->>SQS: Results
SQS-->>AI: Pod data
AI->>SQS: "Get related service"
Agent->>K8s: kubectl get svc
AI->>AI: Build dependency graph
Note over AI: Data Collection
AI->>SQS: "Get logs, events, configs"
Agent->>K8s: Multiple queries
SQS-->>AI: Collected data
Note over AI: AI Analysis
AI->>AI: Root cause analysis
AI->>AI: Generate recommendations
AI->>DB: Store investigation results
DB->>Slack: Notification with summary
Steps in Detail
1. Alert Reception
The monitoring system sends an alert via webhook (HTTP POST) to OpsWorker's API Gateway. Supported formats: Prometheus Alertmanager, Grafana, Datadog, CloudWatch.
2. Normalization
The alert is converted into OpsWorker's common format regardless of source. Key fields extracted: alert name, severity, labels, annotations.
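As an illustration of the normalization step, here is a minimal sketch that maps a Prometheus Alertmanager webhook payload onto a common structure. The `NormalizedSignal` class and the `normalize_alertmanager` function are hypothetical names chosen for this example, not OpsWorker's actual schema. The only assumption about the input is the standard Alertmanager webhook shape: an `alerts` list whose entries carry `labels` and `annotations`.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizedSignal:
    """Hypothetical common format holding the key fields named above."""
    source: str
    alert_name: str
    severity: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def normalize_alertmanager(payload: dict) -> list[NormalizedSignal]:
    """Convert an Alertmanager webhook payload into one signal per alert."""
    signals = []
    for alert in payload.get("alerts", []):
        labels = dict(alert.get("labels", {}))
        signals.append(
            NormalizedSignal(
                source="alertmanager",
                # Alertmanager carries the alert name and severity as labels.
                alert_name=labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
                labels=labels,
                annotations=dict(alert.get("annotations", {})),
            )
        )
    return signals
```

Other sources (Grafana, Datadog, CloudWatch) would each need their own mapping function that produces the same structure.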
3. Rule Evaluation
The normalized signal is evaluated against configured alert rules. If a rule matches and auto-investigation is enabled, an investigation starts.
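The rule check can be pictured as a label match combined with an auto-investigation flag. The sketch below assumes that rules match on exact label values. `AlertRule`, `match_labels`, and `auto_investigate` are hypothetical names, and the real rule syntax may be richer.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """Hypothetical rule: every listed label must equal the signal's value."""
    name: str
    match_labels: dict = field(default_factory=dict)
    auto_investigate: bool = False


def should_investigate(labels: dict, rules: list[AlertRule]) -> bool:
    """Return True if any matching rule has auto-investigation enabled."""
    for rule in rules:
        # A rule with no match_labels matches every signal.
        matches = all(labels.get(key) == value for key, value in rule.match_labels.items())
        if matches and rule.auto_investigate:
            return True
    return False
```

A signal that matches a rule with auto-investigation disabled is still stored, but in this sketch it does not trigger an investigation.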
4. Field Extraction
OpsWorker identifies the affected namespace, pod name, and other context from the alert. Fast regex-based extraction runs first; AI-based extraction is used as a fallback for complex alert formats that the regexes cannot parse.
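The regex-first, AI-fallback order can be sketched as follows. The patterns and the `ai_fallback` callable are assumptions made for illustration. The actual patterns and model call are not documented here, so the fallback is passed in as a plain function.

```python
import re
from typing import Callable, Optional

# Illustrative patterns for text such as: namespace="payments" pod="api-0"
NAMESPACE_RE = re.compile(r'namespace[=:]\s*"?([a-z0-9][a-z0-9-]*)')
POD_RE = re.compile(r'pod[=:]\s*"?([a-z0-9][a-z0-9.-]*)')


def extract_fields(text: str, ai_fallback: Optional[Callable[[str], dict]] = None) -> dict:
    """Try cheap regex extraction first; call the fallback only for missing fields."""
    fields = {}
    namespace = NAMESPACE_RE.search(text)
    pod = POD_RE.search(text)
    if namespace:
        fields["namespace"] = namespace.group(1)
    if pod:
        fields["pod"] = pod.group(1)

    missing = {"namespace", "pod"} - fields.keys()
    if missing and ai_fallback is not None:
        # Values found by regex take precedence over fallback values.
        for key, value in ai_fallback(text).items():
            fields.setdefault(key, value)
    return fields
```

Running the cheap path first means the slower fallback is invoked only when an alert does not follow a recognizable format.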
5. Topology Discovery
Starting from the identified resource, OpsWorker maps the dependency chain using breadth-first search. A pod crash investigation might discover: Deployment → ReplicaSet → Pod → Service → Ingress.
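A breadth-first traversal of this kind can be sketched over an in-memory adjacency map. The `neighbors` map and the resource names are hypothetical. In practice each neighbor lookup would be a query sent to the agent rather than a dictionary access.

```python
from collections import deque


def discover_topology(start: str, neighbors: dict[str, list[str]]) -> list[str]:
    """Return resources reachable from `start`, in breadth-first order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for related in neighbors.get(current, []):
            if related not in seen:  # guards against cycles such as Pod <-> Service
                seen.add(related)
                visited.append(related)
                queue.append(related)
    return visited


# Hypothetical relations around a crashing pod.
example_neighbors = {
    "Pod/api-0": ["ReplicaSet/api-7d9f", "Service/api"],
    "ReplicaSet/api-7d9f": ["Deployment/api", "Pod/api-0"],
    "Service/api": ["Ingress/api", "Pod/api-0"],
}
topology = discover_topology("Pod/api-0", example_neighbors)
```

Breadth-first order visits the resources closest to the failing pod first, so the most relevant neighbors are discovered before more distant ones.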
6. Data Collection
For each resource in the topology, the agent gathers logs, events, configurations, and status information. All queries are read-only.
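One way to picture the read-only guarantee is an allowlist of kubectl verbs that is checked before any command runs. The sketch below only builds argument lists and does not execute anything. `READ_ONLY_VERBS` and `build_pod_queries` are hypothetical names, and the agent may well use the Kubernetes API directly rather than the kubectl CLI.

```python
# Verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs"}


def build_pod_queries(namespace: str, pod: str, tail: int = 200) -> list[list[str]]:
    """Build read-only kubectl invocations for one pod (status, logs, events)."""
    queries = [
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "json"],
        ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"],
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
    ]
    for argv in queries:
        # Refuse anything outside the allowlist, e.g. delete, apply, or patch.
        if argv[1] not in READ_ONLY_VERBS:
            raise ValueError(f"non-read-only verb: {argv[1]}")
    return queries
```

Keeping the check next to the command builder means a mutating verb added later is rejected before it reaches the cluster.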
7. AI Analysis
The multi-model AI analysis covers all collected data: it validates configurations, classifies the issue type, identifies the root cause, and assesses its confidence in that diagnosis.
8. Recommendation Generation
Based on the analysis, OpsWorker generates specific remediation steps, including kubectl commands tailored to the actual resources and namespaces involved.
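To show what "tailored" means in practice, the following sketch fills command templates with the resource names found during the investigation. The issue types and the template table are invented for this example. OpsWorker's recommendations come from the AI analysis, not from a fixed lookup table like this one.

```python
# Hypothetical issue types mapped to kubectl command templates.
REMEDIATION_TEMPLATES = {
    "bad_rollout": "kubectl rollout undo deployment/{deployment} -n {namespace}",
    "stuck_pods": "kubectl rollout restart deployment/{deployment} -n {namespace}",
}


def recommend(issue_type: str, namespace: str, deployment: str) -> list[str]:
    """Return kubectl commands for a known issue type, or an empty list."""
    template = REMEDIATION_TEMPLATES.get(issue_type)
    if template is None:
        return []
    return [template.format(deployment=deployment, namespace=namespace)]
```

Unlike the read-only queries used during data collection, these commands change cluster state, so they are presented as recommendations for an operator to review and run.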
9. Result Delivery
Results are stored in the database, a notification is sent to Slack, and the investigation is viewable in the portal.
Next Steps
- Data Collection — What data is gathered
- Data Processing — How data is analyzed