Investigation Flow

End-to-End Data Flow

sequenceDiagram
    participant Mon as Monitoring System
    participant API as API Gateway
    participant Norm as Normalizer
    participant Rules as Rule Engine
    participant AI as Investigation Engine
    participant SQS as SQS
    participant Agent as K8s Agent
    participant K8s as Kubernetes API
    participant DB as Data Store
    participant Slack as Slack

    Mon->>API: Alert webhook (POST)
    API->>Norm: Raw alert payload
    Norm->>DB: Store normalized signal
    Norm->>Rules: Evaluate against alert rules
    Rules->>AI: Match found → start investigation

    Note over AI: Field Extraction
    AI->>AI: Extract namespace, pod, severity

    Note over AI: Topology Discovery
    AI->>SQS: "Get pod details"
    SQS->>Agent: Command
    Agent->>K8s: kubectl get pod
    K8s-->>Agent: Pod spec + status
    Agent->>SQS: Results
    SQS-->>AI: Pod data
    AI->>SQS: "Get related service"
    Agent->>K8s: kubectl get svc
    AI->>AI: Build dependency graph

    Note over AI: Data Collection
    AI->>SQS: "Get logs, events, configs"
    Agent->>K8s: Multiple queries
    SQS-->>AI: Collected data

    Note over AI: AI Analysis
    AI->>AI: Root cause analysis
    AI->>AI: Generate recommendations

    AI->>DB: Store investigation results
    DB->>Slack: Notification with summary

Steps in Detail

1. Alert Reception

The monitoring system sends an alert via webhook (HTTP POST) to OpsWorker's API Gateway. Supported native formats: Prometheus AlertManager (also used by Grafana Alerting, which speaks the AlertManager webhook format) and Datadog. Any source that can post a webhook payload compatible with one of these formats works without custom configuration.

2. Normalization

The alert is converted into OpsWorker's common format regardless of source. Key fields extracted: alert name, severity, labels, annotations.
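
As an illustration of what normalization can look like, the sketch below maps a Prometheus AlertManager webhook payload onto a common structure. The `alerts`, `labels`, and `annotations` keys come from the AlertManager webhook format; the `NormalizedSignal` class and its field names are assumptions made for this example, not OpsWorker's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NormalizedSignal:
    """Hypothetical common format; field names are illustrative only."""
    source: str
    alert_name: str
    severity: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def normalize_alertmanager(payload: dict) -> list[NormalizedSignal]:
    """Produce one normalized signal per alert in an AlertManager webhook payload."""
    signals = []
    for alert in payload.get("alerts", []):
        labels = dict(alert.get("labels", {}))
        signals.append(
            NormalizedSignal(
                source="alertmanager",
                # AlertManager carries the alert name and severity as labels.
                alert_name=labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
                labels=labels,
                annotations=dict(alert.get("annotations", {})),
            )
        )
    return signals


# Sample payload trimmed to the keys used above; the values are made up.
payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "severity": "critical",
                "namespace": "payments",
                "pod": "checkout-7d9f8b6c5-x2k4q",
            },
            "annotations": {"summary": "Pod is restarting repeatedly"},
        }
    ],
}
signals = normalize_alertmanager(payload)
```

A Datadog payload would need its own mapping function that returns the same structure, so that later steps never have to branch on the alert source.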

3. Rule Evaluation

The normalized signal is evaluated against configured alert rules. If a rule matches and auto-investigation is enabled, an investigation starts.
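
The matching step can be pictured as a label comparison followed by a check of the auto-investigation flag. The following is a minimal sketch under that assumption; `AlertRule`, `match_labels`, and `auto_investigate` are hypothetical names, and real rules may support richer matchers than exact equality.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """Hypothetical rule: every entry in match_labels must equal the signal's label."""
    name: str
    match_labels: dict = field(default_factory=dict)
    auto_investigate: bool = True


def should_investigate(labels: dict, rules: list) -> bool:
    """Return True if any rule matches the labels and has auto-investigation enabled."""
    for rule in rules:
        matches = all(labels.get(key) == value for key, value in rule.match_labels.items())
        if matches and rule.auto_investigate:
            return True
    return False


rules = [
    AlertRule(name="critical-payments", match_labels={"severity": "critical", "namespace": "payments"}),
    AlertRule(name="staging-noise", match_labels={"namespace": "staging"}, auto_investigate=False),
]

critical_alert = {"severity": "critical", "namespace": "payments", "pod": "checkout-7d9f8b6c5-x2k4q"}
staging_alert = {"severity": "critical", "namespace": "staging"}
```

In this sketch a rule that matches but has auto-investigation disabled does not start an investigation, which mirrors the two conditions described above.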

4. Field Extraction

OpsWorker identifies the affected namespace, pod name, and other context from the alert. Fast regex-based extraction runs first; AI-based extraction is used as a fallback for complex alert formats that the patterns cannot parse.
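
The regex-first, AI-fallback order can be sketched as below. The patterns, the `extract_fields` function, and the `ai_fallback` callable are all assumptions for illustration; the fallback is passed in as a plain function so the example stays runnable without a model.

```python
import re
from typing import Callable, Optional

# Illustrative patterns for "key=value" or "key: value" fragments in alert text.
PATTERNS = {
    "namespace": re.compile(r'\bnamespace\s*[=:]\s*"?([a-z0-9](?:[-a-z0-9]*[a-z0-9])?)'),
    "pod": re.compile(r'\bpod\s*[=:]\s*"?([a-z0-9](?:[-a-z0-9.]*[a-z0-9])?)'),
}


def extract_fields(
    text: str,
    ai_fallback: Optional[Callable[[str, list], dict]] = None,
) -> dict:
    """Try cheap regex extraction first; ask the fallback only for missing fields."""
    fields = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[key] = match.group(1)

    missing = [key for key in PATTERNS if key not in fields]
    if missing and ai_fallback is not None:
        # Hypothetical hook: a model-backed extractor would be called here.
        for key, value in ai_fallback(text, missing).items():
            if key in missing:
                fields[key] = value
    return fields


alert_text = "Pod crash looping: namespace=payments pod=checkout-7d9f8b6c5-x2k4q restarted 5 times"
fields = extract_fields(alert_text)
```

Because the fallback only receives the fields that the patterns missed, well-structured alerts never reach the slower path.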

5. Topology Discovery

Starting from the identified resource, OpsWorker maps the dependency chain using breadth-first search. A pod crash investigation might discover: Deployment → ReplicaSet → Pod → Service → Ingress.
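
A breadth-first traversal of this kind can be sketched as follows. The `RELATED` map stands in for lookups that would really go through the agent, and the resource names and the `max_depth` limit are assumptions made for the example.

```python
from collections import deque

# Hypothetical relationships; in practice each lookup would be a query to the cluster.
RELATED = {
    "Pod/checkout-abc": ["ReplicaSet/checkout-7d9", "Service/checkout"],
    "ReplicaSet/checkout-7d9": ["Deployment/checkout"],
    "Service/checkout": ["Ingress/shop"],
    "Deployment/checkout": [],
    "Ingress/shop": [],
}


def discover_topology(start: str, neighbors, max_depth: int = 3) -> list:
    """Breadth-first search from the affected resource, returning resources in visit order."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for related in neighbors(node):
            if related not in seen:
                seen.add(related)
                visited.append(related)
                queue.append((related, depth + 1))
    return visited


topology = discover_topology("Pod/checkout-abc", lambda node: RELATED.get(node, []))
```

Breadth-first order visits the resources closest to the failing pod first, so a depth limit trims the most distant dependencies rather than the most relevant ones.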

6. Data Collection

For each resource in the topology, the agent gathers logs, events, configurations, and status information. All queries are read-only.
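
One way to picture the read-only guarantee is an allowlist of kubectl verbs that every query must pass before it runs. The sketch below only builds and checks argument lists and does not execute anything; the `build_queries` and `assert_read_only` helpers, the verb allowlist, and the `--tail=200` limit are assumptions, not the agent's actual implementation.

```python
# Verbs that only read cluster state (an illustrative allowlist).
READ_ONLY_VERBS = {"get", "describe", "logs"}


def build_queries(kind: str, name: str, namespace: str) -> list:
    """Build kubectl argument lists for the status, events, and logs of one resource."""
    queries = [
        ["kubectl", "get", kind, name, "-n", namespace, "-o", "json"],
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={name}"],
    ]
    if kind == "pod":
        # Logs only exist for pods; cap the volume with --tail.
        queries.append(["kubectl", "logs", name, "-n", namespace, "--tail=200"])
    return queries


def assert_read_only(argv: list) -> None:
    """Reject any command whose verb is not on the read-only allowlist."""
    if len(argv) < 2 or argv[0] != "kubectl" or argv[1] not in READ_ONLY_VERBS:
        raise ValueError(f"refusing non-read-only command: {argv}")


queries = build_queries("pod", "checkout-7d9f8b6c5-x2k4q", "payments")
for argv in queries:
    assert_read_only(argv)
```

Checking the verb in one place means a mutating command such as `kubectl delete` is rejected regardless of which part of the pipeline produced it.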

7. AI Analysis

A multi-model AI stage analyzes all collected data: it validates configurations, classifies the issue type, identifies the root cause, and assesses its confidence in that conclusion.

8. Recommendation Generation

Based on the analysis, specific remediation steps are generated with kubectl commands tailored to the actual resources and namespaces.

9. Result Delivery

Results are stored in the database, a notification is sent to Slack, and the investigation is viewable in the portal.

Next Steps

Data Collection — What data is gathered
Data Processing — How data is analyzed
