# Architecture Overview

## High-Level Design
OpsWorker is a cloud-native, serverless platform built on AWS. It consists of three main components: the OpsWorker Cloud backend, a lightweight in-cluster Kubernetes agent, and integrations with your existing monitoring and team tools:
```mermaid
graph TB
    subgraph "Your Infrastructure"
        Mon[Monitoring Systems<br/>Prometheus, Grafana, Datadog]
        K8s[Kubernetes Clusters]
        Agent[OpsWorker Agent<br/>in-cluster]
    end
    subgraph "OpsWorker Cloud (AWS)"
        API[API Gateway]
        Norm[Alert Normalizer]
        Rules[Rule Engine]
        AI[AI Investigation Engine]
        DB[(Data Store)]
        Notify[Notification Service]
    end
    subgraph "Team Tools"
        Slack[Slack]
        Portal[OpsWorker Portal]
    end
    Mon -->|Webhook| API
    API --> Norm --> Rules --> AI
    AI <-->|SQS| Agent
    Agent <-->|kubectl| K8s
    AI --> DB
    DB --> Notify
    Notify --> Slack
    DB --> Portal
```
## Components

### OpsWorker Cloud
The serverless backend running on AWS:
| Component | Technology | Purpose |
|---|---|---|
| API Gateway | AWS API Gateway | Receives alert webhooks |
| Alert Normalizer | AWS Lambda | Converts alerts to common format |
| Rule Engine | AWS Lambda | Evaluates alert rules |
| Investigation Engine | AWS Lambda + Strands | Multi-agent AI investigation |
| Data Store | DynamoDB + S3 | Investigation data and results |
| Notification Service | SNS + Lambda | Slack notifications and daily digests |
| Customer Portal | Next.js (CloudFront) | Web interface for users |
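To illustrate the Alert Normalizer's job of converting alerts to a common format, here is a minimal Go sketch. The `NormalizedAlert` struct and field names are illustrative assumptions, not the actual OpsWorker schema; the input mirrors the label map of a Prometheus Alertmanager webhook alert.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NormalizedAlert is a hypothetical common format; the field names are
// illustrative, not OpsWorker's real schema.
type NormalizedAlert struct {
	Source   string            `json:"source"`
	Name     string            `json:"name"`
	Severity string            `json:"severity"`
	Labels   map[string]string `json:"labels"`
}

// promAlert mirrors the relevant subset of one Alertmanager webhook alert.
type promAlert struct {
	Labels map[string]string `json:"labels"`
}

// normalizePrometheus converts a single Prometheus alert into the common format.
func normalizePrometheus(raw []byte) (NormalizedAlert, error) {
	var a promAlert
	if err := json.Unmarshal(raw, &a); err != nil {
		return NormalizedAlert{}, err
	}
	return NormalizedAlert{
		Source:   "prometheus",
		Name:     a.Labels["alertname"],
		Severity: a.Labels["severity"],
		Labels:   a.Labels,
	}, nil
}

func main() {
	raw := []byte(`{"labels":{"alertname":"HighCPU","severity":"critical","pod":"api-0"}}`)
	n, _ := normalizePrometheus(raw)
	fmt.Println(n.Source, n.Name, n.Severity) // prometheus HighCPU critical
}
```

A per-source adapter like this is what lets the Rule Engine and Investigation Engine stay agnostic about which monitoring system raised the alert.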
### Kubernetes Agent
A lightweight Go binary deployed in customer clusters via Helm:
- Communicates outbound-only via AWS SQS
- Executes read-only kubectl commands
- Returns data to the investigation engine
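The read-only guarantee above can be sketched as a verb allowlist that the agent checks before executing any requested command. The allowlist contents and function names here are assumptions for illustration; the real agent's policy may be enforced differently (e.g. via RBAC as well).

```go
package main

import (
	"fmt"
	"strings"
)

// readOnlyVerbs is an illustrative allowlist of non-mutating kubectl verbs;
// the actual agent's policy may differ.
var readOnlyVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true, "explain": true,
}

// isReadOnly reports whether a kubectl command's verb is on the allowlist.
// Anything not explicitly allowed (apply, delete, exec, ...) is rejected.
func isReadOnly(cmd string) bool {
	fields := strings.Fields(cmd)
	if len(fields) < 2 || fields[0] != "kubectl" {
		return false
	}
	return readOnlyVerbs[fields[1]]
}

func main() {
	fmt.Println(isReadOnly("kubectl get pods -n prod")) // true
	fmt.Println(isReadOnly("kubectl delete pod api-0")) // false
}
```

Default-deny allowlisting means a new mutating verb can never slip through unnoticed; it must be explicitly added to be executable.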
### Customer Integrations
External services connected to OpsWorker:
- Alert sources: Prometheus, Grafana, Datadog, CloudWatch
- Notifications: Slack
- Code: GitHub, GitLab
- AI: AWS Bedrock (default), Azure OpenAI, BYO LLM
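Supporting Bedrock, Azure OpenAI, and BYO LLM side by side suggests a small provider abstraction behind the Investigation Engine. The interface and provider names below are purely illustrative assumptions, not OpsWorker's actual plugin API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// LLMProvider abstracts the model backend so Bedrock, Azure OpenAI, or a
// customer-supplied endpoint can be swapped in. Interface is hypothetical.
type LLMProvider interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type bedrockProvider struct{}

func (bedrockProvider) Complete(ctx context.Context, p string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

type azureProvider struct{}

func (azureProvider) Complete(ctx context.Context, p string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// newProvider resolves a backend from configuration, defaulting to Bedrock.
func newProvider(name string) (LLMProvider, error) {
	switch name {
	case "", "bedrock":
		return bedrockProvider{}, nil
	case "azure-openai":
		return azureProvider{}, nil
	default:
		return nil, fmt.Errorf("unknown provider %q", name)
	}
}

func main() {
	p, err := newProvider("")
	fmt.Printf("%T %v\n", p, err)
}
```

Keeping the engine coded against the interface, rather than any one vendor SDK, is what makes the BYO LLM option possible.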
## Design Principles
| Principle | Implementation |
|---|---|
| Security-first | Read-only cluster access, outbound-only communication, no stored credentials |
| Serverless | No servers to manage, automatic scaling, pay-per-use |
| Multi-tenant | Organization-level data isolation |
| Event-driven | Alerts trigger investigations asynchronously via message queues |
| Multi-model AI | Different AI models optimized for each investigation stage |
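The multi-tenant principle can be sketched with a single-table DynamoDB key layout in which every partition key carries the organization ID. The prefixes and helper names below are assumptions for illustration, not the actual table schema:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical key layout: every partition key is prefixed with the
// organization ID, so a query scoped to one org's partition can never
// return another tenant's items.
func orgPK(orgID string) string        { return "ORG#" + orgID }
func investigationSK(id string) string { return "INV#" + id }

// belongsTo reports whether an item's partition key falls inside an
// organization's partition.
func belongsTo(pk, orgID string) bool {
	return strings.HasPrefix(pk, "ORG#") && pk == orgPK(orgID)
}

func main() {
	pk, sk := orgPK("acme"), investigationSK("2024-0042")
	fmt.Println(pk, sk, belongsTo(pk, "acme"), belongsTo(pk, "globex"))
	// ORG#acme INV#2024-0042 true false
}
```

Encoding tenancy into the partition key makes isolation a property of every read path, not something each query must remember to filter for.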
## Next Steps
- Investigation Flow — Detailed data flow
- Security & Compliance — Security architecture
- Deployment Options — Available deployment models