# Architecture Overview

## High-Level Design
OpsWorker is a cloud-native, serverless platform built on AWS. It consists of three main components: the OpsWorker Cloud backend, a lightweight in-cluster Kubernetes agent, and integrations with your existing monitoring and team tools:
```mermaid
graph TB
    subgraph "Your Infrastructure"
        Mon[Monitoring Systems<br/>Prometheus, Grafana, Datadog]
        K8s[Kubernetes Clusters]
        Agent[OpsWorker Agent<br/>in-cluster]
    end
    subgraph "OpsWorker Cloud (AWS)"
        API[API Gateway]
        Norm[Alert Normalizer]
        Rules[Rule Engine]
        AI[AI Investigation Engine]
        DB[(Data Store)]
        Notify[Notification Service]
    end
    subgraph "Team Tools"
        Slack[Slack]
        Portal[OpsWorker Portal]
    end
    Mon -->|Webhook| API
    API --> Norm --> Rules --> AI
    AI <-->|SQS| Agent
    Agent <-->|kubectl| K8s
    AI --> DB
    DB --> Notify
    Notify --> Slack
    DB --> Portal
```
## Components

### OpsWorker Cloud
The serverless backend running on AWS:
| Component | Technology | Purpose |
|---|---|---|
| API Gateway | AWS API Gateway | Receives alert webhooks |
| Alert Normalizer | AWS Lambda | Converts alerts to common format |
| Rule Engine | AWS Lambda | Evaluates alert rules |
| Investigation Engine | AWS Lambda + Strands | Multi-agent AI investigation |
| Data Store | DynamoDB + S3 | Investigation data and results |
| Notification Service | SNS + Lambda | Slack notifications and daily digests |
| Customer Portal | Next.js (CloudFront) | Web interface for users |
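To illustrate the Alert Normalizer's job of converting alerts to a common format, here is a minimal Go sketch. The `NormalizedAlert` struct and field names are illustrative assumptions, not the actual OpsWorker schema; the input mirrors the label map of a Prometheus Alertmanager webhook alert.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NormalizedAlert is a hypothetical common format; the field names are
// illustrative, not OpsWorker's real schema.
type NormalizedAlert struct {
	Source   string            `json:"source"`
	Name     string            `json:"name"`
	Severity string            `json:"severity"`
	Labels   map[string]string `json:"labels"`
}

// promAlert mirrors the relevant subset of one Alertmanager webhook alert.
type promAlert struct {
	Labels map[string]string `json:"labels"`
}

// normalizePrometheus converts a single Prometheus alert into the common format.
func normalizePrometheus(raw []byte) (NormalizedAlert, error) {
	var a promAlert
	if err := json.Unmarshal(raw, &a); err != nil {
		return NormalizedAlert{}, err
	}
	return NormalizedAlert{
		Source:   "prometheus",
		Name:     a.Labels["alertname"],
		Severity: a.Labels["severity"],
		Labels:   a.Labels,
	}, nil
}

func main() {
	raw := []byte(`{"labels":{"alertname":"HighCPU","severity":"critical","pod":"api-0"}}`)
	n, _ := normalizePrometheus(raw)
	fmt.Println(n.Source, n.Name, n.Severity) // prometheus HighCPU critical
}
```

A per-source adapter like this is what lets the Rule Engine and Investigation Engine stay agnostic about which monitoring system raised the alert.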
### Kubernetes Agent
A lightweight Go binary deployed in customer clusters via Helm:
- Communicates outbound-only via AWS SQS
- Executes read-only kubectl commands
- Returns data to the investigation engine
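The read-only guarantee above can be sketched as a verb allowlist that the agent checks before executing any requested command. The allowlist contents and function names here are assumptions for illustration; the real agent's policy may be enforced differently (e.g. via RBAC as well).

```go
package main

import (
	"fmt"
	"strings"
)

// readOnlyVerbs is an illustrative allowlist of non-mutating kubectl verbs;
// the actual agent's policy may differ.
var readOnlyVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true, "explain": true,
}

// isReadOnly reports whether a kubectl command's verb is on the allowlist.
// Anything not explicitly allowed (apply, delete, exec, ...) is rejected.
func isReadOnly(cmd string) bool {
	fields := strings.Fields(cmd)
	if len(fields) < 2 || fields[0] != "kubectl" {
		return false
	}
	return readOnlyVerbs[fields[1]]
}

func main() {
	fmt.Println(isReadOnly("kubectl get pods -n prod")) // true
	fmt.Println(isReadOnly("kubectl delete pod api-0")) // false
}
```

Default-deny allowlisting means a new mutating verb can never slip through unnoticed; it must be explicitly added to be executable.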
### Customer Integrations
External services connected to OpsWorker:
- Alert sources: Prometheus, Grafana, Datadog, CloudWatch
- Notifications: Slack
- Code: GitHub, GitLab
- AI: AWS Bedrock (default), Azure OpenAI, BYO LLM
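Supporting Bedrock, Azure OpenAI, and BYO LLM side by side suggests a small provider abstraction behind the Investigation Engine. The interface and provider names below are purely illustrative assumptions, not OpsWorker's actual plugin API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// LLMProvider abstracts the model backend so Bedrock, Azure OpenAI, or a
// customer-supplied endpoint can be swapped in. Interface is hypothetical.
type LLMProvider interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type bedrockProvider struct{}

func (bedrockProvider) Complete(ctx context.Context, p string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

type azureProvider struct{}

func (azureProvider) Complete(ctx context.Context, p string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// newProvider resolves a backend from configuration, defaulting to Bedrock.
func newProvider(name string) (LLMProvider, error) {
	switch name {
	case "", "bedrock":
		return bedrockProvider{}, nil
	case "azure-openai":
		return azureProvider{}, nil
	default:
		return nil, fmt.Errorf("unknown provider %q", name)
	}
}

func main() {
	p, err := newProvider("")
	fmt.Printf("%T %v\n", p, err)
}
```

Keeping the engine coded against the interface, rather than any one vendor SDK, is what makes the BYO LLM option possible.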
## Design Principles
| Principle | Implementation |
|---|---|
| Security-first | Read-only cluster access, outbound-only communication, no stored credentials |
| Serverless | No servers to manage, automatic scaling, pay-per-use |
| Multi-tenant | Organization-level data isolation |
| Event-driven | Alerts trigger investigations asynchronously via message queues |
| Multi-model AI | Different AI models optimized for each investigation stage |
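The multi-tenant principle can be sketched with a single-table DynamoDB key layout in which every partition key carries the organization ID. The prefixes and helper names below are assumptions for illustration, not the actual table schema:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical key layout: every partition key is prefixed with the
// organization ID, so a query scoped to one org's partition can never
// return another tenant's items.
func orgPK(orgID string) string        { return "ORG#" + orgID }
func investigationSK(id string) string { return "INV#" + id }

// belongsTo reports whether an item's partition key falls inside an
// organization's partition.
func belongsTo(pk, orgID string) bool {
	return strings.HasPrefix(pk, "ORG#") && pk == orgPK(orgID)
}

func main() {
	pk, sk := orgPK("acme"), investigationSK("2024-0042")
	fmt.Println(pk, sk, belongsTo(pk, "acme"), belongsTo(pk, "globex"))
	// ORG#acme INV#2024-0042 true false
}
```

Encoding tenancy into the partition key makes isolation a property of every read path, not something each query must remember to filter for.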
## Next Steps
- Investigation Flow — Detailed data flow
- Security & Compliance — Security architecture
- Deployment Options — Available deployment models