The Problem We're Solving

When a Kubernetes alert fires at 3 AM, SREs face a familiar drill: SSH into clusters, hunt through logs, trace service dependencies, and piece together what went wrong. This manual investigation is time-consuming and repetitive, and it delays incident resolution. We built OpsWorker to automate this entire workflow.

What OpsWorker Does

OpsWorker is an AI-powered observability platform that automatically investigates Kubernetes alerts. When an alert arrives from Prometheus AlertManager, Datadog, or CloudWatch, OpsWorker springs into action: it extracts resource information, discovers related Kubernetes objects (pods, services, deployments, ingresses), gathers runtime data (logs, events, configurations), and generates a comprehensive root cause analysis with actionable remediation steps. The entire investigation happens in under a minute, 24/7.
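The first step of that pipeline — extracting resource information from an incoming alert — can be sketched as a small normalizer over an AlertManager webhook. This is a minimal illustration (the helper name and output shape are our own, and the label fields follow AlertManager's standard webhook payload), not OpsWorker's actual code:

```python
# Sketch: pulling Kubernetes resource info out of a Prometheus
# AlertManager webhook payload. Field names follow AlertManager's
# webhook format; extract_resources itself is illustrative.

def extract_resources(webhook: dict) -> list[dict]:
    """List the Kubernetes objects each firing alert points at."""
    resources = []
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        resources.append({
            "alertname": labels.get("alertname"),
            "namespace": labels.get("namespace"),
            # Guess the object kind from which identifying label is present.
            "kind": "pod" if "pod" in labels
                    else "service" if "service" in labels
                    else "unknown",
            "name": labels.get("pod") or labels.get("service"),
        })
    return resources

payload = {
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "KubePodCrashLooping",
            "namespace": "payments",
            "pod": "checkout-7d9f4c-xk2lp",
        },
    }]
}
print(extract_resources(payload))
```

From a record like this, the investigation flow can fan out to related objects (the owning Deployment, fronting Service, Ingress) before gathering logs and events.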

Results are delivered instantly via **Slack notifications**, so your on-call engineers get actionable insights right where they work. Engineers can rate the analysis quality directly in Slack, with no context switching required. This feedback loop lets us continuously improve our AI models and investigation strategies by learning from real incidents.
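A notification like that might be assembled as a Slack Block Kit payload with feedback buttons attached. The block structure follows Slack's `chat.postMessage` API; the `action_id` values and helper are illustrative assumptions:

```python
# Sketch of a Slack Block Kit message for an investigation result,
# with feedback buttons. Block shapes follow Slack's Block Kit spec;
# the action_ids are our own illustration.

def build_slack_message(alert_name: str, root_cause: str, steps: list[str]) -> dict:
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"Investigation: {alert_name}"}},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"*Root cause:* {root_cause}"}},
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": "*Remediation:*\n" + "\n".join(f"• {s}" for s in steps)}},
            # Feedback buttons drive the improvement loop described above.
            {"type": "actions", "elements": [
                {"type": "button", "action_id": "feedback_helpful",
                 "text": {"type": "plain_text", "text": "👍 Helpful"}},
                {"type": "button", "action_id": "feedback_unhelpful",
                 "text": {"type": "plain_text", "text": "👎 Off the mark"}},
            ]},
        ]
    }
```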

Human-in-the-Loop with AI Chat

But we didn't stop at automated reports. Our **Chat interface** lets SREs have a conversation with the investigation. Need to dig deeper? Ask follow-up questions like "Why did the pod run out of memory?" or "Show me similar incidents from last week." The chat uses **Claude Haiku** for fast, responsive interactions and can even execute live Kubernetes queries on demand. This human-in-the-loop approach combines the speed of automation with the intuition of experienced engineers.
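A single chat turn against Claude Haiku can be sketched with the Bedrock Converse API (boto3's `converse_stream`). The model ID is Bedrock's published identifier for Claude 3 Haiku; the helper names are assumptions, and the live Kubernetes query tooling is omitted:

```python
# Hedged sketch of one streaming chat turn via Amazon Bedrock's
# Converse API. Helper names are illustrative; tool-use wiring for
# live Kubernetes queries is left out.

HAIKU_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def to_converse_messages(history):
    """Map (role, text) pairs into Converse-API message dicts."""
    return [{"role": role, "content": [{"text": text}]} for role, text in history]

def stream_chat_turn(history, question):
    """Yield response text chunks as Haiku streams them back."""
    import boto3  # deferred so the module imports without AWS deps
    client = boto3.client("bedrock-runtime")
    messages = to_converse_messages(history) + [
        {"role": "user", "content": [{"text": question}]}
    ]
    response = client.converse_stream(modelId=HAIKU_MODEL_ID, messages=messages)
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            yield event["contentBlockDelta"]["delta"]["text"]
```

Streaming chunk by chunk is what makes Haiku feel responsive over the WebSocket connection to the UI.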

Built on AWS Serverless & Amazon Bedrock

We're 100% serverless on AWS. Alerts flow through API Gateway → SQS → Lambda for normalization and storage in DynamoDB. EventBridge triggers our investigation workflow, which runs as a Lambda function orchestrating an AI agent graph.
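The normalization step of that flow can be sketched as a Lambda handler that drains SQS records into DynamoDB. This is a minimal sketch under stated assumptions: the table name, the normalized schema, and the `normalize` helper are all illustrative:

```python
# Minimal sketch of the normalization Lambda: SQS records carrying
# JSON alert webhooks are flattened into a common shape and stored.
# Table name and schema are illustrative assumptions.
import hashlib
import json

def normalize(raw: dict, source: str) -> dict:
    """Flatten a provider-specific alert into a common record."""
    labels = raw.get("labels", raw)
    key = f'{source}:{labels.get("alertname", "unknown")}:{labels.get("namespace", "")}'
    return {
        "alert_id": hashlib.sha256(key.encode()).hexdigest()[:16],  # stable dedup key
        "source": source,
        "name": labels.get("alertname", "unknown"),
        "namespace": labels.get("namespace"),
        "raw": raw,  # keep the original payload for the investigation step
    }

def handler(event, context):
    import boto3  # deferred so the module imports without AWS deps
    table = boto3.resource("dynamodb").Table("opsworker-alerts")  # illustrative name
    for record in event["Records"]:
        item = normalize(json.loads(record["body"]), source="alertmanager")
        table.put_item(Item=item)
```

Writing the item to DynamoDB is also what lets EventBridge pick up the change and kick off the investigation workflow.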

Our Amazon Bedrock model strategy:

  • Claude Sonnet - Complex reasoning and root cause analysis in the investigation flow
  • Amazon Nova - Fast extraction tasks during alert processing
  • Claude Haiku - Responsive, low-latency chat interactions
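This per-task routing can be expressed as a small lookup over Bedrock model IDs. The IDs below are the publicly documented Bedrock identifiers, but the mapping itself is our illustration of the strategy, not OpsWorker's actual configuration:

```python
# Sketch of per-task Bedrock model routing. Model IDs are Bedrock's
# published identifiers; the TASK_MODELS mapping is illustrative.

TASK_MODELS = {
    "root_cause_analysis": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # deep reasoning
    "alert_extraction": "amazon.nova-lite-v1:0",                         # fast extraction
    "chat": "anthropic.claude-3-haiku-20240307-v1:0",                    # low-latency chat
}

def model_for(task: str) -> str:
    # Fall back to the cheapest, fastest model for unknown tasks.
    return TASK_MODELS.get(task, TASK_MODELS["chat"])
```

Centralizing the mapping keeps cost/latency trade-offs in one place: swapping a model for one task doesn't touch the others.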

We communicate with customer Kubernetes clusters via SQS FIFO queues to a lightweight in-cluster agent. The Review UI is served via CloudFront + S3, with a **WebSocket API** for real-time chat streaming.
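Dispatching a command to the in-cluster agent over a FIFO queue might look like the sketch below, where the per-org `MessageGroupId` keeps each tenant's commands strictly ordered. Queue URL, message shape, and helper names are illustrative assumptions:

```python
# Hedged sketch of sending a command to the in-cluster agent via an
# SQS FIFO queue. Message shape and helper names are illustrative;
# MessageGroupId gives per-org ordering, per SQS FIFO semantics.
import json
import uuid

def build_agent_command(org_id: str, action: str, params: dict) -> dict:
    return {
        "MessageBody": json.dumps({"action": action, "params": params}),
        "MessageGroupId": org_id,                     # strict ordering per org
        "MessageDeduplicationId": str(uuid.uuid4()),  # FIFO dedup key
    }

def send_agent_command(sqs_client, queue_url: str, org_id: str,
                       action: str, params: dict):
    return sqs_client.send_message(QueueUrl=queue_url,
                                   **build_agent_command(org_id, action, params))
```

Because ordering is scoped to the `MessageGroupId`, one org's slow cluster never blocks command delivery to another org.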

[Diagram: OpsWorker AWS architecture]

[Diagram: OpsWorker component view]

Why We Chose This Architecture

Serverless was a natural fit—alerts are bursty and unpredictable. We pay only for what we use, scale automatically, and have zero infrastructure to manage. Amazon Bedrock gives us access to best-in-class foundation models without managing ML infrastructure—and lets us pick the right model for each task (Sonnet for depth, Haiku for speed, Nova for extraction). The multi-tenant design (per-org DynamoDB tables, SQS message groups) lets us serve multiple customers securely from a single deployment.