The Problem We're Solving

When a Kubernetes alert fires at 3 AM, SREs face a familiar drill: SSH into clusters, hunt through logs, trace service dependencies, and piece together what went wrong. This manual investigation is time-consuming and repetitive, and it delays incident resolution. We built OpsWorker to automate this entire workflow.

What OpsWorker Does

OpsWorker is an AI-powered observability platform that automatically investigates Kubernetes alerts. When an alert arrives from Prometheus AlertManager, Datadog, or CloudWatch, OpsWorker springs into action: it extracts resource information, discovers related Kubernetes objects (pods, services, deployments, ingresses), gathers runtime data (logs, events, configurations), and generates a comprehensive root cause analysis with actionable remediation steps. The entire investigation happens in under a minute, 24/7.
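The first step of that pipeline — extracting resource information from an incoming alert — can be sketched as a small normalizer over an AlertManager webhook. This is a minimal illustration (the helper name and output shape are our own, and the label fields follow AlertManager's standard webhook payload), not OpsWorker's actual code:

```python
# Sketch: pulling Kubernetes resource info out of a Prometheus
# AlertManager webhook payload. Field names follow AlertManager's
# webhook format; extract_resources itself is illustrative.

def extract_resources(webhook: dict) -> list[dict]:
    """List the Kubernetes objects each firing alert points at."""
    resources = []
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        resources.append({
            "alertname": labels.get("alertname"),
            "namespace": labels.get("namespace"),
            # Guess the object kind from which identifying label is present.
            "kind": "pod" if "pod" in labels
                    else "service" if "service" in labels
                    else "unknown",
            "name": labels.get("pod") or labels.get("service"),
        })
    return resources

payload = {
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "KubePodCrashLooping",
            "namespace": "payments",
            "pod": "checkout-7d9f4c-xk2lp",
        },
    }]
}
print(extract_resources(payload))
```

From a record like this, the investigation flow can fan out to related objects (the owning Deployment, fronting Service, Ingress) before gathering logs and events.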

Results are delivered instantly via **Slack notifications**, so your on-call engineers get actionable insights right where they work. Engineers can rate the analysis quality directly in Slack, with no context switching required. This feedback loop lets us continuously improve our AI models and investigation strategies by learning from real incidents.
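A notification like that might be assembled as a Slack Block Kit payload with feedback buttons attached. The block structure follows Slack's `chat.postMessage` API; the `action_id` values and helper are illustrative assumptions:

```python
# Sketch of a Slack Block Kit message for an investigation result,
# with feedback buttons. Block shapes follow Slack's Block Kit spec;
# the action_ids are our own illustration.

def build_slack_message(alert_name: str, root_cause: str, steps: list[str]) -> dict:
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"Investigation: {alert_name}"}},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"*Root cause:* {root_cause}"}},
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": "*Remediation:*\n" + "\n".join(f"• {s}" for s in steps)}},
            # Feedback buttons drive the improvement loop described above.
            {"type": "actions", "elements": [
                {"type": "button", "action_id": "feedback_helpful",
                 "text": {"type": "plain_text", "text": "👍 Helpful"}},
                {"type": "button", "action_id": "feedback_unhelpful",
                 "text": {"type": "plain_text", "text": "👎 Off the mark"}},
            ]},
        ]
    }
```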

Human-in-the-Loop with AI Chat

But we didn't stop at automated reports. Our **Chat interface** lets SREs have a conversation with the investigation. Need to dig deeper? Ask follow-up questions like "Why did the pod run out of memory?" or "Show me similar incidents from last week." The chat uses **Claude Haiku** for fast, responsive interactions and can even execute live Kubernetes queries on demand. This human-in-the-loop approach combines the speed of automation with the intuition of experienced engineers.
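A single chat turn against Claude Haiku can be sketched with the Bedrock Converse API (boto3's `converse_stream`). The model ID is Bedrock's published identifier for Claude 3 Haiku; the helper names are assumptions, and the live Kubernetes query tooling is omitted:

```python
# Hedged sketch of one streaming chat turn via Amazon Bedrock's
# Converse API. Helper names are illustrative; tool-use wiring for
# live Kubernetes queries is left out.

HAIKU_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def to_converse_messages(history):
    """Map (role, text) pairs into Converse-API message dicts."""
    return [{"role": role, "content": [{"text": text}]} for role, text in history]

def stream_chat_turn(history, question):
    """Yield response text chunks as Haiku streams them back."""
    import boto3  # deferred so the module imports without AWS deps
    client = boto3.client("bedrock-runtime")
    messages = to_converse_messages(history) + [
        {"role": "user", "content": [{"text": question}]}
    ]
    response = client.converse_stream(modelId=HAIKU_MODEL_ID, messages=messages)
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            yield event["contentBlockDelta"]["delta"]["text"]
```

Streaming chunk by chunk is what makes Haiku feel responsive over the WebSocket connection to the UI.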

Built on AWS Serverless & Amazon Bedrock

We're 100% serverless on AWS. Alerts flow through API Gateway → SQS → Lambda for normalization and storage in DynamoDB. EventBridge triggers our investigation workflow, which runs as a Lambda function orchestrating an AI agent graph.
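The normalization step of that flow can be sketched as a Lambda handler that drains SQS records into DynamoDB. This is a minimal sketch under stated assumptions: the table name, the normalized schema, and the `normalize` helper are all illustrative:

```python
# Minimal sketch of the normalization Lambda: SQS records carrying
# JSON alert webhooks are flattened into a common shape and stored.
# Table name and schema are illustrative assumptions.
import hashlib
import json

def normalize(raw: dict, source: str) -> dict:
    """Flatten a provider-specific alert into a common record."""
    labels = raw.get("labels", raw)
    key = f'{source}:{labels.get("alertname", "unknown")}:{labels.get("namespace", "")}'
    return {
        "alert_id": hashlib.sha256(key.encode()).hexdigest()[:16],  # stable dedup key
        "source": source,
        "name": labels.get("alertname", "unknown"),
        "namespace": labels.get("namespace"),
        "raw": raw,  # keep the original payload for the investigation step
    }

def handler(event, context):
    import boto3  # deferred so the module imports without AWS deps
    table = boto3.resource("dynamodb").Table("opsworker-alerts")  # illustrative name
    for record in event["Records"]:
        item = normalize(json.loads(record["body"]), source="alertmanager")
        table.put_item(Item=item)
```

Writing the item to DynamoDB is also what lets EventBridge pick up the change and kick off the investigation workflow.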

Our Amazon Bedrock model strategy:

  • Claude Sonnet - Complex reasoning and root cause analysis in the investigation flow
  • Amazon Nova - Fast extraction tasks during alert processing
  • Claude Haiku - Responsive, low-latency chat interactions
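This per-task routing can be expressed as a small lookup over Bedrock model IDs. The IDs below are the publicly documented Bedrock identifiers, but the mapping itself is our illustration of the strategy, not OpsWorker's actual configuration:

```python
# Sketch of per-task Bedrock model routing. Model IDs are Bedrock's
# published identifiers; the TASK_MODELS mapping is illustrative.

TASK_MODELS = {
    "root_cause_analysis": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # deep reasoning
    "alert_extraction": "amazon.nova-lite-v1:0",                         # fast extraction
    "chat": "anthropic.claude-3-haiku-20240307-v1:0",                    # low-latency chat
}

def model_for(task: str) -> str:
    # Fall back to the cheapest, fastest model for unknown tasks.
    return TASK_MODELS.get(task, TASK_MODELS["chat"])
```

Centralizing the mapping keeps cost/latency trade-offs in one place: swapping a model for one task doesn't touch the others.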

We communicate with customer Kubernetes clusters via SQS FIFO queues to a lightweight in-cluster agent. The Review UI is served via CloudFront + S3, with a **WebSocket API** for real-time chat streaming.
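Dispatching a command to the in-cluster agent over a FIFO queue might look like the sketch below, where the per-org `MessageGroupId` keeps each tenant's commands strictly ordered. Queue URL, message shape, and helper names are illustrative assumptions:

```python
# Hedged sketch of sending a command to the in-cluster agent via an
# SQS FIFO queue. Message shape and helper names are illustrative;
# MessageGroupId gives per-org ordering, per SQS FIFO semantics.
import json
import uuid

def build_agent_command(org_id: str, action: str, params: dict) -> dict:
    return {
        "MessageBody": json.dumps({"action": action, "params": params}),
        "MessageGroupId": org_id,                     # strict ordering per org
        "MessageDeduplicationId": str(uuid.uuid4()),  # FIFO dedup key
    }

def send_agent_command(sqs_client, queue_url: str, org_id: str,
                       action: str, params: dict):
    return sqs_client.send_message(QueueUrl=queue_url,
                                   **build_agent_command(org_id, action, params))
```

Because ordering is scoped to the `MessageGroupId`, one org's slow cluster never blocks command delivery to another org.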

[Diagram: OpsWorker AWS architecture]

[Diagram: OpsWorker component view]

Why We Chose This Architecture

Serverless was a natural fit—alerts are bursty and unpredictable. We pay only for what we use, scale automatically, and have zero infrastructure to manage. Amazon Bedrock gives us access to best-in-class foundation models without managing ML infrastructure—and lets us pick the right model for each task (Sonnet for depth, Haiku for speed, Nova for extraction). The multi-tenant design (per-org DynamoDB tables, SQS message groups) lets us serve multiple customers securely from a single deployment.