4 Core OpsWorker Updates: Reinventing Incident Investigation with an AI SRE Agent

0:00

/0:34

Modern systems can scale faster than teams ever could.

Microservices, Kubernetes, managed cloud services, and shared infrastructure have made software delivery more powerful — but also dramatically more complex. When something goes wrong in production, discovering the root cause of an incident across dozens of interdependent services still takes way too long.

For a mid-sized organization, this can translate into hundreds of incidents a year. They all sap attention, consume engineering time, and add operational risk.

Given that we’ve built ever more powerful observability stacks, incident investigation today is still largely manual:

Engineers flip between dashboards, logs, and configurations
Context is fractured across tools and teams
Alerts tell us something is broken — but not why

And yet, the same preventable issues keep coming back.

That’s the gap OpsWorker was designed to fill.

Over the past two months, we’ve been single-minded about one thing:

Take the human guesswork out of incident investigation — and let machines do what they’re good at: understanding complex systems automatically.

Beginning next week, we’ll be publishing four key updates, packaged and designed to work together as an AI SRE Agent that drives faster incident investigation, lower MTTR, and reduced operational toil.

Why We’re Still Investigating Incidents Wrong

Cloud-native systems didn’t just add scale — they multiplied complexity.

Today:

Mean MTTR ranges from 1 to 48 hours, depending on incident complexity
70–80% of incidents are repetitive L1/L2 activities
Engineers lose 100–200 hours per year to reactive toil, maintaining the status quo
Hidden operational cost equals roughly 1 additional FTE per 10 engineers
Mid-sized SaaS companies lose $100k–$500k+ per hour of downtime, depending on impact

Alert fatigue, endless Dev ↔ Ops hand-offs, and mental graph traversal under pressure are not signs of bad teams — they are symptoms of systems operating beyond human scale.

As senior SRE talent becomes harder to hire, pressure increases, burnout risk grows, and incident response becomes even more fragile.

The 4×4 Update: What’s Coming Next Week

Starting next week, we’ll be announcing four core updates over four days — each addressing a critical failure point in modern incident workflows.

Day 1 — Intelligent Investigation

OpsWorker automatically parses alerts, discovers Kubernetes and service relationships, validates configurations, and generates structured incident investigations:

Resource topology
Error description
Root cause
Contributing factors
Immediate actions
Preventive measures
Transparent reasoning

This replaces manual incident reconstruction with a clear, evidence-based understanding.

Intelligent Investigation: Root cause, Contribution Factors and Immediate Actions to solve the incident

Intelligent Investigation: Discovering Kubernetes Objects Dependencies

Intelligent Investigation: Proposing Overall improvements to prevent incident reoccurrence

OpsWorker automatically parses alerts, discovers Kubernetes and service relationships, validates configurations, and generates structured incident investigations:

Resource topology
Error description
Likely root cause
Contributing factors
Immediate actions
Preventive measures
Transparent reasoning

This replaces manual incident reconstruction with clear, evidence-based understanding.

Day 2 — Speak to the Investigation & Your Resources

Engineers can interact directly with the investigation:

Ask follow-up questions
Explore alternative root causes
Add context
Learn from real incidents

This transforms incidents into learning moments, not just fire drills — especially for teams that own the full lifecycle from code to production.

Day 3 — Smart Slack

Alerts shouldn’t require a PhD in dashboards.

Smart Slack delivers:

Human-readable alert descriptions
Explicit impact categorization (user-facing, internal, platform)
Root cause findings
Key evidence
Immediate, copy-paste-ready actions
Escalation signals when needed

Incident investigation begins — and often ends — inside Slack.

OpsWorker: Smart Actionable Slack Messages

Day 4 — Enhanced Alerts View

A new alerts experience allows teams to:

Filter and investigate alerts across environments
Understand impact domains and severity
Launch investigations directly from alerts

Detection and investigation finally become one continuous workflow.

OpsWorker: Enhanced Alerts View with Possibility to Filter and Run Investigation

Who This Is Built For

These updates are designed to support real-world incident work across the organization:

SREs & On-call Engineers
Faster investigations, lower cognitive load, reduced burnout
Software Engineers
Clear production feedback without deep infrastructure guesswork
Platform & DevOps Teams
Systemic improvements instead of repeated firefighting
CTOs & IT Leaders
Lower MTTR and higher reliability — without scaling headcount

Together, these capabilities form an AI SRE Agent that can autonomously troubleshoot incidents in complex environments, correlate signals across systems, and propose actionable solutions in minutes instead of hours.

What’s Next

Starting next week:

Day 1: Intelligent Investigation
Day 2: Chat with the Investigation
Day 3: Smart Slack
Day 4: Enhanced Alerts View

One feature per day.
Real product workflows.
Built for how incidents actually happen.

Tagged in:

Product News

4 Core OpsWorker Updates: Reinventing Incident Investigation with an AI SRE Agent

Why We’re Still Investigating Incidents Wrong