Key Metrics

Overview

OpsWorker tracks several categories of operational metrics to help you understand your alert landscape and measure the impact of automated investigation.

Alert Metrics

Metric | Description
Total alerts | Number of signals received from all monitoring sources
Alerts by severity | Breakdown: critical, warning, info
Alerts by source | Breakdown: Prometheus, Grafana, Datadog, CloudWatch
Alerts by cluster | Volume per connected cluster
Alerts by namespace | Most active namespaces
Alert trend | Week-over-week and month-over-month comparisons
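
A week-over-week comparison is a percentage change between two consecutive periods. The sketch below shows one way to compute it from exported alert counts; the function name and the sample counts are hypothetical and do not come from OpsWorker.

```python
from typing import Optional


def percent_change(current: int, previous: int) -> Optional[float]:
    """Percentage change from `previous` to `current`.

    Returns None when `previous` is 0, because the change is undefined.
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0


# Hypothetical alert counts for two consecutive weeks.
last_week, this_week = 120, 90
change = percent_change(this_week, last_week)
print(f"Week-over-week change: {change:+.1f}%")
```

The same function works for month-over-month comparisons if you pass monthly totals instead.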

Investigation Metrics

Metric | Description
Total investigations | Number of completed investigations
Investigation completion rate | Percentage that completed successfully vs. failed
Average investigation time | Mean time from alert to completed investigation
Investigations by cluster | Volume per cluster
Investigation outcomes | Distribution of root cause types identified
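
To illustrate how completion rate and average investigation time relate to the underlying records, here is a minimal sketch. The record fields (`alert_at`, `done_at`, `status`) and the sample data are assumptions made for illustration, not OpsWorker's actual data model.

```python
from datetime import datetime
from statistics import mean

# Hypothetical investigation records; field names are illustrative only.
records = [
    {"alert_at": datetime(2024, 1, 1, 10, 0), "done_at": datetime(2024, 1, 1, 10, 4), "status": "completed"},
    {"alert_at": datetime(2024, 1, 1, 11, 0), "done_at": datetime(2024, 1, 1, 11, 6), "status": "completed"},
    {"alert_at": datetime(2024, 1, 1, 12, 0), "done_at": datetime(2024, 1, 1, 12, 8), "status": "completed"},
    {"alert_at": datetime(2024, 1, 1, 13, 0), "done_at": None, "status": "failed"},
]


def completion_rate(items):
    """Percentage of investigations that completed successfully."""
    if not items:
        return 0.0
    completed = sum(1 for r in items if r["status"] == "completed")
    return 100.0 * completed / len(items)


def average_investigation_minutes(items):
    """Mean minutes from alert to completed investigation (completed only)."""
    durations = [
        (r["done_at"] - r["alert_at"]).total_seconds() / 60.0
        for r in items
        if r["status"] == "completed"
    ]
    return mean(durations) if durations else 0.0


print(f"Completion rate: {completion_rate(records):.1f}%")
print(f"Average investigation time: {average_investigation_minutes(records):.1f} min")
```

Failed investigations are excluded from the average here, since they have no completion time; that choice is also an assumption.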

Time Saved Metrics

Metric | Description
Estimated hours saved | Investigations × average manual investigation time
Time saved per investigation | Based on typical manual investigation duration (30–80 min baseline)
Cumulative savings | Running total over selected time period
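
As a worked example of the estimate above (investigations × average manual investigation time), the sketch below applies both ends of the 30–80 minute baseline. The investigation count of 200 is made up, and the function name is hypothetical.

```python
def estimated_hours_saved(investigation_count: int, manual_minutes: float) -> float:
    """Investigations multiplied by the manual baseline, converted to hours."""
    return investigation_count * manual_minutes / 60.0


# Hypothetical number of investigations in the selected time period.
count = 200

# Bounds of the 30-80 minute manual baseline.
low = estimated_hours_saved(count, 30)
high = estimated_hours_saved(count, 80)

print(f"Estimated hours saved: {low:.1f} to {high:.1f}")
```

Reporting the range rather than a single number makes the dependence on the baseline explicit.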

Feedback Metrics

Metric | Description
Accuracy rate | Percentage of investigations rated "Accurate"
Partial accuracy rate | Percentage rated "Partially Accurate"
Feedback response rate | Percentage of investigations that received feedback
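
The three feedback rates can be sketched as simple ratios. This example assumes that accuracy rates use investigations with feedback as the denominator, while response rate uses all investigations; the source does not state which denominator OpsWorker uses. The "Inaccurate" label and the sample ratings are also hypothetical.

```python
from typing import Dict, Optional, Sequence


def feedback_rates(ratings: Sequence[Optional[str]]) -> Dict[str, float]:
    """Compute feedback percentages; None means no feedback was given."""
    rated = [r for r in ratings if r is not None]

    def pct(part: int, whole: int) -> float:
        # Guard against division by zero when there is nothing to count.
        return 100.0 * part / whole if whole else 0.0

    return {
        "response_rate": pct(len(rated), len(ratings)),
        # Assumption: accuracy is measured over rated investigations only.
        "accuracy_rate": pct(rated.count("Accurate"), len(rated)),
        "partial_accuracy_rate": pct(rated.count("Partially Accurate"), len(rated)),
    }


# Hypothetical ratings for eight investigations.
sample = ["Accurate", "Accurate", "Partially Accurate", None,
          "Inaccurate", None, "Accurate", None]

for name, value in feedback_rates(sample).items():
    print(f"{name}: {value:.1f}%")
```

A low response rate means the accuracy figures rest on few ratings, so it is worth reading the three numbers together.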

Using Metrics

  • Report to leadership: Use time saved and investigation count for ROI reporting
  • Track improvement: Watch alert volume trends; a sustained decrease suggests that root causes are being addressed
  • Identify hot spots: Use namespace and cluster breakdowns to focus engineering effort
  • Evaluate accuracy: Use feedback metrics to assess investigation quality over time

Next Steps