Most incident management is reactive. Something breaks, an alert fires, an engineer wakes up, and the investigation starts. The industry has invested heavily in making that investigation faster - better dashboards, smarter alerting, automated runbooks. All of that happens after the failure.
Predictive analytics asks a different question: what if you could see the failure coming?
This post walks through a practical approach to building a predictive layer on top of your existing Kubernetes monitoring stack. We'll look at how to use historical Prometheus metrics to forecast pod instability, and how to connect that forecast to automated investigation so your team has context before the alert even fires.
The gap between monitoring and prediction
Standard Prometheus alerting is threshold-based. CPU above 80% for five minutes - alert. Memory usage within 10% of its limit - alert. Pod restart count above three - alert.
Thresholds work. They're also blind to the patterns that precede failures. A pod that restarts once per day is below most alerting thresholds. A pod that has restarted once per day for the past week, with a clearly accelerating trend, is almost certainly about to become a real incident. Threshold alerting treats those two situations identically.
Time-series forecasting treats them very differently.
The core idea: train a model on your historical Prometheus data to predict future resource behavior. When the predicted trajectory crosses into a danger zone, trigger investigation now - before the system actually gets there, while it is still healthy enough to be debugged.
Step 1 - Collecting the right historical data
The most useful metrics for pod instability prediction are:
- kube_pod_container_status_restarts_total - restart count over time, the most direct signal
- container_memory_usage_bytes vs kube_pod_container_resource_limits - how close to the memory ceiling
- container_cpu_usage_seconds_total - rate of CPU consumption
- kube_deployment_status_replicas_unavailable - replica health across a deployment
You can pull this data from Prometheus directly using the HTTP API. Here's a Python snippet to fetch the last seven days of restart counts for all pods in a namespace:
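A minimal sketch using the `query_range` endpoint - the Prometheus URL is a placeholder, and the response parsing is split into its own function so it can be reused and tested separately:

```python
import json
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # placeholder - point at your Prometheus service


def parse_range_response(payload):
    """Turn a Prometheus query_range response into {pod: [(timestamp, value), ...]}."""
    return {
        series["metric"].get("pod", "unknown"): [
            (float(ts), float(val)) for ts, val in series["values"]
        ]
        for series in payload["data"]["result"]
    }


def fetch_restart_counts(namespace, days=7, step="5m"):
    """Fetch per-pod restart counts for the last `days` days at `step` resolution."""
    end = time.time()
    params = urllib.parse.urlencode({
        "query": f'kube_pod_container_status_restarts_total{{namespace="{namespace}"}}',
        "start": end - days * 86400,
        "end": end,
        "step": step,
    })
    url = f"{PROM_URL}/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_range_response(json.load(resp))
```

The result maps each pod name to its time series, which is the shape the training step below expects.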
Seven days of data at 5-minute resolution gives you 2,016 data points per pod - enough to train a basic time-series model and enough to capture weekly patterns (deployment schedules, traffic spikes, batch jobs) that would otherwise look like anomalies.
Step 2 - Building the anomaly detection model
For restart rate prediction, a straightforward approach is to model the rate of change of restarts over time and flag pods where that rate is accelerating beyond a historical baseline.
We'll use scikit-learn's IsolationForest for anomaly scoring, combined with a simple linear trend component. IsolationForest works well here because it doesn't require labeled anomaly data - it learns what "normal" looks like from the majority of your data, then scores deviations.
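A sketch of the training step. The window size and features here are illustrative choices: each one-hour window of a pod's restart history becomes a (total restarts, trend slope) row, so the model sees both how fast restarts happen and whether the rate is accelerating.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

WINDOW = 12  # 12 x 5-minute samples = one-hour windows


def restart_features(counts):
    """Convert a cumulative restart-count series into per-window feature rows.

    Each row is (restarts in the window, linear trend slope of the rate).
    """
    rates = np.diff(np.asarray(counts, dtype=float))  # restarts per 5-minute interval
    rows = []
    for i in range(0, len(rates) - WINDOW + 1, WINDOW):
        window = rates[i:i + WINDOW]
        slope = np.polyfit(np.arange(WINDOW), window, 1)[0]
        rows.append((window.sum(), slope))
    return np.array(rows).reshape(-1, 2)


def train_model(historical_counts, contamination=0.05):
    """Fit an IsolationForest on windows drawn from all pods' histories."""
    features = np.vstack([restart_features(c) for c in historical_counts])
    model = IsolationForest(contamination=contamination, random_state=42)
    model.fit(features)
    return model
```

`decision_function` on a fitted IsolationForest returns lower (more negative) scores for more anomalous windows, which is what the scorer in the next step uses.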
The contamination=0.05 parameter tells the model to treat the most unusual 5% of historical windows as anomalies. You'll want to tune this based on your environment - a cluster with frequent but harmless restarts might need a higher threshold; a stable production cluster could go lower.
Step 3 - Scoring current pod behavior
Once the model is trained, you run it continuously against recent data. Here's a scorer that returns an anomaly score and a human-readable risk level for each pod:
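A sketch of that scorer. The risk thresholds are illustrative and should be calibrated against your own score distribution; `model` is any fitted estimator exposing `decision_function` (more negative means more anomalous), such as the IsolationForest above.

```python
import numpy as np

# Illustrative bands - calibrate these against your own score distribution
RISK_BANDS = [(-0.15, "critical"), (-0.05, "high"), (0.0, "medium")]


def score_pod(model, recent_counts, window=12):
    """Score the most recent window of a pod's cumulative restart counts.

    Returns an anomaly score plus a human-readable risk level.
    """
    rates = np.diff(np.asarray(recent_counts, dtype=float))[-window:]
    slope = np.polyfit(np.arange(len(rates)), rates, 1)[0]
    score = float(model.decision_function([[rates.sum(), slope]])[0])
    for threshold, label in RISK_BANDS:
        if score < threshold:
            return {"score": score, "risk": label}
    return {"score": score, "risk": "low"}
```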
Step 4 - Triggering investigation before the alert fires
When a pod hits "high" or "critical" risk level, you want to start an investigation immediately - not wait for the threshold alert to fire. The cleanest way to integrate with OpsWorker is to send a synthetic alert via the same Prometheus Alertmanager webhook that OpsWorker already listens to.
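A sketch of the trigger. The webhook URL is a placeholder, and the payload follows the standard Alertmanager webhook format (version 4); building the payload is separated from sending it so the shape is easy to test.

```python
import json
import time
import urllib.request

# Placeholder - point at the Alertmanager-compatible webhook your receiver exposes
WEBHOOK_URL = "http://opsworker.monitoring.svc/webhook/alertmanager"


def build_predictive_alert(pod, namespace, score, risk, restart_rate):
    """Build an Alertmanager-style webhook payload for a predicted instability."""
    return {
        "version": "4",
        "status": "firing",
        "receiver": "opsworker",
        "alerts": [{
            "status": "firing",
            "labels": {
                "alertname": "PredictivePodInstability",
                "namespace": namespace,
                "pod": pod,
                "severity": "warning",
            },
            # Specific annotations give the investigation richer context
            "annotations": {
                "summary": f"Pod {pod} predicted to become unstable",
                "anomaly_score": f"{score:.3f}",
                "risk_level": risk,
                "restart_rate": f"{restart_rate:.2f} restarts/hour",
            },
            "startsAt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }],
    }


def send_alert(payload):
    """POST the synthetic alert to the webhook receiver."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```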
This keeps your integration clean - OpsWorker receives a standard Alertmanager payload, runs its normal investigation workflow, and delivers root cause to Slack. The only difference is that the trigger is your anomaly detector rather than a threshold breach.
The annotation fields are worth paying attention to. OpsWorker uses the alert payload as context for its investigation - the more specific you are about what the anomaly detector observed (score, restart rate, trend), the richer the investigation context will be.
Step 5 - Putting it together as a scheduled job
The full loop runs as a Kubernetes CronJob every 10 minutes:
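A sketch of the manifest - the image name and namespace are placeholders for your own build of the fetch/score/trigger loop:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-instability-scan
  namespace: monitoring
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Forbid   # never overlap a slow scan with the next one
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scanner
              image: registry.example.com/pod-instability-scanner:latest  # your image
              env:
                - name: PROM_URL
                  value: "http://prometheus:9090"
          restartPolicy: Never
```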
In production, you'd persist the trained models (pickle or joblib to a PVC or S3 bucket) and retrain weekly rather than on every startup. The scan itself should run every 5-15 minutes - frequent enough to catch accelerating trends early, infrequent enough not to flood OpsWorker with duplicate investigations.
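A minimal persistence sketch using the stdlib pickle - the default path assumes a PVC mounted at /models, and could equally be swapped for an S3 client:

```python
import pickle
from pathlib import Path

# Assumed mount point - a PVC in-cluster, or replace with object storage
DEFAULT_MODEL_PATH = Path("/models/restart_anomaly.pkl")


def save_model(model, path=DEFAULT_MODEL_PATH):
    """Serialize the trained model so scan runs don't retrain from scratch."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(model))


def load_model(path=DEFAULT_MODEL_PATH):
    """Load a previously trained model, or None if no model exists yet."""
    return pickle.loads(path.read_bytes()) if path.exists() else None
```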
What OpsWorker does with a predictive alert

When OpsWorker receives a PredictivePodInstability alert, it runs the same investigation it would for any other alert type - topology discovery, log analysis, resource correlation, configuration review. The difference is timing.
A pod that's been restarting with increasing frequency but hasn't yet crossed a hard threshold is in a more debuggable state than one that's actively crashing. Logs are cleaner. Resource metrics show a gradual trend rather than a spike. Recent deployments or configuration changes are still clearly correlated in the timeline.
In practice, predictive investigations tend to surface a clearer root cause than post-failure investigations, because the signal-to-noise ratio in the data is better when the system is degrading rather than failed.
The investigation output arrives in Slack with the standard format - affected resources, root cause with confidence level, specific remediation steps. The summary will include the context from your annotation fields, making it clear this was a predictive trigger rather than a live incident.
Honest limitations of this approach
Model drift is real. Kubernetes environments change constantly - new deployments, scaling events, updated resource limits. A model trained seven days ago may have a miscalibrated baseline by the time it's scoring today's data. Weekly retraining is the minimum; daily is better for volatile clusters.
False positives are inevitable. Any anomaly detector with a 5% contamination threshold will flag roughly 5% of normal behavior as anomalous. In a cluster with 200 pods, that's 10 false-positive investigations per scan cycle if you're not careful. Add a deduplication layer - don't trigger a new investigation for a pod that already has an open one from the last 30 minutes.
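The deduplication layer can be as simple as an in-memory map of last trigger times - a sketch, assuming a single scanner process (a shared store would be needed across replicas):

```python
import time

_last_triggered = {}        # pod name -> timestamp of last triggered investigation
DEDUP_WINDOW = 30 * 60      # seconds; matches the 30-minute window above


def should_trigger(pod, now=None):
    """Return True only if this pod hasn't triggered within the dedup window."""
    now = time.time() if now is None else now
    last = _last_triggered.get(pod)
    if last is not None and now - last < DEDUP_WINDOW:
        return False
    _last_triggered[pod] = now
    return True
```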
Restart count is a lagging indicator. By the time restarts are accelerating, something has already started failing. Combining restart data with memory trend analysis (approaching the limit before crossing it) and error rate from application metrics gives you earlier signal.
This doesn't replace threshold alerting. Predictive analysis catches slow-burn failures. Sudden failures - a bad deployment, a dependent service going down - will still fire your threshold alerts first. These two approaches are complementary, not alternatives.
Where this fits in the broader incident management picture
The reactive investigation workflow - alert fires, OpsWorker investigates, root cause delivered to Slack - handles the majority of incidents well. Predictive analytics adds a proactive layer on top: a small percentage of failures are slow-building enough that early detection changes the outcome from incident-and-recovery to problem-identified-and-fixed-before-users-notice.
That's worth building for. Not because it replaces your existing tooling, but because the incidents it catches early are often the ones that would have been hardest to investigate after the fact.
The code in this post is a starting point, not a production system. If you build on it, instrument the false positive rate from the start - that metric will tell you more about how well your model is calibrated than any other signal.
Sources and further reading
- Prometheus HTTP API documentation
- scikit-learn IsolationForest
- Alertmanager webhook receiver documentation
- Google SRE Book - Chapter 6, Monitoring Distributed Systems
The OpsWorker team is building automated Kubernetes investigation infrastructure. If you're implementing predictive alerting and want to talk through the architecture, reach out at opsworker.ai.