Observability AI Agent
Overview
The Observability AI Agent queries your Grafana instance via the Grafana MCP integration to retrieve metrics, logs, dashboards, alert rules, and incident data during investigations and chat sessions.
This agent is powered by a Grafana MCP sidecar that runs alongside the OpsWorker Kubernetes Agent and communicates with your Grafana instance using a service account token.
Requirements
- Grafana MCP integration configured for the cluster (separate from Grafana Alerting)
- Grafana service account with Viewer role
- See Grafana MCP Setup
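Before wiring up the sidecar, it can help to confirm the service account token actually grants read access. A minimal sketch using only the Python standard library; the endpoint path is Grafana's standard HTTP API, but the URL and token values are placeholders you must replace:

```python
import urllib.request

GRAFANA_URL = "https://grafana.example.com"  # placeholder: your Grafana URL
TOKEN = "glsa_example_token"                 # placeholder: service account token

def grafana_request(path: str) -> urllib.request.Request:
    # Service account tokens are sent as a Bearer token on every request.
    return urllib.request.Request(
        f"{GRAFANA_URL}{path}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

# A Viewer-role token is sufficient for read-only endpoints like this one:
req = grafana_request("/api/datasources")
# urllib.request.urlopen(req)  # uncomment to test against a live instance
```

A `200` response here (and on `/api/health`) indicates the token the MCP sidecar will use is valid and read-capable.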
Capabilities
| Capability | Description | Examples |
|---|---|---|
| PromQL queries | Execute Prometheus queries via Grafana's Prometheus datasource | CPU usage, memory trends, request rates, latency percentiles, histograms |
| Loki log search | Run LogQL queries via Grafana's Loki datasource | Error pattern detection, log volume analysis, structured log filtering |
| Dashboard inspection | Search, browse, and retrieve data from Grafana dashboards | Find relevant dashboards by service name, view panel data |
| Datasource queries | Query any configured Grafana datasource | Prometheus, Loki, CloudWatch, ClickHouse, Elasticsearch |
| Alert rule inspection | View Grafana alert rules and notification policies | Check which alerts are configured, their thresholds and states |
| Incident browsing | View Grafana incidents and on-call schedules | Check active incidents, who's on call |
| Annotation retrieval | Read Grafana annotations for event correlation | Deployment markers, incident start/end times |
| Deep link generation | Create direct links to Grafana views | Link to specific dashboards, time ranges, panels |
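The first two capabilities boil down to query strings submitted through Grafana's datasource APIs. A hedged sketch of the kind of PromQL the agent runs and how a range-query URL is assembled; the metric and label names are illustrative, and the datasource-proxy path should be verified against your Grafana version's HTTP API docs:

```python
from urllib.parse import urlencode

# Example queries of the kind the agent submits (names illustrative):
PROMQL = 'sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))'
LOGQL = '{namespace="production"} |= "connection refused"'

def range_query_url(grafana: str, ds_uid: str, promql: str,
                    start: int, end: int, step: str = "30s") -> str:
    # Grafana's datasource proxy exposes the Prometheus HTTP API, so the
    # standard /api/v1/query_range parameters apply behind the proxy path.
    qs = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{grafana}/api/datasources/proxy/uid/{ds_uid}/api/v1/query_range?{qs}"

url = range_query_url("https://grafana.example.com", "prom-uid", PROMQL, 0, 600)
```

The agent issues these queries for you during investigations; the sketch just makes the underlying mechanics concrete.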
Investigation Enhancement
When Grafana MCP is active, multiple investigation agents gain Grafana capabilities:
| Investigation Agent | What Grafana Adds |
|---|---|
| investigate | Correlates alerts with historical metric trends via PromQL, searches logs for error patterns via Loki, inspects related Grafana alerts and incidents, checks on-call context |
| analyze_logs | Queries Loki for log patterns and statistics, cross-references log error spikes with metric anomalies, detects elevated error rates |
| validate_resources | Checks actual CPU and memory utilization metrics from Prometheus to validate whether resource limits are appropriate |
| check_dependencies | Queries service-level request metrics and latency to identify degrading upstream or downstream services |
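To make the analyze_logs cross-referencing step concrete, here is a toy sketch of flagging a log-error spike against a baseline. The threshold and windowing are purely illustrative, not how OpsWorker actually scores anomalies:

```python
def error_spike(counts: list[int], factor: float = 3.0) -> bool:
    """Flag the latest window if its error count exceeds factor x the prior mean."""
    *history, latest = counts
    baseline = sum(history) / len(history)
    # Guard against a near-zero baseline so quiet services can still spike.
    return latest > factor * max(baseline, 1.0)

# 47 timeout errors against a quiet baseline would be flagged:
error_spike([2, 1, 3, 0, 47])  # → True
```

In practice the agent would run a LogQL `count_over_time` query to obtain these per-window counts, then correlate flagged windows with metric anomalies from Prometheus.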
Use Cases
During Investigations
The Observability Agent enriches investigations with metric and log context:
- "Prometheus shows memory usage ramping linearly from 12:00 UTC — correlates with the OOMKill event at 14:23 UTC"
- "Loki logs show 47 connection timeout errors to redis-cache in the 30 minutes before the alert"
- "Grafana alert rule
HighCPUUsagehas been firing intermittently for the past 3 days — this is a recurring pattern" - "The latency dashboard shows a gradual increase starting 3 hours before the alert, coinciding with a traffic ramp"
In AI Chat
Query your Grafana instance through natural language:
- Run a PromQL query: `rate(http_requests_total{namespace="production",code=~"5.."}[5m])`
- Search Loki logs for "connection refused" errors in namespace production
- What Grafana dashboards exist for the payment service?
- Show me the CPU and memory utilization panels from the API gateway dashboard
- Are there any Grafana alert rules currently firing?
- Who's currently on call according to Grafana?
- What annotations were added in the last 24 hours? (looking for deployment markers)
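Answers to prompts like these often include deep links back into Grafana (the deep-link capability from the table above). A hedged sketch of how such a link is composed; the `/d/{uid}` path, epoch-millisecond `from`/`to` parameters, and `var-<name>` template variables are standard Grafana dashboard URL conventions, but double-check them for your version:

```python
from urllib.parse import urlencode

def dashboard_link(grafana: str, dashboard_uid: str,
                   from_ms: int, to_ms: int, **variables: str) -> str:
    # Dashboard URLs take from/to as epoch milliseconds and template
    # variables as var-<name> query parameters.
    params = {"from": from_ms, "to": to_ms}
    params.update({f"var-{k}": v for k, v in variables.items()})
    return f"{grafana}/d/{dashboard_uid}?{urlencode(params)}"

link = dashboard_link("https://grafana.example.com", "abc123",
                      1700000000000, 1700003600000, namespace="production")
```

Opening such a link lands directly on the dashboard, scoped to the time range the agent investigated.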
Grafana MCP vs Grafana Alerting
These are separate integrations that serve different purposes:
| | Grafana Alerting | Grafana MCP |
|---|---|---|
| Direction | Grafana → OpsWorker | OpsWorker → Grafana |
| Purpose | Send alerts for investigation | AI queries metrics, logs, dashboards |
| Setup | Webhook contact point in Grafana | Service account + MCP sidecar |
| Requires | Grafana 9+ unified alerting | Any Grafana version with API access |
| Independent | Works without MCP | Works without Alerting |
Using both together provides the richest experience: alerts trigger investigations, and the AI uses MCP to pull metrics and logs for deeper root cause analysis.
Failure Isolation
The Grafana MCP sidecar and the Kubernetes Agent run as independent MCP sessions. If Grafana is down or the MCP sidecar is unavailable, all Kubernetes investigation tools continue to work normally. The investigation will simply have less observability context.
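This isolation can be pictured as a simple try/fallback around the observability session. A conceptual sketch only, not OpsWorker's actual internals; the function and parameter names are hypothetical:

```python
def gather_context(k8s_tools, grafana_session=None) -> dict:
    """Collect investigation context; Kubernetes tools never depend on Grafana."""
    context = {"kubernetes": k8s_tools()}  # always runs, regardless of Grafana
    if grafana_session is not None:
        try:
            context["observability"] = grafana_session()
        except ConnectionError:
            # Grafana or the MCP sidecar is unreachable: continue without it.
            context["observability"] = None
    return context
```

The key property is that a Grafana outage degrades the investigation's context rather than failing it.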
Next Steps
- Grafana Integration Setup — Configure both Grafana integrations
- Multi-Agent Workflows — How the Observability Agent works with other agents
- Example Prompts — Grafana-specific prompts to try