Observability AI Agent
Overview
The Observability AI Agent queries your Grafana instance via the Grafana MCP integration to retrieve metrics, logs, dashboards, alert rules, and incident data during investigations and chat sessions.
This agent is powered by a Grafana MCP sidecar that runs alongside the OpsWorker Kubernetes Agent and communicates with your Grafana instance using a service account token.
Requirements
- Grafana MCP integration configured for the cluster (separate from Grafana Alerting)
- Grafana service account with Viewer role
- See Grafana MCP Setup
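Before wiring up the sidecar, it can help to confirm the service account token actually grants read access. A minimal sketch using only the Python standard library; the endpoint path is Grafana's standard HTTP API, but the URL and token values are placeholders you must replace:

```python
import urllib.request

GRAFANA_URL = "https://grafana.example.com"  # placeholder: your Grafana URL
TOKEN = "glsa_example_token"                 # placeholder: service account token

def grafana_request(path: str) -> urllib.request.Request:
    # Service account tokens are sent as a Bearer token on every request.
    return urllib.request.Request(
        f"{GRAFANA_URL}{path}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

# A Viewer-role token is sufficient for read-only endpoints like this one:
req = grafana_request("/api/datasources")
# urllib.request.urlopen(req)  # uncomment to test against a live instance
```

A `200` response here (and on `/api/health`) indicates the token the MCP sidecar will use is valid and read-capable.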
Capabilities
| Capability | Description | Examples |
|---|---|---|
| PromQL queries | Execute Prometheus queries via Grafana's Prometheus datasource | CPU usage, memory trends, request rates, latency percentiles, histograms |
| Loki log search | Run LogQL queries via Grafana's Loki datasource | Error pattern detection, log volume analysis, structured log filtering |
| Dashboard inspection | Search, browse, and retrieve data from Grafana dashboards | Find relevant dashboards by service name, view panel data |
| Datasource queries | Query any configured Grafana datasource | Prometheus, Loki, CloudWatch, ClickHouse, Elasticsearch |
| Alert rule inspection | View Grafana alert rules and notification policies | Check which alerts are configured, their thresholds and states |
| Incident browsing | View Grafana incidents and on-call schedules | Check active incidents, who's on call |
| Annotation retrieval | Read Grafana annotations for event correlation | Deployment markers, incident start/end times |
| Deep link generation | Create direct links to Grafana views | Link to specific dashboards, time ranges, panels |
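The first two capabilities boil down to query strings submitted through Grafana's datasource APIs. A hedged sketch of the kind of PromQL the agent runs and how a range-query URL is assembled; the metric and label names are illustrative, and the datasource-proxy path should be verified against your Grafana version's HTTP API docs:

```python
from urllib.parse import urlencode

# Example queries of the kind the agent submits (names illustrative):
PROMQL = 'sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))'
LOGQL = '{namespace="production"} |= "connection refused"'

def range_query_url(grafana: str, ds_uid: str, promql: str,
                    start: int, end: int, step: str = "30s") -> str:
    # Grafana's datasource proxy exposes the Prometheus HTTP API, so the
    # standard /api/v1/query_range parameters apply behind the proxy path.
    qs = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{grafana}/api/datasources/proxy/uid/{ds_uid}/api/v1/query_range?{qs}"

url = range_query_url("https://grafana.example.com", "prom-uid", PROMQL, 0, 600)
```

The agent issues these queries for you during investigations; the sketch just makes the underlying mechanics concrete.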
Investigation Enhancement
When Grafana MCP is active, multiple investigation agents gain Grafana capabilities:
| Investigation Agent | What Grafana Adds |
|---|---|
| investigate | Correlates alerts with historical metric trends via PromQL, searches logs for error patterns via Loki, inspects related Grafana alerts and incidents, checks on-call context |
| analyze_logs | Queries Loki for log patterns and statistics, cross-references log error spikes with metric anomalies, detects elevated error rates |
| validate_resources | Checks actual CPU and memory utilization metrics from Prometheus to validate whether resource limits are appropriate |
| check_dependencies | Queries service-level request metrics and latency to identify degrading upstream or downstream services |
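To make the analyze_logs cross-referencing step concrete, here is a toy sketch of flagging a log-error spike against a baseline. The threshold and windowing are purely illustrative, not how OpsWorker actually scores anomalies:

```python
def error_spike(counts: list[int], factor: float = 3.0) -> bool:
    """Flag the latest window if its error count exceeds factor x the prior mean."""
    *history, latest = counts
    baseline = sum(history) / len(history)
    # Guard against a near-zero baseline so quiet services can still spike.
    return latest > factor * max(baseline, 1.0)

# 47 timeout errors against a quiet baseline would be flagged:
error_spike([2, 1, 3, 0, 47])  # → True
```

In practice the agent would run a LogQL `count_over_time` query to obtain these per-window counts, then correlate flagged windows with metric anomalies from Prometheus.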
Use Cases
During Investigations
The Observability Agent enriches investigations with metric and log context:
- "Prometheus shows memory usage ramping linearly from 12:00 UTC — correlates with the OOMKill event at 14:23 UTC"
- "Loki logs show 47 connection timeout errors to redis-cache in the 30 minutes before the alert"
- "Grafana alert rule
HighCPUUsagehas been firing intermittently for the past 3 days — this is a recurring pattern" - "The latency dashboard shows a gradual increase starting 3 hours before the alert, coinciding with a traffic ramp"
In AI Chat
Query your Grafana instance through natural language:
- Run a PromQL query: `rate(http_requests_total{namespace="production",code=~"5.."}[5m])`
- Search Loki logs for "connection refused" errors in namespace production
- What Grafana dashboards exist for the payment service?
- Show me the CPU and memory utilization panels from the API gateway dashboard
- Are there any Grafana alert rules currently firing?
- Who's currently on call according to Grafana?
- What annotations were added in the last 24 hours? (looking for deployment markers)
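Answers to prompts like these often include deep links back into Grafana (the deep-link capability from the table above). A hedged sketch of how such a link is composed; the `/d/{uid}` path, epoch-millisecond `from`/`to` parameters, and `var-<name>` template variables are standard Grafana dashboard URL conventions, but double-check them for your version:

```python
from urllib.parse import urlencode

def dashboard_link(grafana: str, dashboard_uid: str,
                   from_ms: int, to_ms: int, **variables: str) -> str:
    # Dashboard URLs take from/to as epoch milliseconds and template
    # variables as var-<name> query parameters.
    params = {"from": from_ms, "to": to_ms}
    params.update({f"var-{k}": v for k, v in variables.items()})
    return f"{grafana}/d/{dashboard_uid}?{urlencode(params)}"

link = dashboard_link("https://grafana.example.com", "abc123",
                      1700000000000, 1700003600000, namespace="production")
```

Opening such a link lands directly on the dashboard, scoped to the time range the agent investigated.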
Grafana MCP vs Grafana Alerting
These are separate integrations that serve different purposes:
| | Grafana Alerting | Grafana MCP |
|---|---|---|
| Direction | Grafana → OpsWorker | OpsWorker → Grafana |
| Purpose | Send alerts for investigation | AI queries metrics, logs, dashboards |
| Setup | Webhook contact point in Grafana | Service account + MCP sidecar |
| Requires | Grafana 9+ unified alerting | Any Grafana version with API access |
| Independent | Works without MCP | Works without Alerting |
Using both together provides the richest experience: alerts trigger investigations, and the AI uses MCP to pull metrics and logs for deeper root cause analysis.
Failure Isolation
The Grafana MCP sidecar and the Kubernetes Agent run as independent MCP sessions. If Grafana is down or the MCP sidecar is unavailable, all Kubernetes investigation tools continue to work normally. The investigation will simply have less observability context.
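This isolation can be pictured as a simple try/fallback around the observability session. A conceptual sketch only, not OpsWorker's actual internals; the function and parameter names are hypothetical:

```python
def gather_context(k8s_tools, grafana_session=None) -> dict:
    """Collect investigation context; Kubernetes tools never depend on Grafana."""
    context = {"kubernetes": k8s_tools()}  # always runs, regardless of Grafana
    if grafana_session is not None:
        try:
            context["observability"] = grafana_session()
        except ConnectionError:
            # Grafana or the MCP sidecar is unreachable: continue without it.
            context["observability"] = None
    return context
```

The key property is that a Grafana outage degrades the investigation's context rather than failing it.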
Next Steps
- Grafana Integration Setup — Configure both Grafana integrations
- Multi-Agent Workflows — How the Observability Agent works with other agents
- Example Prompts — Grafana-specific prompts to try