Observability AI Agent

Overview

The Observability AI Agent queries your Grafana instance via the Grafana MCP integration to retrieve metrics, logs, dashboards, alert rules, and incident data during investigations and chat sessions.

This agent is powered by a Grafana MCP sidecar that runs alongside the OpsWorker Kubernetes Agent and communicates with your Grafana instance using a service account token.
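As a minimal sketch of how the sidecar's authentication works: every request to Grafana carries the service account token as a Bearer token. The environment variable names and the health-check endpoint choice below are illustrative assumptions, not the integration's actual configuration keys.

```python
import os
import urllib.request

# Hypothetical sketch: GRAFANA_URL and GRAFANA_SA_TOKEN are assumed
# variable names, not the product's real configuration keys.
def build_health_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated request to Grafana's /api/health endpoint."""
    req = urllib.request.Request(base_url.rstrip("/") + "/api/health")
    # Grafana service account tokens are sent as a Bearer token.
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = build_health_request(
    os.environ.get("GRAFANA_URL", "http://grafana.monitoring:3000"),
    os.environ.get("GRAFANA_SA_TOKEN", "glsa_example"),
)
print(req.full_url)
print(req.get_header("Authorization"))
```

A Viewer-role token is sufficient because the agent only reads data; it never modifies dashboards or alert rules.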

Requirements

  • Grafana MCP integration configured for the cluster (separate from Grafana Alerting)
  • Grafana service account with Viewer role
  • See Grafana MCP Setup

Capabilities

  • PromQL queries: execute Prometheus queries via Grafana's Prometheus datasource. Examples: CPU usage, memory trends, request rates, latency percentiles, histograms.
  • Loki log search: run LogQL queries via Grafana's Loki datasource. Examples: error pattern detection, log volume analysis, structured log filtering.
  • Dashboard inspection: search, browse, and retrieve data from Grafana dashboards. Examples: find relevant dashboards by service name, view panel data.
  • Datasource queries: query any configured Grafana datasource. Examples: Prometheus, Loki, CloudWatch, ClickHouse, Elasticsearch.
  • Alert rule inspection: view Grafana alert rules and notification policies. Examples: check which alerts are configured, their thresholds and states.
  • Incident browsing: view Grafana incidents and on-call schedules. Examples: check active incidents, who's on call.
  • Annotation retrieval: read Grafana annotations for event correlation. Examples: deployment markers, incident start/end times.
  • Deep link generation: create direct links to Grafana views. Examples: link to specific dashboards, time ranges, panels.

Investigation Enhancement

When Grafana MCP is active, multiple investigation agents gain Grafana capabilities:

  • investigate: correlates alerts with historical metric trends via PromQL, searches logs for error patterns via Loki, inspects related Grafana alerts and incidents, checks on-call context.
  • analyze_logs: queries Loki for log patterns and statistics, cross-references log error spikes with metric anomalies, detects elevated error rates.
  • validate_resources: checks actual CPU and memory utilization metrics from Prometheus to validate whether resource limits are appropriate.
  • check_dependencies: queries service-level request metrics and latency to identify degrading upstream or downstream services.
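The validate_resources check above can be sketched as a simple comparison of observed peak usage against the configured limit. The thresholds and verdict wording here are assumptions for illustration, not the product's actual values.

```python
# Illustrative sketch: compare observed peak utilization (as the agent
# might obtain from a PromQL query) against the configured limit.
# The 90%/20% thresholds are assumed values, not product defaults.
def assess_limit(observed_peak: float, limit: float) -> str:
    ratio = observed_peak / limit
    if ratio > 0.9:
        return "limit too tight: peak usage is within 10% of the limit"
    if ratio < 0.2:
        return "limit generous: peak usage is under 20% of the limit"
    return "limit looks appropriate"

# A pod peaking at 950 MB against a 1 GB memory limit:
print(assess_limit(observed_peak=950e6, limit=1e9))
# limit too tight: peak usage is within 10% of the limit
```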

Use Cases

During Investigations

The Observability AI Agent enriches investigations with metric and log context:

  • "Prometheus shows memory usage ramping linearly from 12:00 UTC — correlates with the OOMKill event at 14:23 UTC"
  • "Loki logs show 47 connection timeout errors to redis-cache in the 30 minutes before the alert"
  • "Grafana alert rule HighCPUUsage has been firing intermittently for the past 3 days — this is a recurring pattern"
  • "The latency dashboard shows a gradual increase starting 3 hours before the alert, coinciding with a traffic ramp"

In AI Chat

Query your Grafana instance through natural language:

  • Run a PromQL query: rate(http_requests_total{namespace="production",code=~"5.."}[5m])
  • Search Loki logs for "connection refused" errors in namespace production
  • What Grafana dashboards exist for the payment service?
  • Show me the CPU and memory utilization panels from the API gateway dashboard
  • Are there any Grafana alert rules currently firing?
  • Who's currently on call according to Grafana?
  • What annotations were added in the last 24 hours? (looking for deployment markers)
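Behind a prompt like "Run a PromQL query", the request ultimately reaches Grafana's datasource query API. The sketch below builds a minimal body for Grafana's /api/ds/query endpoint; the datasource UID is a placeholder, and the exact fields the MCP server sends may differ from this minimal shape.

```python
import json

# Sketch of a /api/ds/query request body for a Prometheus datasource.
# "prometheus-uid" is a placeholder; the MCP server's real payload may
# include additional fields.
def promql_query_payload(expr: str, datasource_uid: str,
                         from_time: str = "now-1h", to_time: str = "now") -> str:
    body = {
        "from": from_time,
        "to": to_time,
        "queries": [{
            "refId": "A",
            "datasource": {"uid": datasource_uid},
            "expr": expr,
            "format": "time_series",
        }],
    }
    return json.dumps(body, indent=2)

print(promql_query_payload(
    'rate(http_requests_total{namespace="production",code=~"5.."}[5m])',
    "prometheus-uid",
))
```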

Grafana MCP vs Grafana Alerting

These are separate integrations that serve different purposes:

Grafana Alerting
  • Direction: Grafana → OpsWorker
  • Purpose: send alerts for investigation
  • Setup: webhook contact point in Grafana
  • Requires: Grafana 9+ unified alerting
  • Works without MCP

Grafana MCP
  • Direction: OpsWorker → Grafana
  • Purpose: AI queries metrics, logs, and dashboards
  • Setup: service account + MCP sidecar
  • Requires: any Grafana version with API access
  • Works without Alerting

Using both together provides the richest experience: alerts trigger investigations, and the AI uses MCP to pull metrics and logs for deeper root cause analysis.

Failure Isolation

The Grafana MCP sidecar and the Kubernetes Agent run as independent MCP sessions. If Grafana is down or the MCP sidecar is unavailable, all Kubernetes investigation tools continue to work normally. The investigation will simply have less observability context.
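The isolation behavior above amounts to a degrade-gracefully pattern: each data source is gathered in its own step, and a Grafana or MCP failure leaves the Kubernetes findings intact. The function below is a stand-in sketch, not the agent's actual code; fetch_grafana_context represents the real MCP call.

```python
# Sketch of the failure-isolation behavior: a Grafana/MCP failure
# degrades to "no observability context" instead of failing the
# whole investigation. Both fetchers are hypothetical stand-ins.
def gather_context(fetch_k8s, fetch_grafana_context):
    context = {"kubernetes": fetch_k8s()}
    try:
        context["grafana"] = fetch_grafana_context()
    except Exception:  # MCP sidecar down, Grafana unreachable, etc.
        context["grafana"] = None  # investigation proceeds without it
    return context

def grafana_down():
    raise ConnectionError("grafana unreachable")

ctx = gather_context(lambda: {"pods": 12}, grafana_down)
print(ctx)  # {'kubernetes': {'pods': 12}, 'grafana': None}
```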

Next Steps