
Debug Production Issues

Scenario

Users report slow response times. No alert has fired yet, but something is clearly wrong. You need to investigate quickly before it escalates.

How OpsWorker Helps

Use AI Chat to interactively query your cluster:

You: Are there any pods with high restart counts in namespace production?

OpsWorker: Found 3 pods with elevated restart counts:
- api-gateway-7d8f9b (5 restarts in last hour)
- cache-worker-3k2m (3 restarts)
- payment-processor-9x1z (2 restarts)

You: Show me the logs for api-gateway — what errors are occurring?

OpsWorker: Recent error logs from api-gateway show:
- Connection timeout to redis-cache:6379 (repeated 47 times in last 30 min)
- "Failed to acquire connection from pool" errors starting at 14:23 UTC

You: What changed in the api-gateway deployment recently?

OpsWorker: [Git Agent] A PR was merged 2 hours ago that changed the Redis
connection pool settings — max connections was reduced from 50 to 10.

Root cause found: A recent config change reduced the Redis connection pool size, causing connection exhaustion under normal load.
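Under the hood, the first query in the conversation amounts to filtering pod status by restart count. A minimal sketch of that filter — not OpsWorker's actual implementation, and with sample data shaped like `kubectl get pods -o json` output using the pod names and counts from the transcript above:

```python
# Sketch: flag pods with elevated restart counts, as the first chat query does.
# Not OpsWorker's real code; pod names and counts are taken from the transcript.

def pods_with_high_restarts(pod_list: dict, threshold: int = 2) -> list[tuple[str, int]]:
    """Return (pod name, total restarts) for pods at or above the threshold."""
    flagged = []
    for pod in pod_list.get("items", []):
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in pod.get("status", {}).get("containerStatuses", [])
        )
        if restarts >= threshold:
            flagged.append((pod["metadata"]["name"], restarts))
    # Most-restarted pods first, matching the ordering in the chat output
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Sample shaped like `kubectl get pods -n production -o json`
sample = {
    "items": [
        {"metadata": {"name": "api-gateway-7d8f9b"},
         "status": {"containerStatuses": [{"restartCount": 5}]}},
        {"metadata": {"name": "cache-worker-3k2m"},
         "status": {"containerStatuses": [{"restartCount": 3}]}},
        {"metadata": {"name": "payment-processor-9x1z"},
         "status": {"containerStatuses": [{"restartCount": 2}]}},
        {"metadata": {"name": "healthy-pod-1a2b"},
         "status": {"containerStatuses": [{"restartCount": 0}]}},
    ]
}

print(pods_with_high_restarts(sample))
# [('api-gateway-7d8f9b', 5), ('cache-worker-3k2m', 3), ('payment-processor-9x1z', 2)]
```

The chat interface saves you from writing and chaining queries like this by hand — you ask in plain language and get back the filtered result.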

Outcome

  • Proactive debugging before an outage occurs
  • Identified root cause through interactive questioning — no need to chain kubectl commands
  • Correlated code change with infrastructure symptoms (requires Git integration)
  • Total time: About 3 minutes of conversation vs. 20+ minutes of manual investigation
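For comparison, the manual investigation this conversation replaces would look roughly like the following. These are illustrative kubectl commands (namespace and deployment names come from the scenario above; the repo path in the last step is hypothetical), and each step requires reading raw output and deciding the next command yourself:

```shell
# Illustrative manual equivalent of the conversation above.

# 1. Find pods with elevated restart counts
kubectl get pods -n production \
  --sort-by='.status.containerStatuses[0].restartCount'

# 2. Pull recent error logs from the suspect deployment
kubectl logs deploy/api-gateway -n production --since=30m | grep -i error

# 3. Check what changed in the deployment recently
kubectl rollout history deployment/api-gateway -n production

# 4. Correlate with recent merges in the repo (separate tool entirely;
#    "config/" is a hypothetical path for illustration)
git log --oneline --since="3 hours ago" -- config/
```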