Suggested Fixes

Overview

Every OpsWorker investigation includes actionable recommendations organized into two categories: immediate actions to resolve the current issue, and preventive measures to stop it from recurring.

Recommendation Categories

Immediate Actions

Steps to fix the problem right now:

  • Specific to the current issue and environment
  • Include resource names, namespaces, and values from your cluster
  • Designed to restore service as quickly as possible
  • Typically include kubectl commands you can copy and execute

Example:

Immediate Action: Restart the deployment to recover from the OOM condition:

kubectl rollout restart deployment/api-gateway -n production

Then increase the memory limit to prevent immediate recurrence:

kubectl set resources deployment/api-gateway -n production --limits=memory=512Mi
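
After applying commands like these, it can help to confirm that the rollout finished and that the new limit is actually in place. The following is a minimal sketch, not part of OpsWorker's output: the helper name verify_memory_limit is hypothetical, and it assumes the memory limit belongs to the first container in the pod template.

```shell
#!/usr/bin/env bash
# Sketch: confirm the rollout completed and the memory limit was applied.
# verify_memory_limit is a hypothetical helper; the deployment, namespace,
# and limit in the usage line mirror the example above.

verify_memory_limit() {
  local deploy="$1" ns="$2" expected="$3"

  # Block until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/${deploy}" -n "${ns}" --timeout=120s || return 1

  # Read the memory limit of the first container in the pod template
  # (assumption: the relevant container is at index 0).
  local actual
  actual="$(kubectl get deployment "${deploy}" -n "${ns}" \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}')"

  if [ "${actual}" = "${expected}" ]; then
    echo "OK: memory limit is ${actual}"
  else
    echo "MISMATCH: expected ${expected}, got ${actual:-<unset>}"
    return 1
  fi
}

# Usage (requires cluster access):
#   verify_memory_limit api-gateway production 512Mi
```

Because kubectl set resources changes the pod template, it starts a new rollout of its own, so the status check above waits for that rollout rather than only the earlier restart.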

Preventive Measures

Longer-term fixes to address the root cause:

  • Address the underlying issue, not just the symptom
  • May involve code changes, configuration updates, or architectural adjustments
  • Help break the cycle of recurring alerts

Example:

Preventive Measure: Investigate the connection pool leak identified in the logs. The maxConnections setting in the application config should be bounded. Consider adding a connection pool metrics exporter to catch this trend earlier.
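
One way to judge whether a preventive measure like this is working is to check if pods are still being OOM-killed after the change ships. The sketch below is illustrative only: the helper name list_oom_killed and the label selector app=api-gateway are assumptions, and it inspects only the first container of each pod.

```shell
#!/usr/bin/env bash
# Sketch: list pods whose first container was last terminated with OOMKilled.
# list_oom_killed is a hypothetical helper; the selector passed to it is an
# assumption about how the deployment's pods are labeled.

list_oom_killed() {
  local ns="$1" selector="$2"

  # Emit one tab-separated line per pod: name, restart count, last termination reason.
  kubectl get pods -n "${ns}" -l "${selector}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$3 == "OOMKilled" { print $1 " restarts=" $2 }'
}

# Usage (requires cluster access):
#   list_oom_killed production app=api-gateway
```

An empty result after the fix has been running for a while suggests the leak is no longer exhausting memory; it does not replace the pool metrics recommended above.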

Contextual Recommendations

Recommendations are tailored to your specific environment:

  • Use actual resource names, namespaces, and configuration values
  • Account for your cluster's topology (e.g., related services that may be affected)
  • Consider the specific alert type and observed failure pattern
  • Reference real data from the investigation (log lines, event details)

Viewing Recommendations

Recommendations appear in:

  • Slack notification — Summary with key actions and commands
  • Portal investigation detail — Full recommendations with context
  • Investigation chat — Ask follow-up questions about any recommendation
