Troubleshooting Connectivity
Agent Shows as Disconnected
Check Pod Status
kubectl get pods -n opsworker
| Pod Status | Likely Cause | Action |
|---|---|---|
| Running | Agent running but can't reach SQS | Check network/proxy |
| CrashLoopBackOff | Configuration error or resource issue | Check logs |
| ImagePullBackOff | Can't pull agent image | Check image registry access |
| Pending | Scheduling issue | Check node resources, tolerations |
| Not found | Agent not installed | Run Helm install |
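For scripted checks, the table above can be expressed as a small lookup. This is a minimal sketch: the helper name suggest_action is hypothetical and not part of any OpsWorker tooling, and the status strings are the ones kubectl prints in the STATUS column.

```shell
#!/bin/sh
# Map a pod status (STATUS column of `kubectl get pods`) to the action
# suggested in the table above. An empty status means no pod was found.
# suggest_action is a hypothetical helper name used for illustration.
suggest_action() {
  case "$1" in
    Running)          echo "Check network/proxy" ;;
    CrashLoopBackOff) echo "Check logs" ;;
    ImagePullBackOff) echo "Check image registry access" ;;
    Pending)          echo "Check node resources, tolerations" ;;
    "")               echo "Run Helm install" ;;
    *)                echo "Unrecognized status: $1" ;;
  esac
}

# Example usage (not run here; requires cluster access):
#   status=$(kubectl get pods -n opsworker -l app=opsworker-agent --no-headers \
#     2>/dev/null | awk 'NR==1 {print $3}')
#   suggest_action "$status"
```

The example reads only the first matching pod, which is enough for a single-replica agent; with several replicas you would loop over each line instead.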
Check Pod Logs
kubectl logs -n opsworker -l app=opsworker-agent
Look for:
- Connection errors → Network/proxy issue
- Authentication errors → Invalid cluster token
- Timeout errors → SQS endpoint unreachable
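The three categories above can be checked with a simple filter over the log output. The sketch below is illustrative only: classify_logs is a hypothetical helper, and the grep patterns are assumed keywords rather than confirmed agent log messages, so adjust them to what your agent actually prints.

```shell
#!/bin/sh
# Classify agent log text read from stdin into the categories listed above.
# The patterns are assumptions about typical wording, not the agent's
# documented log format. classify_logs is a hypothetical helper name.
classify_logs() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -Eiq 'unauthorized|authentication|invalid token'; then
    echo "Authentication error: check the cluster token"
  elif printf '%s\n' "$logs" | grep -Eiq 'timed out|timeout'; then
    echo "Timeout error: SQS endpoint may be unreachable"
  elif printf '%s\n' "$logs" | grep -Eiq 'connection refused|connection reset|no route to host'; then
    echo "Connection error: check network/proxy"
  else
    echo "No known error pattern found"
  fi
}

# Example usage (not run here; requires cluster access):
#   kubectl logs -n opsworker -l app=opsworker-agent --tail=200 | classify_logs
```

When a label selector is used, kubectl logs returns only the last 10 lines per pod by default, so passing --tail explicitly helps avoid missing earlier errors.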
Verify Outbound Connectivity
Test that the agent can reach AWS SQS:
kubectl exec -n opsworker deploy/opsworker-agent -- \
wget -q -O /dev/null https://sqs.us-east-1.amazonaws.com
If the command cannot connect or times out, outbound HTTPS to SQS is likely blocked. Check:
- Security groups (EKS)
- Firewall rules (GKE)
- NSG rules (AKS)
- NetworkPolicy resources
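The connectivity test above targets us-east-1. If your agent talks to SQS in a different region (an assumption; this page only shows us-east-1), the same test can be pointed at that region's endpoint. The sketch below builds the URL from the standard AWS pattern sqs.<region>.amazonaws.com; sqs_endpoint is a hypothetical helper name.

```shell
#!/bin/sh
# Build the regional SQS endpoint URL for the standard AWS partition.
# Defaults to us-east-1, the region used in the test command above.
# sqs_endpoint is a hypothetical helper name used for illustration.
sqs_endpoint() {
  region=${1:-us-east-1}
  echo "https://sqs.${region}.amazonaws.com"
}

# Example usage (not run here; requires cluster access and wget in the image):
#   kubectl exec -n opsworker deploy/opsworker-agent -- \
#     wget -q -O /dev/null "$(sqs_endpoint eu-west-1)"
```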
Check Cluster Token
Verify the correct token was used during installation:
helm get values opsworker-agent -n opsworker
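If the token is incorrect, update it. Passing --reuse-values keeps the other values from the previous release, which Helm would otherwise reset to chart defaults when --set is used:
helm upgrade opsworker-agent opsworker/opsworker-agent \
-n opsworker \
--reuse-values \
--set clusterToken=CORRECT_TOKEN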
You can regenerate the token from the OpsWorker portal if needed.
Proxy Configuration
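If your cluster is behind a proxy, set the proxy URL. Here --reuse-values preserves the cluster token and any other settings from the existing release:
helm upgrade opsworker-agent opsworker/opsworker-agent \
-n opsworker \
--reuse-values \
--set proxy.https=http://proxy.example.com:3128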
Agent Connects but Goes Offline Intermittently
Resource Limits
Check if the pod is being OOM-killed:
kubectl describe pod -n opsworker -l app=opsworker-agent | grep -A5 "Last State"
If you see OOMKilled, increase the memory limit (--reuse-values keeps your other release settings intact):
helm upgrade opsworker-agent opsworker/opsworker-agent \
-n opsworker \
--reuse-values \
--set resources.limits.memory=512Mi
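The OOM check can also be scripted instead of reading describe output by eye. The sketch below assumes the termination reasons are fetched with a kubectl jsonpath query (shown in the usage comment) and piped in; has_oomkilled is a hypothetical helper name.

```shell
#!/bin/sh
# Read container termination reasons (whitespace-separated) from stdin and
# report whether any container was OOM-killed.
# has_oomkilled is a hypothetical helper name used for illustration.
has_oomkilled() {
  # Put each reason on its own line, then look for an exact match.
  if tr -s ' \t' '\n\n' | grep -qx 'OOMKilled'; then
    echo "OOMKilled detected: raise resources.limits.memory"
  else
    echo "No OOMKilled terminations found"
  fi
}

# Example usage (not run here; requires cluster access):
#   kubectl get pods -n opsworker -l app=opsworker-agent \
#     -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}' \
#     | has_oomkilled
```

The query only sees the most recent termination of each container, so an older OOM kill followed by a different failure will not show up here.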
Node Stability
Check whether the node hosting the agent is stable by reviewing recent events in the namespace, such as evictions or rescheduling:
kubectl get events -n opsworker --sort-by='.lastTimestamp'
Network Intermittency
Intermittent SQS connectivity can cause temporary disconnections. The agent will automatically reconnect.
Next Steps
- Data Collection Troubleshooting — Fix data gathering issues
- Health Checks — Verify system health