Incident Summary
Intermittent connection failures to a specific microservice were observed, while other connections on the same host remained stable. Issues with new project creation surfaced subsequently and were linked to timeouts. The root cause was identified as an unstable node in the AWS managed Redis instance, which caused random request failures for specific keys. Resolution involved recreating the affected AWS resources and adding monitoring, notifications, and regular node rotation.
Impact
- Intermittent connection issues to one microservice.
- Issues with creating new projects.
- Potential disruption to user experience from intermittent request failures.
Root Cause
An unstable node within the AWS managed Redis instance caused random request failures due to timeouts. When this node was promoted to master, several microservices were affected even though standard monitoring metrics appeared normal.
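Failures of this kind are easy to miss because instance-level metrics stay healthy while individual keys time out. The following is a minimal probe sketch, assuming redis-py and hypothetical host and key names, that repeatedly reads a few canary keys and reports per-key timeouts:

```python
import time
import redis  # redis-py

# Hypothetical endpoint and canary keys; substitute real values.
REDIS_HOST = "my-redis.example.cache.amazonaws.com"
CANARY_KEYS = ["project:canary:1", "project:canary:2"]

client = redis.Redis(host=REDIS_HOST, port=6379, socket_timeout=0.5)

while True:
    for key in CANARY_KEYS:
        start = time.monotonic()
        try:
            client.get(key)
            print(f"ok   {key} {time.monotonic() - start:.3f}s")
        except (redis.exceptions.TimeoutError, redis.exceptions.ConnectionError):
            # Per-key timeouts surface here even when instance-level
            # metrics (CPU, memory, connections) look normal.
            print(f"FAIL {key} timed out")
    time.sleep(5)
```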
Resolution
- Retired the affected instances and provisioned replacements as a temporary fix for the connectivity issues.
- Recreated the AWS managed Redis resources to eliminate the problematic node (see the sketch below).
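For reference, the recreation could be scripted roughly as follows, assuming the managed Redis is ElastiCache and using boto3; the replication group IDs and node type are hypothetical placeholders:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

OLD_GROUP = "projects-redis"      # hypothetical replication group ID
NEW_GROUP = "projects-redis-v2"   # hypothetical replacement

# Tear down the replication group containing the unstable node.
elasticache.delete_replication_group(
    ReplicationGroupId=OLD_GROUP,
    RetainPrimaryCluster=False,
)
elasticache.get_waiter("replication_group_deleted").wait(
    ReplicationGroupId=OLD_GROUP
)

# Provision a fresh replication group on new underlying nodes.
elasticache.create_replication_group(
    ReplicationGroupId=NEW_GROUP,
    ReplicationGroupDescription="Rebuilt after unstable-node incident",
    Engine="redis",
    CacheNodeType="cache.m5.large",  # assumed node type
    NumCacheClusters=2,
    AutomaticFailoverEnabled=True,
)
```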
Corrective and Preventative Actions
Corrective Actions
- Recreated AWS managed Redis resources.
Preventative Actions
- Added monitoring and notifications for Redis node failures (see the alarm sketch below).
- Implemented regular rotation of Redis nodes so that long-running nodes do not accumulate latent software faults (see the rotation sketch below).
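A sketch of the kind of alarm added, assuming the metrics live in CloudWatch under AWS/ElastiCache and an SNS topic already exists for alerts (all names here are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical cluster node and SNS topic; substitute real values.
cloudwatch.put_metric_alarm(
    AlarmName="redis-replication-lag",
    Namespace="AWS/ElastiCache",
    MetricName="ReplicationLag",
    Dimensions=[{"Name": "CacheClusterId", "Value": "projects-redis-v2-002"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=5.0,                    # seconds of lag before alerting
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",     # a silent node should also alert
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:redis-alerts"],
)
```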
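For the rotation, one possible approach (a sketch, not necessarily the exact mechanism used) is a scheduled job that adds a fresh replica, fails the shard over, and retires the old node, assuming an ElastiCache replication group with automatic failover and hypothetical identifiers:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")
GROUP = "projects-redis-v2"  # hypothetical replication group ID

# Add a fresh replica so the group gains a newly provisioned node.
elasticache.increase_replica_count(
    ReplicationGroupId=GROUP, NewReplicaCount=2, ApplyImmediately=True
)
elasticache.get_waiter("replication_group_available").wait(
    ReplicationGroupId=GROUP
)

# Trigger a controlled failover; note that ElastiCache chooses which
# replica to promote, so this rotates the primary role rather than
# guaranteeing the fresh node takes over.
elasticache.test_failover(ReplicationGroupId=GROUP, NodeGroupId="0001")
elasticache.get_waiter("replication_group_available").wait(
    ReplicationGroupId=GROUP
)

# Retire the old node (now a replica); the ID here is hypothetical.
elasticache.decrease_replica_count(
    ReplicationGroupId=GROUP,
    ReplicasToRemove=["projects-redis-v2-001"],
    ApplyImmediately=True,
)
```

Run on a schedule (cron or EventBridge), this caps how long any single node stays in service.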
Lessons Learned
What went well
- The team quickly identified and mitigated the initial connectivity issues.
- The root cause was eventually identified despite monitoring metrics that appeared normal throughout.
What could be improved
- Faster identification of the Redis node as the root cause.
- More robust monitoring for managed services to detect subtle failures.