Assignment creation and start issues

Incident Report for Codio

Postmortem

Incident Summary

Connections to one specific microservice failed intermittently, while other connections on the same host remained stable. Problems with new project creation surfaced afterwards and were linked to timeouts. The root cause was identified as an unstable node in the AWS managed Redis instance, which caused random request failures for specific keys. The resolution involved recreating the AWS resources and implementing additional monitoring, notifications, and regular node rotation.

Impact

  • Intermittent connection issues to one microservice.
  • Issues with creating new projects.
  • Potential disruption to user experience.

Root Cause

An unstable node within the AWS managed Redis instance caused random request failures due to timeouts. Whenever this node became the master, several microservices experienced issues, even though monitoring metrics appeared normal.

Resolution

  • Retired the affected instances and provisioned new ones, which temporarily resolved the connectivity issues.
  • Recreated the AWS managed Redis resources to eliminate the problematic node.

Corrective and Preventative Actions

Corrective Actions

  • Recreated AWS managed Redis resources.

Preventative Actions

  • Added monitoring and notifications for Redis instance failures.
  • Implemented regular rotation of Redis nodes so that no node runs long enough to accumulate faults.
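
As an illustration of the kind of check that can catch failures affecting only specific keys while aggregate metrics look normal, here is a minimal sketch of a per-key canary probe. It is not the monitoring that was actually deployed. The names probe_keys, ProbeResult, should_alert, the canary key names, and the thresholds are all hypothetical, and the fetch argument stands in for whatever Redis client call a service would use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ProbeResult:
    """Outcome of one probe pass over a set of canary keys (hypothetical type)."""
    checked: int
    failed_keys: List[str]

    @property
    def failure_ratio(self) -> float:
        # Fraction of probed keys that errored or answered too slowly.
        return len(self.failed_keys) / self.checked if self.checked else 0.0


def probe_keys(
    fetch: Callable[[str], object],
    keys: Iterable[str],
    max_latency_s: float = 0.2,  # assumed latency budget, tune per service
) -> ProbeResult:
    """Read each canary key once; record keys that raise or exceed the budget."""
    failed: List[str] = []
    checked = 0
    for key in keys:
        checked += 1
        start = time.monotonic()
        try:
            fetch(key)
        except Exception:
            # Timeouts and connection errors both count as a failed key.
            failed.append(key)
            continue
        if time.monotonic() - start > max_latency_s:
            failed.append(key)
    return ProbeResult(checked, failed)


def should_alert(result: ProbeResult, threshold: float = 0.05) -> bool:
    """Alert when the share of failing keys reaches the (assumed) threshold."""
    return result.failure_ratio >= threshold


if __name__ == "__main__":
    # Stand-in for a real client: two keys time out, the rest succeed.
    def fake_fetch(key: str) -> bytes:
        if key in {"canary:3", "canary:7"}:
            raise TimeoutError(key)
        return b"ok"

    result = probe_keys(fake_fetch, [f"canary:{i}" for i in range(10)])
    print(result.failed_keys, should_alert(result))
```

In practice, fetch would wrap a real client read with a short client-side timeout. Spreading the canary keys widely makes it more likely that a fault limited to part of the keyspace shows up in the probe, even when instance-level metrics stay within normal ranges.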

Lessons Learned

What went well

  • Team quickly identified and addressed initial connectivity issues.
  • Root cause was eventually identified despite challenges.

What could be improved

  • Faster identification of the Redis node as the root cause.
  • More robust monitoring for managed services to detect subtle failures.
Posted Apr 09, 2025 - 01:07 BST

Resolved

This incident has been resolved.
Posted Apr 08, 2025 - 21:45 BST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 08, 2025 - 21:41 BST

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 08, 2025 - 21:24 BST

Investigating

We are currently investigating this issue.
Posted Apr 08, 2025 - 21:10 BST
This incident affected: Application.