Post-Incident Review: July 2, 2024 Errors Loading CircleCI

Summary and Impact:

On July 2, 2024, at 07:34 UTC, a production deployment that followed the MicroFrontends (MFEs) migration Ingress guide broke Kong routing for the circleci.com host. As a result, the APIs and UI were unavailable. The issue was resolved by reverting the change in web-ui-v1, which restored Kong routing.

Customer Impact Analysis:

The incident had a major impact: customers were unable to access any UI or ingest pipelines. Based on traffic data, an estimated 16.5k pipelines were affected.

Background:

The incident was related to the merging of MicroFrontends (MFEs) in the web-ui-consolidated monorepo. A change to Kong routing broke API and UI functionality.

What Happened:

The routing failures stemmed from how Kong handles route priorities, which produced unexpected behavior after the change. A bug in the Kong router exacerbated the incident by affecting multiple routes.
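
This review does not list the exact routes involved, so the following is only a toy illustration of how priority-ordered routing can let one new route shadow existing ones. It is not Kong's actual matching algorithm: the `Route` type, the ordering rule (explicit priority first, then longest path prefix), the host `app.example.test`, and all route names are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    name: str
    host: str
    path_prefix: str
    priority: int = 0  # hypothetical explicit priority; higher wins


def match(routes: List[Route], host: str, path: str) -> Optional[Route]:
    """Return the winning route for a request, or None if nothing matches."""
    candidates = [
        r for r in routes
        if r.host == host and path.startswith(r.path_prefix)
    ]
    if not candidates:
        return None
    # Assumed ordering: explicit priority first, then the longest prefix.
    return max(candidates, key=lambda r: (r.priority, len(r.path_prefix)))


# Hypothetical routes before the change: the API prefix wins for API paths.
before = [
    Route("api", "app.example.test", "/api"),
    Route("ui", "app.example.test", "/"),
]

# Hypothetical change: a new catch-all route with a higher priority.
after = before + [Route("new-frontend", "app.example.test", "/", priority=10)]

print(match(before, "app.example.test", "/api/v2/pipeline").name)
print(match(after, "app.example.test", "/api/v2/pipeline").name)
```

In this toy model, the added catch-all route takes over API paths as well as UI paths, because its priority outranks the longer `/api` prefix. A pre-deploy check that replays a few known request paths against the proposed route set could flag this kind of shadowing.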

Lessons Learned and Future Steps:

  • Conduct incident bot testing and improve incident response protocols.
  • Investigate route priorities and upgrade Kong for better routing.
  • Enhance internal documentation on route priorities to prevent similar incidents.

Timeline:

  • 07:35:46 UTC: Customer Impact Start
  • 07:39:53 UTC: Initial investigation initiated
  • 08:05:45 UTC: Helm Rollback applied, issue resolved
  • 08:06:00 UTC: Customer Impact End
  • 08:38:00 UTC: Incident End
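
The intervals implied by the timeline above can be derived with a short script. This is a minimal sketch: the timestamps are taken from the timeline, while the dictionary keys and the `elapsed` helper are names invented for the example.

```python
from datetime import datetime, timedelta

# Timestamps (UTC) copied from the timeline; keys are illustrative names.
TIMELINE = {
    "impact_start": "07:35:46",
    "investigation_start": "07:39:53",
    "rollback_applied": "08:05:45",
    "impact_end": "08:06:00",
    "incident_end": "08:38:00",
}


def parse(ts: str) -> datetime:
    """Parse an HH:MM:SS timestamp on the incident date (July 2, 2024)."""
    return datetime.strptime(f"2024-07-02 {ts}", "%Y-%m-%d %H:%M:%S")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named timeline events."""
    return parse(TIMELINE[end]) - parse(TIMELINE[start])


print("Time to investigation:", elapsed("impact_start", "investigation_start"))
print("Time to rollback:     ", elapsed("impact_start", "rollback_applied"))
print("Customer impact:      ", elapsed("impact_start", "impact_end"))
print("Incident duration:    ", elapsed("impact_start", "incident_end"))
```

Run against these timestamps, the script reports 4 minutes 7 seconds to the start of the investigation, 29 minutes 59 seconds to the rollback, 30 minutes 14 seconds of customer impact, and 1 hour 2 minutes 14 seconds from impact start to incident end.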

We are committed to improving our systems and processes to prevent such incidents in the future.