In August, we experienced one incident that resulted in degraded performance across GitHub services.
August 14 23:02 UTC (lasting 36 minutes)
On August 14, between 23:02 and 23:38 UTC, all GitHub services on GitHub.com were inaccessible for all users.
At 22:59 UTC, an erroneous configuration change was rolled out to GitHub.com databases that impaired their ability to respond to health check pings from the routing service. As a result, the routing service considered these database hosts unhealthy and the production read-only database endpoint became inaccessible, so the application could no longer read critical data. This caused the widespread impact on GitHub.com starting at 23:02 UTC. There was no data loss or corruption during this incident.
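For illustration only, the sketch below shows roughly how a routing layer of this kind can take an entire read path offline when health checks start failing: once every backend crosses the failure threshold, the read endpoint has no hosts left to serve traffic. The host names, threshold, and ping stub here are hypothetical simplifications, not a description of our actual routing service.

```python
import time

UNHEALTHY_THRESHOLD = 3  # consecutive failed pings before a host is pulled (hypothetical value)


def ping(host: str) -> bool:
    """Stand-in for a health check ping (e.g. a trivial query against the host).

    In the incident described above, a configuration change prevented the
    databases from answering such pings, so here we simulate every ping failing.
    """
    return False


def update_pool(failure_counts: dict[str, int]) -> list[str]:
    """Run one round of health checks and return the hosts still considered healthy."""
    healthy = []
    for host, failures in failure_counts.items():
        if ping(host):
            failure_counts[host] = 0
            healthy.append(host)
        else:
            failure_counts[host] = failures + 1
            if failure_counts[host] < UNHEALTHY_THRESHOLD:
                healthy.append(host)  # failing, but not yet past the threshold
    return healthy


if __name__ == "__main__":
    pool = {"db-replica-1": 0, "db-replica-2": 0, "db-replica-3": 0}
    for round_num in range(1, 5):
        healthy = update_pool(pool)
        # Once the healthy list is empty, the read-only endpoint has no
        # backends and application reads begin to fail.
        print(f"round {round_num}: healthy read endpoints = {healthy}")
        time.sleep(0.1)
```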
We mitigated the incident by reverting the configuration change and confirming restored connectivity to our databases. At 23:38 UTC, traffic resumed and all services recovered to full health. Out of an abundance of caution, we continued to monitor before resolving the incident at 00:30 UTC on August 15.
To prevent recurrence, we have implemented additional guardrails in our database change management process. We are also prioritizing several repair items, such as faster rollback functionality and greater resilience to dependency failures. Given the severity of this incident, all repair items are being worked on at the highest priority.
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.