Postmortem: unexpected 500 error occurred during user sign-in
Incident report for the unexpected 500 error returned from authentication services on Jul 18, 2024.
Summary
On Jul 18, 2024, Logto Cloud experienced a service outage in which the authentication services returned 500 Internal Server Error responses.
- Affected users: All Cloud users attempting to authenticate
- Affected regions: Europe and US
- Severity: Critical, disrupting user sign-in experience
Root cause
During a recent Cloud deployment, a breaking change in the database schema caused the sign-in experience API to fail during the transition between the staging and production environments.
Timeline
- 2024-07-18 08:57 (UTC): Updates deployed to Logto Cloud
- 2024-07-18 09:28 (UTC): First user reported a 500 error
- 2024-07-18 09:31 (UTC): Dev team acknowledged the issue and began investigating
- 2024-07-18 09:32 (UTC): The issue was automatically resolved
- 2024-07-18 09:40 (UTC): Root cause identified
Incident analysis
What was the breaking database change, and why was it made?
We are currently developing a new feature called "Bring your UI", which allows users to customize the Logto sign-in experience with their own web pages. This feature requires a new column in the sign-in-exp table to store the custom UI configuration.
Due to requirement changes during development, the feature release was delayed, but the first part of the schema change had already been deployed to production several weeks earlier, even though it was not yet in use. An update to the database column was introduced in this PR.
Unfortunately, this change was not backward compatible, causing API requests from the old code to fail when communicating with the new database.
How do we deploy a new version of Logto Cloud?
When deploying a new version of Logto Cloud, we first deploy it to the staging environment and then swap the staging and production environments. The process is as follows:
- Run the database alteration script to update the database.
- Deploy the new source code to the staging server.
- Run the staging server and perform tests.
- Swap the staging and production servers so that the "staging" becomes "production", allowing users to access the new version without downtime.
However, both environments share the same database, and the entire process takes time. During the window between the database update and the environment swap, online users are still served by the old code in the production environment, which is now talking to the new database.
This was the root cause of the incident. It also explains why the issue resolved itself 35 minutes after the deployment started (08:57 to 09:32 UTC): once the swap completed, production was running the new code, which matched the new schema.
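The failure window can be illustrated with a small model of the four steps above. This is a sketch under stated assumptions, not Logto's deployment code: the column names `custom_ui_v1` and `custom_ui_v2`, the version labels, and the `simulateDeployment` helper are all hypothetical, and a code version is treated as "healthy" when every column it uses exists in the shared database.

```typescript
// Hypothetical model of the deployment steps; names are illustrative,
// not Logto's real schema or tooling.
type Step = 'alter database' | 'deploy to staging' | 'test staging' | 'swap environments';

interface CodeVersion {
  name: string;
  // Columns this version reads or writes on the shared table.
  columns: readonly string[];
}

interface StepReport {
  step: Step;
  productionCode: string;
  productionHealthy: boolean;
}

const steps: readonly Step[] = [
  'alter database',
  'deploy to staging',
  'test staging',
  'swap environments',
];

function simulateDeployment(
  oldCode: CodeVersion,
  newCode: CodeVersion,
  schemaAfterAlteration: readonly string[]
): StepReport[] {
  // The alteration runs first, so every step already sees the new schema.
  const schema = new Set(schemaAfterAlteration);

  return steps.map((step) => {
    // Production keeps serving the old code until the swap completes.
    const live = step === 'swap environments' ? newCode : oldCode;
    return {
      step,
      productionCode: live.name,
      productionHealthy: live.columns.every((column) => schema.has(column)),
    };
  });
}

const oldCode: CodeVersion = { name: 'old', columns: ['id', 'custom_ui_v1'] };
const newCode: CodeVersion = { name: 'new', columns: ['id', 'custom_ui_v2'] };

// A non-backward-compatible alteration: the old column is replaced outright.
const report = simulateDeployment(oldCode, newCode, ['id', 'custom_ui_v2']);

for (const { step, productionCode, productionHealthy } of report) {
  console.log(`${step}: production runs ${productionCode} code, healthy=${productionHealthy}`);
}
```

In this model, production is unhealthy for the first three steps and recovers only at the swap, which mirrors the self-resolving behavior seen in the timeline.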
Why was this not addressed in the code review process?
We do have a CI task that checks the backward compatibility of database changes. However, passing this check was previously not required before merging a PR. The reason is that development phases are usually short, lasting only a few sprints, so the first and second parts of a schema change usually ship in the same release.
This time, the feature release was delayed, spreading the schema changes across two releases. The developer assumed the CI failure was expected and informed the reviewers that it shouldn't block the PR from merging.
There was also a communication gap, and the PR was ultimately merged without the necessary backward compatibility support.
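This report does not describe how the compatibility check is implemented. As a rough illustration of what such a check might look for, the sketch below compares a table's columns before and after an alteration and flags changes that old code could not tolerate. The `Column` shape, the `findBreakingChanges` function, and the column names are hypothetical; a real check might instead run the previous release's tests against the altered database.

```typescript
// Hypothetical column description; a real check would read this
// from the database catalog or from the alteration scripts.
interface Column {
  type: string;
  nullable: boolean;
  hasDefault: boolean;
}

type TableSchema = Record<string, Column>;

function findBreakingChanges(before: TableSchema, after: TableSchema): string[] {
  const problems: string[] = [];

  for (const name of Object.keys(before)) {
    if (!(name in after)) {
      // Old code still references this column by name.
      problems.push(`column "${name}" was dropped or renamed`);
    } else if (after[name].type !== before[name].type) {
      problems.push(
        `column "${name}" changed type from ${before[name].type} to ${after[name].type}`
      );
    }
  }

  for (const name of Object.keys(after)) {
    const column = after[name];
    // Old code does not know about new columns, so its INSERTs omit them.
    if (!(name in before) && !column.nullable && !column.hasDefault) {
      problems.push(`new column "${name}" is NOT NULL without a default`);
    }
  }

  return problems;
}

const before: TableSchema = {
  id: { type: 'varchar', nullable: false, hasDefault: false },
  custom_ui_v1: { type: 'varchar', nullable: true, hasDefault: false },
};

// Replacing the column outright is flagged as breaking.
const replaced: TableSchema = {
  id: before.id,
  custom_ui_v2: { type: 'jsonb', nullable: true, hasDefault: false },
};

// Adding a nullable column next to the old one is not.
const additive: TableSchema = { ...before, custom_ui_v2: replaced.custom_ui_v2 };

const breaking = findBreakingChanges(before, replaced);
const safe = findBreakingChanges(before, additive);
console.log(breaking, safe);
```

A check of this kind only helps when a failure blocks the merge, which is the gap this incident exposed.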
Lessons learned
- When making a breaking change in the database schema, we should always consider the backward compatibility with the old version of source code.
- When altering a database column, we should avoid changing it in the schema directly and instead use a deprecation and migration approach.
- Developers should be more aware of the release process and timeline.
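One common way to apply the deprecation and migration approach is to spread the change over several releases: add the new column while keeping the old one, move the code over, and drop the old column only once no running code depends on it. The sketch below is a minimal model of that idea, not Logto's migration tooling. The `Release` shape, the `findUnsafeReleases` helper, and the column names are hypothetical. A release counts as safe when both the outgoing and the incoming code can work with the schema that its alteration produces.

```typescript
// Hypothetical release plan model; column names are illustrative only.
interface Release {
  name: string;
  // Columns on the table after this release's alteration script runs.
  schemaAfter: readonly string[];
  // Columns the code shipped with this release depends on.
  codeColumns: readonly string[];
}

// Returns the names of releases whose alteration would break either the code
// still serving production during the rollout or the code being rolled out.
function findUnsafeReleases(
  initialCodeColumns: readonly string[],
  releases: readonly Release[]
): string[] {
  const unsafe: string[] = [];
  let liveCodeColumns = initialCodeColumns;

  for (const release of releases) {
    const schema = new Set(release.schemaAfter);
    const outgoingOk = liveCodeColumns.every((column) => schema.has(column));
    const incomingOk = release.codeColumns.every((column) => schema.has(column));
    if (!outgoingOk || !incomingOk) {
      unsafe.push(release.name);
    }
    // After the swap, the incoming code becomes the live code.
    liveCodeColumns = release.codeColumns;
  }

  return unsafe;
}

const initialCode = ['id', 'custom_ui_v1'];

// Direct change: the old column disappears while old code is still live.
const directChange: Release[] = [
  { name: 'replace column', schemaAfter: ['id', 'custom_ui_v2'], codeColumns: ['id', 'custom_ui_v2'] },
];

// Deprecation and migration: expand, migrate, then contract.
const expandContract: Release[] = [
  { name: 'expand', schemaAfter: ['id', 'custom_ui_v1', 'custom_ui_v2'], codeColumns: ['id', 'custom_ui_v1', 'custom_ui_v2'] },
  { name: 'migrate', schemaAfter: ['id', 'custom_ui_v1', 'custom_ui_v2'], codeColumns: ['id', 'custom_ui_v2'] },
  { name: 'contract', schemaAfter: ['id', 'custom_ui_v2'], codeColumns: ['id', 'custom_ui_v2'] },
];

const directResult = findUnsafeReleases(initialCode, directChange);
const stagedResult = findUnsafeReleases(initialCode, expandContract);
console.log(directResult, stagedResult);
```

The staged plan takes more releases, but at no point does running code depend on a column that has already been removed.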
Corrective and preventative measures
- ✅ The database backward compatibility CI check is now required to pass before merging a PR that contains a schema change.
- ✅ Emphasize the importance of backward compatibility within the team.