Postmortem: unexpected 500 error occurred during user sign-in

Incident report for the unexpected 500 error returned from authentication services on Jul 18, 2024.

Charles

Developer

7/18/2024

Summary

On Jul 18, 2024, Logto Cloud experienced a service outage in which the authentication services returned 500 Internal Server Error responses.

Affected users: All Cloud users attempting to authenticate
Affected regions: Europe and US
Severity: Critical, disrupting user sign-in experience

Root cause

During a recent Cloud deployment, a breaking change in the database schema caused the sign-in experience API to fail in the window between the database update and the swap of the staging and production environments.

Timeline

2024-07-18 08:57 (UTC): Updates deployed to Logto Cloud
2024-07-18 09:28 (UTC): First user reported a 500 error
2024-07-18 09:31 (UTC): Dev team acknowledged the issue and began investigating
2024-07-18 09:32 (UTC): The issue was automatically resolved
2024-07-18 09:40 (UTC): Root cause identified

Incident analysis

What was the breaking database change, and why was it made?

We are currently developing a new feature called "Bring your UI", which allows users to customize the Logto sign-in experience with their own web pages. This feature requires a new column in the sign-in-exp table to store the custom UI configuration.

Because of requirement changes during development, the feature release was delayed. The first part of the schema change had already been deployed to production several weeks earlier, although it was not yet in use. A further update to that database column was introduced in this PR.

Unfortunately, this change was not backward compatible, causing API requests from the old code to fail when communicating with the new database.
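To make the failure mode concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the production database. The table name `sign_in_exp`, the column names `custom_ui` and `custom_ui_assets`, and the choice of a rename are all hypothetical; the actual change in the PR may have been of a different kind.

```python
import sqlite3

# In-memory SQLite stands in for the shared database. Assumption: the real
# engine, table, and column names differ; these are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sign_in_exp (id TEXT PRIMARY KEY, custom_ui TEXT)")
db.execute("INSERT INTO sign_in_exp VALUES ('default', '{}')")


def old_code_read(conn: sqlite3.Connection) -> tuple:
    """The query as written in the previously released code (hypothetical)."""
    return conn.execute("SELECT id, custom_ui FROM sign_in_exp").fetchone()


# Against the old schema, the old code works.
before = old_code_read(db)

# The alteration script changes the column in place (hypothetical rename).
db.execute("ALTER TABLE sign_in_exp RENAME COLUMN custom_ui TO custom_ui_assets")

try:
    old_code_read(db)
    old_code_failed = False
except sqlite3.OperationalError:
    # The old code still selects a column that no longer exists. An API
    # built on this query would surface the failure as a 500 error.
    old_code_failed = True
```

The point of the sketch is only that the query text baked into already-running code does not change when the schema does.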

How do we deploy a new version of Logto Cloud?

When deploying a new version of Logto Cloud, we first deploy it to the staging environment and then swap the staging and production environments. The process is as follows:

Run the database alteration script to update the database.
Deploy the new source code to the staging server.
Run the staging server and perform tests.
Swap the staging and production servers so that the "staging" becomes "production", allowing users to access the new version without downtime.

However, both environments share the same database, and the entire process takes time. As a result, in the window between the database update and the environment swap, online users are still served by the old code in production while that code talks to the already-updated database.

This was the root cause of the incident. It is also why the issue resolved itself after 35 minutes: once the swap completed, production was running code that matched the new schema.
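The durations implied by the timeline can be checked with a few lines of arithmetic. This sketch assumes that the 08:57 entry marks the start of the deployment (and therefore of the exposure window) and that the 09:32 entry marks the moment the swap finished; the variable and function names are illustrative.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident date (2024-07-18) as UTC."""
    parsed = datetime.strptime(f"2024-07-18 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the timeline section.
deployed = utc("08:57")
first_report = utc("09:28")
acknowledged = utc("09:31")
resolved = utc("09:32")

exposure_window = minutes_between(deployed, resolved)            # 35
time_to_first_report = minutes_between(deployed, first_report)   # 31
time_to_acknowledge = minutes_between(first_report, acknowledged)  # 3
```

Under those assumptions, the first user report arrived 31 minutes into a 35-minute window, so most of the exposure passed before the team heard about the problem.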

Why was this not addressed in the code review process?

We DO have a CI task that checks the backward compatibility of database changes. However, passing this check was previously not required before merging a PR. This is because development phases are usually short, lasting only a few sprints, so the first and second parts of a schema change are usually included in the same release.

This time, the feature release was delayed, spreading the schema changes across two releases. The developer assumed the CI failure was expected and informed the reviewers that it shouldn't block the PR from merging.

There was also a communication gap, and in the end the PR was merged without the necessary backward compatibility support.

Lessons learned

When making a breaking change in the database schema, we should always consider backward compatibility with the old version of the source code.
When altering a database column, we should avoid changing it directly in the schema and instead use the deprecation and migration approach.
Developers should be more aware of the release process and timeline.
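One common reading of the deprecation and migration approach is the "expand, migrate, contract" sequence, sketched below with Python's built-in sqlite3 module. This is an illustration under assumptions, not the migration Logto ran: the table `sign_in_exp` and the columns `custom_ui` and `custom_ui_assets` are hypothetical names, and SQLite stands in for the real database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sign_in_exp (id TEXT PRIMARY KEY, custom_ui TEXT)")
db.execute("INSERT INTO sign_in_exp VALUES ('default', 'v1')")


def old_code_read(conn: sqlite3.Connection) -> str:
    # Already-released code: only knows the original column.
    row = conn.execute(
        "SELECT custom_ui FROM sign_in_exp WHERE id = 'default'"
    ).fetchone()
    return row[0]


def new_code_read(conn: sqlite3.Connection) -> str:
    # New code: prefers the new column, falls back to the deprecated one.
    row = conn.execute(
        "SELECT COALESCE(custom_ui_assets, custom_ui) "
        "FROM sign_in_exp WHERE id = 'default'"
    ).fetchone()
    return row[0]


# Step 1 (expand): add the new column as nullable; leave the old one intact.
db.execute("ALTER TABLE sign_in_exp ADD COLUMN custom_ui_assets TEXT")

# Step 2 (migrate): backfill the new column from the deprecated one.
db.execute(
    "UPDATE sign_in_exp SET custom_ui_assets = custom_ui "
    "WHERE custom_ui_assets IS NULL"
)

# During the deployment window both code versions can read the table.
old_result = old_code_read(db)
new_result = new_code_read(db)

# Step 3 (contract) belongs to a LATER release, after no deployed code
# reads custom_ui any more; it is intentionally not executed here.
```

Because the old column survives until a later release, the order of "update database" and "swap environments" no longer matters for correctness.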

Corrective and preventative measures

✅ The database backward compatibility CI check is now required to pass before merging a PR that contains a schema change.
✅ Emphasize the importance of backward compatibility within the team.
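As a rough illustration of what a backward compatibility check can look for, the sketch below compares two schema descriptions and reports changes that would break code written for the older one. It is not Logto's actual CI task, whose implementation is not described here; the `Column` type, the `breaking_changes` function, and the column names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    nullable: bool = True
    has_default: bool = False


def breaking_changes(old: list[Column], new: list[Column]) -> list[str]:
    """Return reasons the new schema could break code written for the old one."""
    new_names = {col.name for col in new}
    old_names = {col.name for col in old}
    problems: list[str] = []
    # Old code may still read or write any column that existed before.
    for col in old:
        if col.name not in new_names:
            problems.append(f"column '{col.name}' was removed or renamed")
    # Old code does not know about new columns, so its inserts fail if a
    # new column is required and has no default value.
    for col in new:
        if col.name not in old_names and not col.nullable and not col.has_default:
            problems.append(f"new column '{col.name}' is NOT NULL without a default")
    return problems


old_schema = [Column("id", nullable=False), Column("custom_ui")]
renamed_schema = [Column("id", nullable=False), Column("custom_ui_assets")]
expanded_schema = [
    Column("id", nullable=False),
    Column("custom_ui"),
    Column("custom_ui_assets"),
]

rename_report = breaking_changes(old_schema, renamed_schema)
expand_report = breaking_changes(old_schema, expanded_schema)
```

A CI step built on such a function would fail the build whenever the report is non-empty, which is what making the check required amounts to.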
