  • postmortem
  • cloud-service
  • incident
  • db-alterations

Postmortem: undeployed database alterations exception

Incident report for the 2025-07-17 undeployed database alterations exception.

Simeng
Developer

Brief summary

On 2025-07-17 at 02:50PM (UTC), we received an incident alert indicating that the Logto core service in the AU region was unavailable. Core service deployment was blocked by an undeployed database alterations exception. The incident was caused by a timestamp format bug in our latest database alteration script. It was resolved by deploying the latest master branch, which included a fix for the timestamp format bug.

  • Affected users: All users in the AU region.
  • Severity: Critical, affecting all users in the AU region.
  • Duration: About 1 hour (02:49PM to 04:00PM UTC).

Background

In Logto, we use database alteration scripts to modify the database schema during each deployment. These scripts are executed automatically during the deployment process to ensure that the database schema is always up-to-date with the latest code changes.

Each alteration script file is named using the format <version>-<timestamp>-<description>.ts, where the timestamp is a Unix timestamp with a precision of seconds (e.g., 1.26.0-1740982044-add-one-time-tokens-table.ts).

After each successful deployment, once all new database alteration scripts have been executed, we record the timestamp of the latest executed script in a dedicated table within the database system. This timestamp is used to track the current status of database alterations.

Before running any database alteration scripts during a deployment, we retrieve this timestamp from the database. We then compare it against the timestamps of the available alteration scripts to determine which scripts are new and need to be executed. This mechanism ensures that every alteration script is tracked in the codebase, but scripts that have already been executed will not run again, preventing duplicate executions.
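The selection step described above can be sketched roughly as follows. This is a minimal illustration rather than Logto's actual implementation: the function names, the filename pattern (which assumes the version is either `next` or a semver-like string), the second filename, and the recorded timestamp value are all assumptions made for the example.

```typescript
// Hypothetical sketch; names are illustrative and not taken from the Logto codebase.

// Assumed filename shape: <version>-<timestamp>-<description>.ts,
// where <version> is `next` or a semver-like string such as `1.26.0`.
const alterationFilenamePattern = /^(?:next|\d+\.\d+\.\d+)-(\d+)-.+\.ts$/;

// Extract the numeric timestamp segment from an alteration filename.
const parseAlterationTimestamp = (filename: string): number => {
  const match = alterationFilenamePattern.exec(filename);
  if (!match || match[1] === undefined) {
    throw new Error(`Unexpected alteration filename: ${filename}`);
  }
  return Number(match[1]);
};

// Return the scripts that are newer than the recorded timestamp, oldest first.
const getUndeployedAlterations = (
  filenames: readonly string[],
  deployedTimestamp: number
): string[] =>
  filenames
    .filter((filename) => parseAlterationTimestamp(filename) > deployedTimestamp)
    .sort((a, b) => parseAlterationTimestamp(a) - parseAlterationTimestamp(b));

const pending = getUndeployedAlterations(
  [
    '1.26.0-1740982044-add-one-time-tokens-table.ts',
    '1.25.0-1730000000-example-older-change.ts', // hypothetical filename
  ],
  1735000000 // hypothetical value of the timestamp recorded in the database
);
console.log(pending); // only the 1740982044 script is newer than the record
```

Because the decision rests on a single numeric comparison, it only works if every timestamp uses the same unit.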

Timeline

  1. 2025-07-10 03:28AM (UTC): A new deployment was triggered in all regions with the latest master branch. During this deployment, a database alteration script with an invalid timestamp format was included. The deployment succeeded in all regions, and the timestamp record in the database was updated to this invalid format.

  2. 2025-07-16 01:49AM (UTC): We discovered that the timestamp format of the latest alteration script was invalid and created a PR to correct the file name.

  3. 2025-07-16 02:00PM (UTC): Since the alteration script had already been executed on Logto Cloud, we manually updated the timestamp record in all regions to match the valid format.

  4. 2025-07-17 02:49PM (UTC): The AU core service instance triggered an automatic refresh, and the core service failed to start due to an undeployed database alterations exception. (The corrected timestamp retrieved from the database did not match the timestamp of the legacy alteration script in the codebase.)

  5. 2025-07-17 02:50PM (UTC): We received an incident alert indicating that the Logto core service in the AU region was unavailable.

  6. 2025-07-17 02:50PM (UTC): We acknowledged the incident and started investigating.

  7. 2025-07-17 03:30PM (UTC): We identified the root cause of the incident and started a hotfix deployment to the AU region.

  8. 2025-07-17 04:00PM (UTC): The hotfix deployment was completed successfully, and the core service in the AU region was restored.

Root cause breakdown

As mentioned in the background section, we use a timestamp format with second-level precision for naming database alteration scripts. The filename of the script that ultimately caused the incident was next-1751529530394-add-enable-token-storage-column-to-sso-connectors-table.ts, which used a timestamp of 1751529530394 (millisecond precision). This format is incompatible with the second-based timestamp convention used in our codebase.
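To make the unit difference concrete: a 13-digit value of this magnitude is what JavaScript's `Date.now()` returns (milliseconds since the Unix epoch), while the naming convention expects whole seconds. How the original filename was generated is not stated here, so the helper below is only an illustrative sketch with a hypothetical name.

```typescript
// Hypothetical helper: convert a millisecond epoch value (e.g. from Date.now())
// into the second-precision timestamp expected in alteration filenames.
const toAlterationTimestamp = (milliseconds: number): number =>
  Math.floor(milliseconds / 1000);

const invalidTimestamp = 1751529530394; // millisecond precision, 13 digits
const validTimestamp = toAlterationTimestamp(invalidTimestamp);

console.log(validTimestamp); // 1751529530
console.log(new Date(validTimestamp * 1000).toISOString()); // 2025-07-03T07:58:50.000Z
```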

When this script was executed, the timestamp record in the database was updated to this invalid, millisecond-based value. Because this value is roughly 1,000 times larger than any valid second-based timestamp, the system would incorrectly treat all subsequent alteration scripts with valid timestamps as already deployed, causing them to be skipped.

However, this did not cause any problems until we noticed the timestamp format bug and started to fix it.

In the fix PR, we changed the timestamp in the script filename from 1751529530394 to 1751529530, but did not deploy this change right away. Instead, we manually updated the timestamp record in the database to the corrected value 1751529530.

This manual update caused the timestamp in the database to no longer match the timestamp of the alteration script in the codebase, which was still 1751529530394. As a result, when the AU core service instance attempted to refresh, it encountered an undeployed database alterations exception, as the system considered the alteration script not executed due to the mismatch in timestamps.
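The three states the AU region passed through can be summarized in a small sketch. The "script timestamp greater than recorded timestamp means undeployed" rule is an assumption about how the comparison works, and the type and function names are hypothetical; the timestamp values are the ones from this incident.

```typescript
// Hypothetical model of the startup check; not Logto's actual code.
type AlterationState = {
  label: string;
  recordedInDatabase: number; // timestamp stored in the database
  latestScriptInCode: number; // timestamp in the deployed script's filename
};

// Assumed rule: a script is undeployed if its timestamp is newer than the record.
const hasUndeployedAlteration = (state: AlterationState): boolean =>
  state.latestScriptInCode > state.recordedInDatabase;

const states: AlterationState[] = [
  {
    label: 'after the 2025-07-10 deployment',
    recordedInDatabase: 1751529530394,
    latestScriptInCode: 1751529530394,
  },
  {
    label: 'after the manual database update, fix not yet deployed',
    recordedInDatabase: 1751529530,
    latestScriptInCode: 1751529530394,
  },
  {
    label: 'after the hotfix deployment',
    recordedInDatabase: 1751529530,
    latestScriptInCode: 1751529530,
  },
];

const results = states.map((state) => hasUndeployedAlteration(state));
for (const [index, state] of states.entries()) {
  console.log(`${state.label}: ${results[index] ? 'startup blocked' : 'ok'}`);
}
```

Only the middle state, where the database record had been corrected but the running code still carried the old filename, reports an undeployed alteration.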

Resolution

To resolve the incident, we deployed the latest master branch, which included the fix for the timestamp format bug. This deployment restored the correct timestamp in the alteration script filename and ensured that the timestamp record in the database matched it.

Takeaways

  1. In the fix PR, we have included a validation step to ensure that the timestamp format of alteration scripts is consistent with the second-based precision convention. Alteration script filenames that contain millisecond-based timestamps will be rejected during the CI process, preventing similar issues in the future.

  2. Do not update the timestamp record in the database to a value that does not match the timestamp of the alteration script in the deployed codebase. We should have deployed the fix PR first, and only then updated the timestamp record in the database.
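A CI check along the lines of the first takeaway might look like the sketch below. The actual validation in the fix PR is not shown in this report, so the function name, the filename pattern, and the "exactly 10 digits" rule (which holds for Unix seconds until the year 2286) are assumptions.

```typescript
// Hypothetical CI validation sketch; not the check shipped in the fix PR.

// Assumed filename shape: <version>-<timestamp>-<description>.ts
const alterationFilenamePattern = /^(?:next|\d+\.\d+\.\d+)-(\d+)-[\w-]+\.ts$/;

// Return an error message for an invalid filename, or undefined if it is valid.
const validateAlterationFilename = (filename: string): string | undefined => {
  const match = alterationFilenamePattern.exec(filename);
  if (!match) {
    return `${filename}: expected <version>-<timestamp>-<description>.ts`;
  }
  const digits = match[1] ?? '';
  // Second-precision Unix timestamps have 10 digits; 13 digits means milliseconds.
  if (digits.length !== 10) {
    return `${filename}: timestamp must be 10-digit Unix seconds, got ${digits.length} digits`;
  }
  return undefined;
};

const errors = [
  '1.26.0-1740982044-add-one-time-tokens-table.ts',
  'next-1751529530394-add-enable-token-storage-column-to-sso-connectors-table.ts',
]
  .map((filename) => validateAlterationFilename(filename))
  .filter((error): error is string => error !== undefined);

// In CI, a non-empty list would fail the job.
console.log(errors);
```

Checking the digit count rather than comparing against the current time keeps the check deterministic, at the cost of not catching a second-precision timestamp that is simply wrong.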