Postmortem: Logto auth service outage
On June 12, 2025, Logto services on `logto.app` were briefly disrupted by a Cloudflare outage affecting request routing. The issue was resolved quickly, with no impact on data security or core services.
Incident summary
Between 18:07 and 18:58 UTC on June 12, 2025, users accessing Logto services via the `logto.app` domain (including custom domains) experienced errors. The disruption was due to an outage in Cloudflare Workers KV, which impacted our routing layer. Logto Cloud services and Logto Console, which use direct DNS resolution and do not depend on Cloudflare Workers, were not affected. Service was restored within an hour, with no impact on data security.
Timeline (UTC)
- 18:07: Logto auth service APIs began returning 500 errors for requests via `logto.app`.
- 18:24: Investigation confirmed the Azure backend was healthy and isolated the issue to Cloudflare Workers and KV.
- 18:48: Cloudflare officially acknowledged an incident affecting Workers and KV.
- 18:58: We deployed a temporary workaround by removing the caching logic, which restored service with possible minor performance degradation.
- 21:00: After Cloudflare services stabilized, we redeployed the cache logic with a graceful fallback. Full performance was restored and the service is now resilient to similar KV outages.
Root cause
This incident was triggered by downtime in Cloudflare Workers KV. Our Cloudflare Worker routes requests to the correct Logto region for each tenant or domain to ensure proper data residency and compliance. To improve performance, the Worker uses KV to cache these region mappings. When KV became unavailable, cache operations failed and the Worker threw errors instead of falling back to a no-cache behavior, causing service disruption.
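To illustrate that failure mode, the sketch below shows how an unguarded KV dependency in a Worker's routing path can turn a cache outage into request failures. The binding name, helper function, and endpoints are hypothetical and not Logto's actual code; the `KVNamespace` type comes from `@cloudflare/workers-types`.

```ts
// Minimal sketch of the pre-fix pattern (illustrative only; names are made up).
// KVNamespace is provided by @cloudflare/workers-types.

interface Env {
  REGION_CACHE: KVNamespace; // assumed KV binding for tenant/domain -> region mappings
}

// Hypothetical direct lookup against an internal routing service.
async function resolveRegion(host: string): Promise<string> {
  const res = await fetch(
    `https://routing.internal.example/region?host=${encodeURIComponent(host)}`,
  );
  return res.text();
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const host = new URL(request.url).hostname;

    // The KV read is awaited with no error handling, so during a KV outage
    // this call rejects and the whole request surfaces as a 500.
    let region = await env.REGION_CACHE.get(host);

    if (!region) {
      region = await resolveRegion(host);
      // The cache write fails the same way while KV is down.
      await env.REGION_CACHE.put(host, region, { expirationTtl: 300 });
    }

    // Hypothetical proxying to the resolved regional deployment.
    const url = new URL(request.url);
    url.hostname = `${region}.origin.example`;
    return fetch(new Request(url.toString(), request));
  },
};
```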
Logto Cloud services and Logto Console were not affected because they rely on direct DNS resolution and do not use Cloudflare Workers for routing.
Resolution and improvements
- Removed the caching dependency from the Worker, restoring service.
- After Cloudflare KV recovered, redeployed the cache logic with a graceful fallback: if the cache is unavailable, service continues using direct routing without disruption (see the sketch after this list).
- Ongoing improvements to infrastructure to further increase resilience and availability.
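The graceful fallback can be sketched roughly as below: the cache is treated as strictly optional, so any KV failure degrades to a direct lookup rather than failing the request. This is an illustration under the same assumptions as the previous sketch, not Logto's actual implementation.

```ts
// Sketch of a cache-with-graceful-fallback lookup (illustrative names only).
// KVNamespace is provided by @cloudflare/workers-types.

async function getRegion(
  host: string,
  cache: KVNamespace,
  resolveRegion: (host: string) => Promise<string>, // direct, uncached lookup
): Promise<string> {
  try {
    const cached = await cache.get(host);
    if (cached) return cached;
  } catch {
    // KV outage: ignore and fall through to direct routing,
    // exactly as if the cache had missed.
  }

  const region = await resolveRegion(host);

  try {
    await cache.put(host, region, { expirationTtl: 300 });
  } catch {
    // Cache writes are best-effort; a failed write must not fail the request.
  }

  return region;
}
```

With this shape, a KV outage only costs the cache's latency benefit; the request path itself no longer depends on KV being available.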
Impact
- Users accessing Logto via `logto.app` experienced errors for about 50 minutes.
- No customer data was lost or compromised.
- Logto Cloud services and Logto Console remained fully operational.
Next steps
- We will review and improve our error handling in edge infrastructure.
- We will explore using multiple vendors for upstream infrastructure to avoid single points of failure.
Thank you for your patience and support.