Stitch API unavailable due to unexpected Azure downtime

Incident Report for Stitch

Postmortem

Incident Description

Stitch experienced downtime from 23:30 to 06:00 SAST because of an unplanned outage at one of our hosting providers, Azure.

Involved Parties

  • Platform reliability engineers.

Actions Taken

  • An immediate assessment of affected services and products was conducted.
  • Database backup restoration processes were initiated.

Root Cause

An unplanned outage at one of our hosting providers, Azure, rendered Stitch's Postgres databases inaccessible. Azure's official response at the time was that scheduled maintenance on our Azure Database for PostgreSQL - Flexible Server instance was taking longer than expected. The maintenance itself was scheduled, but the downtime it caused was not planned.

Resolution

A replica database was restored using a previous backup.

Preventative Measures

  • Automate production database backup restoration.
  • Enable high availability for automatic failover.
  • Alert on Azure maintenance events, even when no downtime is expected.
  • Move the Azure Postgres maintenance window from its default to a window within observable hours.
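
The maintenance-window measure above can be sanity-checked with a small script. What follows is a minimal sketch, not Stitch's actual tooling. It assumes the window is specified as a UTC start time with a fixed duration, and that "observable hours" means 08:00 to 17:00 SAST. Both values, and all function names, are hypothetical and chosen only for illustration.

```python
from datetime import datetime, time, timedelta, timezone

# SAST is UTC+2 all year (South Africa does not observe daylight saving).
SAST = timezone(timedelta(hours=2), "SAST")

# Hypothetical "observable hours": when engineers are assumed to be online.
OBSERVABLE_START = time(8, 0)
OBSERVABLE_END = time(17, 0)


def window_in_sast(start_hour_utc, start_minute_utc=0, duration_minutes=60):
    """Convert a UTC maintenance window to (start, end) datetimes in SAST."""
    # The reference date is arbitrary; only the time of day matters here.
    start_utc = datetime(2023, 1, 2, start_hour_utc, start_minute_utc,
                         tzinfo=timezone.utc)
    start = start_utc.astimezone(SAST)
    end = start + timedelta(minutes=duration_minutes)
    return start, end


def is_observable(start_hour_utc, start_minute_utc=0, duration_minutes=60):
    """Return True if the whole window falls inside observable hours (SAST)."""
    start, end = window_in_sast(start_hour_utc, start_minute_utc,
                                duration_minutes)
    return (
        start.date() == end.date()  # window must not cross midnight
        and OBSERVABLE_START <= start.time()
        and end.time() <= OBSERVABLE_END
    )
```

For example, `is_observable(7)` checks a window starting at 07:00 UTC (09:00 SAST), while `is_observable(21, 30)` checks one starting at 21:30 UTC (23:30 SAST), the local time at which this incident began.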

Conclusion

Microsoft Azure’s PostgreSQL database infrastructure suffered unplanned downtime during routine maintenance, which resulted in degraded API performance, database inaccessibility and delayed webhook dispatches.

Microsoft Azure will provide a detailed root cause analysis within the next 5 business days, at which point we will append it to this report.

Posted May 26, 2023 - 11:51 SAST

Resolved

This incident has been resolved.
Posted May 26, 2023 - 07:17 SAST

Update

Stitch has switched to a redundant provider and services are back to being fully operational. We will continue to closely monitor this service's health.
Posted May 26, 2023 - 06:31 SAST

Update

We are continuing to monitor for any further issues.
Posted May 26, 2023 - 06:17 SAST

Monitoring

The Stitch API is currently unavailable due to unexpected downtime with Azure's PostgreSQL servers.
Posted May 26, 2023 - 00:30 SAST
This incident affected: Stitch (Stitch API).