Delayed Webhook Delivery
Incident Report for Stitch
Postmortem

Incident Description

On January 25th, 2023 at 10am, delivery of webhooks began to be delayed across multiple products. Over 10,000 webhooks were affected. The incident was identified at 4pm the same day.

Involved Parties

  • 2x Support
  • 3x Engineers on call
  • Additional engineers and support members had light-touch involvement in the investigation and resolution of the incident.

Actions Taken

The above-named parties immediately launched an investigation to determine the cause of the incident. They found that only a single dispatcher was active and that a queue of webhooks had built up for the dead dispatchers. The webhook dispatchers were restarted, which resolved the issue, and they began dispatching the queued webhooks in first-in-first-out (FIFO) order.
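
As an illustration of what FIFO draining of a backlog means, here is a minimal sketch in Python. It is not Stitch's dispatcher code: the `Webhook` fields, the `drain_fifo` function and the `send` callback are all hypothetical names chosen for this example.

```python
from collections import deque
from typing import Callable, Deque, List, NamedTuple


class Webhook(NamedTuple):
    # Hypothetical fields for illustration, not Stitch's actual schema.
    id: str
    enqueued_at: float  # Unix timestamp at which the webhook was queued


def drain_fifo(backlog: Deque[Webhook], send: Callable[[Webhook], None]) -> List[str]:
    """Deliver every queued webhook, oldest first, and return the ids in delivery order."""
    delivered: List[str] = []
    while backlog:
        webhook = backlog.popleft()  # the left end holds the oldest entry
        send(webhook)                # `send` stands in for the real HTTP delivery
        delivered.append(webhook.id)
    return delivered
```

With FIFO ordering, the webhooks that have waited longest are delivered first, so clients receive delayed events in the order in which they were originally queued.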

Root Cause

The Microsoft Azure outage reported earlier in the day corrupted the webhook dispatchers, leaving only a single dispatcher active.

Resolution

The webhook dispatchers were restarted, which resolved the issue. Webhooks then began to be dispatched in FIFO order.

Preventative Measures

  • Improve monitoring and alerting of the webhook dispatchers to ensure they are all active and sharing the load evenly.
  • Upskill the entire team on webhook infrastructure.
  • Monitor third-party statuses (Azure) to reduce third-party risk from future outages.
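
One way the dispatcher monitoring above could work is a heartbeat check that flags any expected dispatcher that has gone silent. The sketch below is a hedged illustration in Python, not the monitoring Stitch has deployed: the function name, the dispatcher names and the 60-second threshold are all assumptions made for this example.

```python
from typing import Dict, Iterable, List


def find_silent_dispatchers(
    last_heartbeat: Dict[str, float],
    expected: Iterable[str],
    now: float,
    max_silence_s: float = 60.0,  # assumed threshold, tune to the real heartbeat interval
) -> List[str]:
    """Return the expected dispatchers with no heartbeat in the last max_silence_s seconds.

    A dispatcher that has never reported a heartbeat is treated as silent.
    """
    never_seen = float("-inf")
    return sorted(
        name
        for name in expected
        if now - last_heartbeat.get(name, never_seen) > max_silence_s
    )


# Example usage with made-up names and timestamps (seconds).
heartbeats = {"dispatcher-1": 1000.0, "dispatcher-2": 400.0}
silent = find_silent_dispatchers(
    heartbeats,
    expected=["dispatcher-1", "dispatcher-2", "dispatcher-3"],
    now=1010.0,
)
if silent:
    print("ALERT: silent dispatchers:", ", ".join(silent))
```

A check of this kind, run on a schedule, would raise an alert as soon as fewer dispatchers than expected are reporting, rather than relying on the backlog being noticed hours later.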

Conclusion

This incident resulted in delayed delivery of webhooks across multiple products. The cause of the incident was determined to be an Azure outage, and the issue was resolved by restarting the webhook dispatchers. Preventative measures have been put in place to prevent similar incidents from occurring in the future.

Posted Jan 26, 2023 - 16:38 SAST

Resolved
This incident has been resolved.
Posted Jan 25, 2023 - 20:26 SAST
Update
We are continuing to monitor for any further issues.
Posted Jan 25, 2023 - 19:03 SAST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 25, 2023 - 19:00 SAST
Identified
Stitch is unfortunately experiencing some operational performance issues. We are investigating the issue and any impact it may be having on our clients. We are working to identify the scope of impact and implement a solution as quickly as possible.
Posted Jan 25, 2023 - 10:00 SAST