Stitch is experiencing some intermittent issues on the API

Incident Report for Stitch

Postmortem

Stitch API Downtime due to deploy

Incident Description

The Stitch API experienced degraded performance, and access to our UI was also affected. This resulted in timeouts and HTTP error statuses (400-403) across all products and clients. The outage lasted approximately 24 minutes.

Involved Parties

  • Platform reliability engineers and service deployment team.

Actions Taken

The relevant engineering teams began work on identifying the root cause of the downtime.

Once the root cause was identified, a manual redeployment of all services was conducted, which fixed the issue shortly afterwards.

Root Cause

During a routine deployment, not all internal services were successfully deployed. This meant that the interface versions the services use to communicate with one another became out of sync.
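
As a rough illustration of this failure mode, the sketch below flags services that were left on an older interface version after a partial deploy. The service names, version labels, and the `find_out_of_sync` helper are hypothetical and are not taken from Stitch's actual tooling.

```python
from typing import Dict, List


def find_out_of_sync(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the names of services whose interface version differs from `expected`."""
    return sorted(name for name, version in deployed.items() if version != expected)


# Hypothetical state after a partial deploy: one service stayed on the old version.
deployed_versions = {"payments": "v2", "accounts": "v2", "webhooks": "v1"}
stale = find_out_of_sync(deployed_versions, expected="v2")
print(stale)  # prints ['webhooks']
```

A check along these lines, run at the end of a deploy, could surface a partial rollout before clients see errors.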

Resolution

A manual redeployment of services was performed to ensure that all services were aligned and using the correct interfaces with one another.

Preventative Measures

We are enforcing stricter service deployment rules to ensure that services still operate even when there are version mismatches.
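
This report does not specify the rules being enforced. As one possible illustration only, the sketch below assumes a "major.minor" versioning scheme in which a service can still serve callers on the same major version whose minor version is no newer than its own. The scheme and the `is_compatible` helper are assumptions made for this example.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Split a 'major.minor' string into integers (assumed version format)."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(caller: str, callee: str) -> bool:
    """Assumed rule: same major version, and the caller's minor is not newer than the callee's."""
    caller_major, caller_minor = parse_version(caller)
    callee_major, callee_minor = parse_version(callee)
    return caller_major == callee_major and caller_minor <= callee_minor


print(is_compatible("2.1", "2.3"))  # prints True: older caller, same major version
print(is_compatible("2.4", "2.3"))  # prints False: caller expects a newer interface
```

Under a rule like this, deploying callee services before their callers would keep a partially completed rollout operational.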

Conclusion

This incident resulted in degraded performance across our API service and was caused by an issue with a GitHub Action during a routine deploy. The team immediately retried this action, which resolved the issue.

Posted Mar 13, 2023 - 13:03 SAST

Resolved

Stitch is experiencing some operational downtime. We are investigating the potential issue that may be causing any impact on our clients. We are working to identify the scope of impact and implement a solution as quickly as possible.
Posted Mar 13, 2023 - 12:25 SAST