Incident:
Our monitoring systems have identified a service disruption in GCP-US environments impacting multiple tenants.
Time of incident:
09th December, Wednesday, 13:25 to 13:33 PST
Impacted Environments:
- test-usg
- prod-usg
- gus-training
- gus-sales
Impacted Services:
API calls, User Interface
Latest Update on 12/09 at 4:30 PM PST
The services are back to Normal. We are investigating the root cause of the incident and revisiting the environment configuration parameters.
Update on 12/09 at 3:10 PM PST
The systems are being monitored closely and the environment parameters for all impacted environments are normal. We are investigating the root cause of the incident.
Timeline of Events:
1. 1:15 PM PST - our Platform health check service detected an increased rate of errors returned by the Data load and API clusters on the health check requests on Reltio Google Cloud
2. 1:20 PM PST - Platform health check service in our Google Cloud failed to receive the health status from the Data load and API clusters of the Reltio Google Cloud and after some attempts considered them down and initiated the auto-recovery procedure
3. Healthcheck Service initiated the rolling restart of the Data load and API clusters
4. By 1:33 PM PST rolling restart of the data load and API clusters was complete and Zabbix reported green status
5. Currently, all clusters remain healthy and green. The team is investigating the root cause of the observed cascade of events.
Comments
Article is closed for comments.