Hello users of GOV.UK Platform as a Service,
We apologise for any disruption to you and your users on 25 August due to networking issues caused by a failed cooling system at our hosting provider in London.
We take the reliability of GOV.UK PaaS seriously. In this post, we explain why the incident happened and what is being done to prevent it from happening again.
Please email either of our Product Managers, or reach out on Cross-Government Slack, if you have any feedback:
Emily Labram: emily.labram@digital.cabinet-office.gov.uk
Mark Buckley: mark.buckley@digital.cabinet-office.gov.uk
Regards,
The GOV.UK PaaS team
Starting at 10:05 BST on 25 August, some users experienced connectivity issues with some instances running in a single Availability Zone in our London Region. Instances in our Ireland Region were unaffected.
Users may have been impacted in the following ways:
Your work may have been impacted in the following ways:
By 12:50 BST on 25 August, power and network connectivity had been restored to the majority of affected instances. After monitoring all our systems to confirm that normal service had resumed, we resolved the incident at 18:08 BST.
The incident was caused by a faulty cooling system in one of our hosting provider's data centres. The fault caused a number of servers to power themselves down to prevent damage. Instances impacted by the fault were able to fail over to an unaffected Availability Zone, which mitigated some of the disruption.
A number of platform components that use the same hosting environment were also affected, but took longer to show that they had been impacted. As a result, these may not have failed over to unaffected instances as quickly. You can contact us to find out which specific components were affected.
Since the incident, we have met with our hosting provider, who took us through the root cause of the issue and gave us assurances about what they are doing to fix the faulty cooling system and prevent this from happening again.
We are arranging another session to assess how we can discover issues more quickly and how our two organisations should coordinate during an incident.