This post outlines a recent issue with the hosting of GOV.UK Platform as a Service (PaaS) and how it was resolved. We apologise for any disruption to you and your users on 01 February 2021 due to networking issues at our hosting provider in London.
Starting at 11:50 on 01 February 2021, some of our users experienced connectivity issues with some instances running in a single Availability Zone in our London Region. Instances in our Ireland Region were unaffected.
Users may have been impacted in the following ways:
Your work may have been impacted in the following ways:
By 16:29 on 01 February 2021, network connectivity had been restored to the majority of affected instances. After checking and monitoring all our systems to confirm that normal service had resumed, we resolved the incident at 14:01 on 02 February 2021.
One of our hosting provider’s data centres experienced a rapid rise in temperature due to the failure of its cooling system. This caused equipment to overheat and power down, which resulted in the partial outage of an Availability Zone in the London Region. Some instances impacted by the fault were able to fail over to an unaffected Availability Zone, which mitigated some of the disruption.
As a team, we are investigating how we can respond to intermittent Availability Zone failures in the future, with the aim of restoring connectivity to services more quickly.
We’re also reviewing our infrastructure alerting to determine whether we need to tune our monitoring to improve our incident response processes.
We will continue to meet with our hosting provider to seek assurance that they are taking action to prevent this from happening again.