IMPORTANT: networking issues in London

Incident Report for GOV.UK Platform as a Service (PaaS)

Postmortem

Incident Report: networking issues in London on 25 August 2020

Hello users of GOV.UK Platform as a Service,
We apologise for any disruption to you and your users on 25 August due to networking issues caused by a failed cooling system at our hosting provider in London.
We take the reliability of GOV.UK PaaS seriously. In this post, we explain why the incident happened and what is being done to prevent it happening again.
Please email either of our Product Managers, or reach out on Cross-Government Slack, if you have any feedback:
Emily Labram: emily.labram@digital.cabinet-office.gov.uk
Mark Buckley: mark.buckley@digital.cabinet-office.gov.uk

Regards,
The GOV.UK PaaS team

What happened

Starting at 10:05 BST on August 25th, some users experienced connectivity issues for some instances running in a single Availability Zone in our London Region. Instances in our Ireland Region were unaffected.

Users may have been impacted in the following ways:

- Some user requests may have failed
- Users may only have been able to access services intermittently

Your work may have been impacted in the following ways:

- Some requests to GOV.UK PaaS APIs may have failed
- The ability to browse the GOV.UK PaaS admin interface may have been affected by timeouts

By 12:50 BST on August 25th, power and networking connectivity had been restored to the majority of affected instances. After monitoring all our systems to confirm that normal service had resumed, we resolved the incident at 18:08 BST.

Why the incident happened

The incident was caused by a faulty cooling system in one of the data centres used by our hosting provider. The fault caused a number of servers to power themselves down to prevent damage. Instances impacted by the fault were able to fail over to an unaffected Availability Zone, which mitigated some of the disruption.

A number of platform components that use the same hosting environment were also affected but took longer to show that they had been impacted. As a result, these may not have failed over to unaffected instances as quickly. You can contact us to find out which specific components were affected.

What’s being done to prevent this happening again

Since the incident, we have met with our hosting provider, who took us through the root cause of the issue and gave us assurances about what they are doing to fix the faulty cooling system and prevent this happening again.

We are arranging another session to assess how we can discover issues more quickly and what coordination of an incident looks like between our two organisations.

Posted Sep 08, 2020 - 12:56 BST

Resolved

We have received multiple updates from our hosting provider indicating that the cause and symptoms of the networking issues have been almost completely resolved.

Our monitoring suggests that all systems are operational and normal service has resumed. We are considering this incident resolved.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 18:08 BST

Update

Our hosting provider has given another status update indicating that the problem has been addressed further, although complete connectivity has not been restored.

Their message includes the following: "By 12:50 BST, power and networking connectivity had been restored to the majority of affected instances" (timezone adjusted)

Our monitoring suggests that all systems are operational and normal service has resumed. We are monitoring the issue and expect to be able to provide a further update at 18:00 BST.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 16:02 BST

Monitoring

Our hosting provider has given another status update indicating that the problem has been addressed further, although complete connectivity has not been restored.

Our monitoring suggests that all systems are operational and normal service has resumed. We are monitoring the issue and expect to be able to provide further updates at 16:00 BST and 18:00 BST.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 14:03 BST

Update

Our hosting provider has given a status update indicating that the problem has mostly been addressed, although full connectivity has not been restored.

Your users may still experience degraded performance of your applications in the London region.

You may still experience degraded performance of GOV.UK PaaS APIs and the GOV.UK PaaS admin interface.

We are continuing to follow updates and monitor our infrastructure.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 13:31 BST

Identified

Our hosting provider has confirmed that they are experiencing operational issues in their London region.

Your users will still experience degraded performance of your applications in the London region.

You will still experience degraded performance of GOV.UK PaaS APIs and the GOV.UK PaaS admin interface.

We are continuing to follow their updates and monitor our infrastructure.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 11:51 BST

Investigating

Hello,

We are aware of and investigating a problem with GOV.UK PaaS.

Our London region is experiencing networking issues, which we believe are due to our hosting provider.

This may impact your users:
- Some user requests may fail
- Your users may only be able to access your service intermittently

This may impact your work:
- Some of your requests to GOV.UK PaaS APIs may fail
- Your ability to browse the GOV.UK PaaS admin interface may be affected by timeouts

We’re looking into this as a matter of urgency and will update you as soon as we know more.

Regards,

Toby Lorne
SRE @ GOV.UK PaaS team

Posted Aug 25, 2020 - 11:09 BST

This incident affected: London (API - availability of the Cloud Foundry API to tenants, Apps - availability of tenant applications to end users).