Partial network outage in AWS London region
Postmortem

This post outlines a recent issue with the hosting of GOV.UK Platform as a Service (PaaS) and how it was resolved. We apologise for any disruption to you and your users on 01 February 2021 due to networking issues at our hosting provider in London.

What happened

Starting at 11:50 on 01 February 2021, some of our users experienced connectivity issues for some instances running in a single Availability Zone in our London Region. Instances in our Ireland Region were unaffected. 

Users may have been impacted in the following ways:

  • Some user requests may have failed
  • Users may have only been able to access services intermittently

Your work may have been impacted in the following ways:

  • Partial network outage affecting the availability of tenant applications
  • Elevated rates of 5XX HTTP errors and elevated latency

By 16:29 on 01 February 2021, network connectivity had been restored to the majority of affected instances. After checking and monitoring all our systems to ensure normal service had resumed, we resolved the incident at 14:01 on 02 February 2021.

Why the incident happened

One of our hosting provider’s data centres experienced a rapid rise in temperature due to the failure of its cooling system. This caused hardware to overheat and power down, which resulted in the partial outage of an Availability Zone in the London Region. Some instances impacted by the fault were able to fail over to an unaffected Availability Zone, which mitigated some of the disruption.

What’s being done to prevent this happening again

As a team we are investigating how we can respond to intermittent Availability Zone failure in the future, with the aim of restoring connectivity to services more quickly.
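To illustrate why losing one Availability Zone can cause contention in the others, here is a minimal sketch of the arithmetic. It assumes load is spread evenly across zones and redistributes evenly onto the survivors. The function names, the three-zone layout and the utilisation figures are hypothetical and do not describe the platform's actual capacity.

```python
def surviving_zone_utilisation(zones: int, failed: int, utilisation: float) -> float:
    """Per-zone utilisation after the load of failed zones moves to the survivors.

    Assumes every zone carried the same load before the failure and that the
    displaced load is shared equally by the remaining zones.
    """
    if not 0 < failed < zones:
        raise ValueError("failed must be at least 1 and less than zones")
    return utilisation * zones / (zones - failed)


def max_safe_utilisation(zones: int, failed: int, ceiling: float = 1.0) -> float:
    """Highest pre-failure utilisation that keeps survivors at or below the ceiling."""
    if not 0 < failed < zones:
        raise ValueError("failed must be at least 1 and less than zones")
    return ceiling * (zones - failed) / zones


# Hypothetical figures: three zones, each 60% busy, one zone lost.
after_failure = surviving_zone_utilisation(zones=3, failed=1, utilisation=0.6)
headroom_limit = max_safe_utilisation(zones=3, failed=1)
```

Under these assumptions, each surviving zone would run at about 90% utilisation, and zones would need to stay at or below roughly 67% in normal operation to absorb the loss of one zone.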

We’re also reviewing our infrastructure alerting to determine if we need to tune monitoring to improve our incident response processes.
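As an illustration of the kind of alerting rule such a review might consider, the sketch below tracks the proportion of 5XX responses over a sliding window of recent requests. The class name, window size and threshold are assumptions chosen for the example; they are not taken from the platform's monitoring configuration.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5XX responses among the most recent requests."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # deque with maxlen discards the oldest status once the window is full.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status: int) -> None:
        """Add one HTTP status code to the window."""
        self.statuses.append(status)

    def error_rate(self) -> float:
        """Fraction of recorded statuses in the 500-599 range."""
        if not self.statuses:
            return 0.0
        errors = sum(1 for s in self.statuses if 500 <= s <= 599)
        return errors / len(self.statuses)

    def should_alert(self) -> bool:
        """Alert only when the window is full and the rate exceeds the threshold."""
        window_full = len(self.statuses) == self.statuses.maxlen
        return window_full and self.error_rate() > self.threshold


# Hypothetical usage: a 10-request window with a 20% threshold.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for status in [200] * 8 + [503] * 2:
    monitor.record(status)
alert_at_threshold = monitor.should_alert()  # rate equals the threshold, so no alert
monitor.record(503)                          # oldest 200 drops out of the window
alert_above_threshold = monitor.should_alert()
```

Waiting for a full window avoids alerting on the first few requests after a restart, at the cost of a slower first alert. A real rule would need tuning against normal traffic.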

We will continue to meet with our hosting provider to be assured that they are taking action to prevent this happening again.

Posted Feb 24, 2021 - 16:44 GMT

Resolved
Hello,

The underlying issue with the availability zone has been resolved and we have completed our remediation work.

We’ll now start looking into why and how this happened. In the coming days, we’ll publish an incident report describing the timelines of the event, root cause of the problem, lessons we’ve learned and actions we can take to mitigate the impact.

I’m sorry for the impact that this has had on your users and the service you provide, and the problems this has caused for your team.

Making sure GOV.UK PaaS is constantly available and robust is our priority and we’ll be doing everything we can to minimise the possibility of outage in the future.

The quickest way to get help using the platform is to email us via gov-uk-paas-support@digital.cabinet-office.gov.uk. To let us know how this incident has affected you, please contact the GOV.UK PaaS product manager: mark.buckley@digital.cabinet-office.gov.uk.

Regards,

Paul Dougan
GOV.UK PaaS team
Posted Feb 02, 2021 - 14:01 GMT
Update
Hello,

The underlying issue with the availability zone has been resolved and we are completing remediation work on a small proportion of affected hardware, as recommended by our supplier.

We will notify you when this work is completed.

Once again, we’re sorry for the inconvenience that this has caused to you and your users.

If you need to contact us for help or anything else, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk

Regards,

Paul Dougan
GOV.UK PaaS team
Posted Feb 02, 2021 - 12:03 GMT
Update
Hello,

We are monitoring the issue of the partial network outage in AWS London region and have seen a significant decrease in 5XX error rates.

We’ll continue to monitor and, during working hours, we’ll let you know when the problem has been resolved.

Once again, we’re sorry for the inconvenience that this has caused to you and your users.

If you need to contact us for help or anything else, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk

Regards,

Paul Dougan
GOV.UK PaaS team
Posted Feb 01, 2021 - 16:57 GMT
Monitoring
Hello,

We are still working to resolve the partial network outage in the AWS London region.

Since our last update, we have discovered that:

- some tenant workloads have been redistributed to the working availability zones
- we are experiencing CPU contention in those availability zones

We are continuing to work with AWS. The root cause has been resolved and recovery is under way. We expect that AWS will recover this faster than we are able to scale the platform.

Once again, we’re sorry for the inconvenience that this has caused to you and your users.

If you need to contact us for help or anything else, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk

Regards,

Paul Dougan
GOV.UK PaaS team
Posted Feb 01, 2021 - 14:44 GMT
Update
We are continuing to investigate this issue.
Posted Feb 01, 2021 - 13:07 GMT
Investigating
Hello,

We are aware of and investigating a problem with GOV.UK PaaS in London.

AWS are reporting connectivity issues that are affecting their eu-west-2 London region.

You may be experiencing one or more of the following:

- elevated timeouts
- your website/service is down and unavailable to your users
- some user requests may fail
- your users can only access your service intermittently
- intermittent API availability which means you may not be able to update/access your service
- a high number of error messages

We are monitoring the situation at AWS and will provide an update when we have more information to share.

Regards,

Paul Dougan
GOV.UK PaaS team
Posted Feb 01, 2021 - 12:46 GMT
This incident affected: London (API - availability of the Cloud Foundry API to tenants, Apps - availability of tenant applications to end users, CDN Route Backing service - availability of platform-provided services, Elasticsearch Backing Service - availability of platform-provided services via Aiven, Logging - availability of app logs to external logging services, Metrics - availability of platform-provided metrics, MySQL Backing service - availability of platform-provided services, Postgres Backing service - availability of platform-provided services, Redis Elasticache Backing service - availability of platform-provided services).