Dear users of GOV.UK Platform as a Service,
We apologise again for the disruption to you and your users on Tuesday 21 April due to a prolonged outage of our platform.
We take the reliability of GOV.UK PaaS extremely seriously. We know how much was at stake for your users during this outage, particularly in these challenging times.
In this post, we explain why the incident happened, what we are doing to prevent it recurring, and how we will improve our response to any future incidents.
Thank you to those who have given us feedback so far. We welcome your thoughts on the actions we have committed to below.
Please email either of our Product Managers, or reach out on Cross-Government Slack:
Emily Labram: email@example.com
Mark Buckley: firstname.lastname@example.org
The GOV.UK PaaS team
At 5.12pm on Tuesday 21 April, GOV.UK PaaS had a critical outage in both regions (Ireland and London), affecting all tenant services.
Engineers immediately began investigating the issue and the first tenant services came back online almost 5 hours later, at 10.02pm. Services continued to come back online until 12.45am on Wednesday.
End users were unable to access any services hosted on GOV.UK PaaS. Platform users were unable to make any changes via the platform interface or command line.
We perform certificate rotation every six months as a routine part of security maintenance for the platform. An automated job rotates core platform security certificates as they approach their expiration date.
The incident was caused by a bug in CredHub, the application we use to generate security certificates. The bug caused CredHub to attach the wrong Authority Key Identifier (AKI) to the new certificate: the AKI should have matched the certificate authority’s Subject Key Identifier (SKI), but CredHub instead generated one from the certificate authority’s public key. The incorrect AKI meant that the certificate authority was not considered a valid issuer for the generated certificate.
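To illustrate the mismatch, the sketch below compares a generated certificate’s AKI with the issuing certificate authority’s SKI using Python’s cryptography library. This is an illustration only, not CredHub’s own code, and the file names are placeholders.

```python
# Sketch: check that a generated certificate's Authority Key Identifier (AKI)
# matches the Subject Key Identifier (SKI) of the certificate authority that
# issued it. File names are placeholders.
from cryptography import x509

with open("ca.pem", "rb") as f:
    ca = x509.load_pem_x509_certificate(f.read())
with open("generated-cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

ca_ski = ca.extensions.get_extension_for_class(
    x509.SubjectKeyIdentifier
).value.digest
cert_aki = cert.extensions.get_extension_for_class(
    x509.AuthorityKeyIdentifier
).value.key_identifier

if cert_aki != ca_ski:
    raise SystemExit(
        f"AKI {cert_aki.hex()} does not match the CA's SKI {ca_ski.hex()}"
    )
print("AKI matches the CA's SKI")
```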
GOV.UK PaaS’s core platform components use a library called OpenSSL to verify certificate trust chains. CredHub considered the trust chains of the certificates it generated to be valid, but OpenSSL did not.
As a result, core components in the platform no longer trusted each other, no traffic could pass from the routers to the backend cells, and the orchestration platform could not determine the health of the applications running on the platform.
We didn’t find out about the bug when testing the certificate rotation process in our development environments. This was because the bug only affects certificates generated from certificate authorities imported into CredHub, which only happens in long-lived environments such as production and staging. The development environments of the engineers working on the certificate rotation process were using freshly generated certificate authorities and certificates.
We didn’t catch the bug when the staging and production pipelines generated new certificates for the rotation process. We hadn’t set up a test in either pipeline to check that CredHub had generated certificates with valid trust chains. This is because we assumed CredHub would do this correctly. Because of this, the pipelines did not prevent the rotation.
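The check we have now committed to adding (see “Next 3 weeks” below) is along these lines. The sketch below is illustrative only, not our actual pipeline code: it asks OpenSSL to verify a rotated certificate against its certificate authority, and stops the pipeline if the trust chain is rejected. The file paths are placeholders.

```python
# Sketch: a pipeline gate that halts the rotation if OpenSSL rejects the
# trust chain of a newly generated certificate. Paths are placeholders.
import subprocess
import sys


def chain_is_valid(ca_path: str, cert_path: str) -> bool:
    """Return True if OpenSSL accepts cert_path as issued by ca_path."""
    result = subprocess.run(
        ["openssl", "verify", "-CAfile", ca_path, cert_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if not chain_is_valid("rotated-ca.pem", "rotated-cert.pem"):
    # Stop before the broken certificates reach staging or production.
    sys.exit("Trust chain verification failed; halting certificate rotation")
```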
We didn’t discover the bug in CredHub prior to the incident because we didn’t hear about it on either the Cloud Foundry blog or the community channels.
We now know that a Cloud Foundry Foundation contributor raised it as a user story on CredHub’s public roadmap in August 2019. The Cloud Foundry Foundation released a fix in a beta release on 17 April 2020, without mentioning it in the release notes. We upgrade CredHub to final releases, rather than beta releases, for reliability reasons. The foundation made a final release available on 22 April, the day after our incident.
The issue affected services in both regions because we have typically deployed all changes to both regions simultaneously. As a result, both regions’ certificates had originally been generated on the same day, and so came due for rotation on the same day.
Now that we have updated the platform’s security certificates, our next rotation is due to happen in five months. In the meantime, we are committing to these actions:
Next 3 weeks
Upgrade CredHub to a more recent version that fixes the bug that caused the incident.
Include a test of the certificate chain of trust in our deployment pipeline and prevent the pipeline from continuing if this test fails. This test will validate certificates using OpenSSL.
Test our automatic certificate rotation job thoroughly, and run it in staging and development environments more frequently.
Next 3 months
Look into improving the error messages shown to end users when the platform is unavailable.
Find easier, quicker ways to practise incident response and recovery. Practise more types of incidents and fixes, more often.
Consider even safer, more progressive ways to deploy wide-reaching platform changes.
During the incident, most end users would have seen a ‘404 - not found’ error message. This is the default behaviour of Cloud Foundry, but we are aware it would have been unhelpful for end users. We will be looking into whether it is possible to improve these error messages.
We classified the incident as “Major” when it was in fact the highest severity, “Critical”. We’ll rehearse our incident communication process in future so that we classify severity accurately.
It took us over two hours to realise that a platform-wide fix would be necessary. If we had monitored not just the date validity of certificates but also the certificate chain of trust, we would have realised the scale of the impact much faster. This would have prompted us to choose a platform fix more quickly. Better testing will help to prevent this in future.
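As an illustration of the difference, the sketch below checks a certificate for both an approaching expiry date and a broken chain of trust, using the same OpenSSL verification as the pipeline check above. It is not our monitoring code; the paths and the warning threshold are placeholders.

```python
# Sketch: monitor certificates for approaching expiry and for a broken chain
# of trust, rather than expiry dates alone. Paths and threshold are placeholders.
import datetime
import subprocess

from cryptography import x509

WARN_BEFORE = datetime.timedelta(days=30)


def certificate_problems(ca_path, cert_path):
    """Return a list of problems found with the certificate at cert_path."""
    problems = []

    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    # Date validity: warn well before the expiration date.
    if cert.not_valid_after - datetime.datetime.utcnow() < WARN_BEFORE:
        problems.append(f"{cert_path} expires at {cert.not_valid_after}")

    # Chain of trust: the check that would have shown the scale of the impact.
    verify = subprocess.run(
        ["openssl", "verify", "-CAfile", ca_path, cert_path],
        capture_output=True,
        text=True,
    )
    if verify.returncode != 0:
        problems.append(f"{cert_path} failed chain-of-trust verification")

    return problems
```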
We were also slow to consider a fix that would, in normal circumstances, cause downtime for users. In hindsight, we should have drilled these aggressive fixes during our incident rehearsals. From now on, we will ensure we practise more types of incidents and fixes, more often.
We deployed the fix in our usual way to our Ireland region, and then realised we could do it faster in our London region. Incident rehearsal will also help us remember these faster deployment methods, so we can restore service to both regions more quickly.
Tuesday 21 April 2020
Following a routine rotation of security certificates, we detected errors on the GOV.UK PaaS platform.
17:15 - 19:55
From 17:18, services hosted on the PaaS started reporting outages. Engineers started by investigating the trust between the different certificates used on the platform. Due to a lack of recent rehearsal and limited visibility of which certificates were affected, we initially focused on determining which certificate was being rejected, before realising that multiple certificates were affected and a platform-wide fix would be needed.
We took the decision to force a rotation of all certificates across the whole platform. This replaced the faulty certificates, and re-established trust relationships between the platform components.
Over the next few hours we monitored progress as the platform and the services it hosts started to come back online.
Wednesday 22 April 2020
We restored full service.