Dear users of GOV.UK Platform as a Service,
We apologise again for the disruption to you and your users on Tuesday 21 April due to a prolonged outage of our platform.
We take the reliability of GOV.UK PaaS extremely seriously. We know how much was at stake for your users during this outage, particularly in these challenging times.
In this post, we explain why the incident happened, what we are doing to prevent it recurring, and how we will improve our response to any future incidents.
Thank you to those who have given us feedback so far. We welcome your thoughts on the actions we have committed to below.
Please email either of our Product Managers, or reach out on Cross-Government Slack:
Emily Labram: email@example.com
Mark Buckley: firstname.lastname@example.org
The GOV.UK PaaS team
At 5.12pm on Tuesday 21 April, GOV.UK PaaS had a critical outage in both regions (Ireland and London), affecting all tenant services.
Engineers immediately began investigating the issue and the first tenant services came back online almost 5 hours later, at 10.02pm. Services continued to come back online until 12.45am on Wednesday.
End users were unable to access any services hosted on GOV.UK PaaS. Platform users were unable to make any changes via the platform interface or command line.
We perform certificate rotation every six months as a routine part of security maintenance for the platform. An automated job rotates core platform security certificates as they approach their expiration date.
The incident was caused by a bug in CredHub, the application we use to generate security certificates. The bug caused CredHub to attach the wrong Authority Key Identifier (AKI) to the new certificate: the AKI should have matched the certificate authority’s Subject Key Identifier (SKI), but CredHub instead generated one from the certificate authority’s public key. The incorrect AKI meant that the certificate authority was not considered a valid issuer for the generated certificate.
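To illustrate the mismatch, the sketch below compares a generated certificate’s AKI with the issuing certificate authority’s SKI using Python’s cryptography library. This is an illustration only, not CredHub’s own code, and the file names are placeholders.

```python
# Sketch: check that a generated certificate's Authority Key Identifier (AKI)
# matches the Subject Key Identifier (SKI) of the certificate authority that
# issued it. File names are placeholders.
from cryptography import x509

with open("ca.pem", "rb") as f:
    ca = x509.load_pem_x509_certificate(f.read())
with open("generated-cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

ca_ski = ca.extensions.get_extension_for_class(
    x509.SubjectKeyIdentifier
).value.digest
cert_aki = cert.extensions.get_extension_for_class(
    x509.AuthorityKeyIdentifier
).value.key_identifier

if cert_aki != ca_ski:
    raise SystemExit(
        f"AKI {cert_aki.hex()} does not match the CA's SKI {ca_ski.hex()}"
    )
print("AKI matches the CA's SKI")
```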
GOV.UK PaaS’s core platform components use a library called OpenSSL to verify certificate trust chains. CredHub considered the trust chains of the certificates it generated to be valid, but OpenSSL did not.
As a result, core components in the platform no longer trusted each other, no traffic could pass from the routers to the backend cells, and the orchestration platform could not determine the health of the applications running on the platform.
We didn’t find out about the bug when testing the certificate rotation process in our development environments. This was because the bug only affects certificates generated from certificate authorities imported into CredHub, which only happens in long-lived environments such as production and staging. The development environments of the engineers working on the certificate rotation process were using freshly generated certificate authorities and certificates.
We didn’t catch the bug when the staging and production pipelines generated new certificates for the rotation process. We hadn’t set up a test in either pipeline to check that CredHub had generated certificates with valid trust chains. This is because we assumed CredHub would do this correctly. Because of this, the pipelines did not prevent the rotation.
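The check we have now committed to adding (see “Next 3 weeks” below) is along these lines. The sketch below is illustrative only, not our actual pipeline code: it asks OpenSSL to verify a rotated certificate against its certificate authority, and stops the pipeline if the trust chain is rejected. The file paths are placeholders.

```python
# Sketch: a pipeline gate that halts the rotation if OpenSSL rejects the
# trust chain of a newly generated certificate. Paths are placeholders.
import subprocess
import sys


def chain_is_valid(ca_path: str, cert_path: str) -> bool:
    """Return True if OpenSSL accepts cert_path as issued by ca_path."""
    result = subprocess.run(
        ["openssl", "verify", "-CAfile", ca_path, cert_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if not chain_is_valid("rotated-ca.pem", "rotated-cert.pem"):
    # Stop before the broken certificates reach staging or production.
    sys.exit("Trust chain verification failed; halting certificate rotation")
```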
We didn’t discover the bug in CredHub prior to the incident because we didn’t hear about it on either the Cloud Foundry blog or the community channels.
We now know that a Cloud Foundry Foundation contributor raised it as a user story on CredHub’s public roadmap in August 2019. The Cloud Foundry Foundation released a fix in a beta release on 17 April 2020, without mentioning it in the release notes. We upgrade CredHub to final releases, rather than beta releases, for reliability reasons. The foundation made a final release available on 22 April, the day after our incident.
The issue affected services in both regions because we have typically deployed all changes to both regions simultaneously. As a result, both regions’ certificates had originally been generated on the same day, and so came due for rotation on the same day.
Now that we have updated the platform’s security certificates, our next rotation is due to happen in five months. In the meantime, we are committing to these actions:
Next 3 weeks
Upgrade CredHub to a more recent version that fixes the bug that caused the incident.
Include a test of the certificate chain of trust in our deployment pipeline and prevent the pipeline from continuing if this test fails. This test will validate certificates using OpenSSL.
Test our automatic certificate rotation job thoroughly, and run it in staging and development environments more frequently.
Next 3 months
Look into improving the error messages shown to end users when the platform is unavailable.
Find easier, quicker ways to practise incident response and recovery. Practise more types of incidents and fixes, more often.
Consider even safer, more progressive ways to deploy wide-reaching platform changes.
During the incident, most end users would have seen a ‘404 - not found’ error message. This is the default behaviour of Cloud Foundry, but we are aware it would have been unhelpful for end users. We will be looking into whether it is possible to improve these error messages.
We classified the incident as “Major” when it was in fact the highest severity, “Critical”. We’ll rehearse our incident communication process in future so that we classify severity accurately.
It took us over two hours to realise that a platform-wide fix would be necessary. If we had monitored not just the date validity of certificates but also the certificate chain of trust, we would have realised the scale of the impact much faster. This would have prompted us to choose a platform fix more quickly. Better testing will help to prevent this in future.
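As an illustration of the difference, the sketch below checks a certificate for both an approaching expiry date and a broken chain of trust, using the same OpenSSL verification as the pipeline check above. It is not our monitoring code; the paths and the warning threshold are placeholders.

```python
# Sketch: monitor certificates for approaching expiry and for a broken chain
# of trust, rather than expiry dates alone. Paths and threshold are placeholders.
import datetime
import subprocess

from cryptography import x509

WARN_BEFORE = datetime.timedelta(days=30)


def certificate_problems(ca_path, cert_path):
    """Return a list of problems found with the certificate at cert_path."""
    problems = []

    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    # Date validity: warn well before the expiration date.
    if cert.not_valid_after - datetime.datetime.utcnow() < WARN_BEFORE:
        problems.append(f"{cert_path} expires at {cert.not_valid_after}")

    # Chain of trust: the check that would have shown the scale of the impact.
    verify = subprocess.run(
        ["openssl", "verify", "-CAfile", ca_path, cert_path],
        capture_output=True,
        text=True,
    )
    if verify.returncode != 0:
        problems.append(f"{cert_path} failed chain-of-trust verification")

    return problems
```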
We were also slow to consider a fix that would, in normal circumstances, cause downtime for users. In hindsight, we should have drilled these aggressive fixes during our incident rehearsals. From now on, we will ensure we practise more types of incidents and fixes, more often.
We deployed the fix in our usual way to our Ireland region, and then realised we could do it faster in our London region. Incident rehearsal will also help us remember these faster deployment methods, so we can restore service to both regions more quickly.
Tuesday 21 April 2020
Following a routine rotation of security certificates, we detected errors on the GOV.UK PaaS platform.
17:15 - 19:55
From 17:18, services hosted on the PaaS started reporting outages. Engineers started by investigating the trust between the different certificates used on the platform. Due to a lack of recent rehearsal and limited visibility of which certificates were affected, we initially focused on determining which certificate was being rejected, before realising that multiple certificates were affected and a platform-wide fix would be needed.
We took the decision to force a rotation of all certificates across the whole platform. This replaced the faulty certificates, and re-established trust relationships between the platform components.
Over the next few hours we monitored progress as the platform and the services it hosts started to come back online.
Wednesday 22 April 2020
We restored full service.