We’ve resolved the issue that affected GOV.UK PaaS today, and full service has been restored to development and production applications running on the platform.
The routers in the production environment exhausted their available memory, causing them to provide a degraded service. Eventually the router processes were stopped by the operating system's out-of-memory killer. Once all of the routers had restarted, normal service was restored.
Tenant applications experienced slow responses and occasional HTTP 502 (Bad Gateway) errors. These will have been visible to end users. The incident lasted around 5.5 hours (from 04:45 to 10:22).
We’ll now start looking into why and how this happened. In the coming days, we’ll publish an incident report describing the timeline of the event, the root cause of the problem, the lessons we’ve learned and the actions we’ll take to ensure it doesn’t happen again.
I’m sorry for the impact that this has had on your users and the service you provide, and the problems this has caused for your team.
We identified that some of our routers had exhausted their allocated memory. The operating system’s out-of-memory killer eventually killed the routers, causing them to restart. During this time, users will have experienced slow requests.
We expect apps and the API to be performing normally again now. We will continue to monitor the situation.
We’re investigating a possible problem with GOV.UK PaaS.
A problem has been reported and we’re investigating whether it’s an issue with the platform or with certain services or apps running on it, and whether end users are affected.