Issues with MySQL and Postgres binding existing service to new apps (P3)
Postmortem

What happened and why

We released a new version of the rds-broker, which surfaced an error when binding an existing RDS service (MySQL or Postgres) to an app that it was not previously bound to.

What caused the original problem?
There was an existing bug in the rds-broker: when we changed how we generated root passwords, the broker did not update the root password on all the services.
The bug surfaced during this release because we hadn’t used root password rotation recently.

Why wasn’t this caught before it reached prod?
Checking for this was not part of our acceptance tests, and there are no straightforward automated tests for errors of this kind. The error was found by chance when a team member checked whether binding and unbinding worked for existing RDS services on production.

How users were affected

This did not affect the availability of applications hosted on GOV.UK PaaS. Users (tenants) of the platform would have seen errors if they had tried to bind an existing RDS database to an app that it was not previously bound to. We don’t believe any tenants were actually affected.

How we’ll prevent this from happening again and improve our response

  • We’ll document the statefulness of the rds-broker to ensure that any changes made to this component take it into account.
  • We’ll add metrics for any rds-broker-managed database instances that cannot be connected to with the master password.
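
As an illustration of the second action, here is a minimal sketch of such a check in Python. It is not the rds-broker's implementation: the `Instance` shape, the injected `can_connect` probe and the metric name are all hypothetical, and the real probe would need to derive each instance's expected master password and attempt a login.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Instance:
    """A broker-managed database instance (hypothetical shape)."""
    instance_id: str
    engine: str  # e.g. "postgres" or "mysql"


def unreachable_instances(
    instances: Iterable[Instance],
    can_connect: Callable[[Instance], bool],
) -> List[str]:
    """Return the IDs of instances the master password cannot open.

    `can_connect` is an injected, hypothetical probe. Any exception it
    raises is treated as a failed connection.
    """
    failed = []
    for instance in instances:
        try:
            ok = can_connect(instance)
        except Exception:
            ok = False
        if not ok:
            failed.append(instance.instance_id)
    return failed


def as_metric(failed_ids: List[str]) -> dict:
    """Shape the result as a gauge-style metric (the name is illustrative)."""
    return {
        "name": "rds_broker.master_password_unreachable",
        "value": len(failed_ids),
    }
```

Injecting the probe keeps the counting logic testable without a real database. A non-zero value would flag instances whose stored password no longer matches the one the broker expects.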
Posted May 31, 2018 - 17:15 BST

Resolved
Hello,

We’ve resolved the issue that affected GOV.UK PaaS today, and full service has been restored to development and production applications running on the platform.

The issue was that some services could have experienced failures if they tried to bind an existing RDS database (MySQL or Postgres) to an app that it was not previously bound to.

We have tested and applied a fix to the platform.

In the coming days, we’ll publish an incident report describing the timeline of the event, the root cause of the problem, the lessons we’ve learned and the actions we’ll take to ensure it doesn’t happen again.

We are not aware of any services that have been impacted. I’m sorry if there has been an impact on your users and the service you provide, and for any problems this has caused your team.

The quickest way to get help using the platform is to email us via gov-uk-paas-support@digital.cabinet-office.gov.uk. To let us know how this incident has affected you, please contact the GOV.UK PaaS product managers: ben.andrews@digital.cabinet-office.gov.uk and jessica.o’leary@digital.cabinet-office.gov.uk.

Regards,

Urmi Shah, Delivery Manager
GOV.UK PaaS team
Posted May 16, 2018 - 15:45 BST
Update
Hello,

We are aware of and investigating a problem with GOV.UK PaaS.

There is an issue which could affect users of your service. Some services will experience failures if they try to bind an existing RDS database (MySQL or Postgres) to an app that it was not previously bound to.

We have identified a fix, and will update you as soon as we know more.

Regards,

Urmi Shah, Delivery Manager
GOV.UK PaaS team
Posted May 16, 2018 - 14:51 BST
Investigating
Hello,

We’re investigating a possible problem with GOV.UK PaaS.

A problem has been reported and we’re investigating whether it’s an issue with the platform or with certain services/apps running on it, and whether or not end users are affected.

If your service is experiencing problems and you think it might be related to this issue, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk.

We will update you as soon as we know more.

Regards,

Urmi Shah
GOV.UK PaaS team
Posted May 16, 2018 - 14:37 BST