Issues with MySQL and Postgres unbinds (P2)
Postmortem

What happened and why

A change to the rds-broker introduced a problem with removing existing bindings between an application and an RDS (MySQL or Postgres) service.

What caused the original problem?
The change to the rds-broker altered the way that usernames and passwords are generated when an application is bound to an RDS service (using SHA256 instead of MD5).

For RDS services that were bound before the change, the unbind failed: the username for the binding had been created in the old way, but the rds-broker tried to delete a (nonexistent) username generated in the new way.
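The mismatch can be illustrated with a minimal sketch. This is not the rds-broker's actual code: the function names, the "u" prefix, the 15-character truncation and the use of the binding ID as hash input are all assumptions made for illustration. It shows that the two hash schemes give different usernames for the same binding, and one possible way an unbind could fall back to the old scheme.

```python
import hashlib


def legacy_username(binding_id):
    # Hypothetical old scheme: username derived from an MD5 digest.
    return "u" + hashlib.md5(binding_id.encode("utf-8")).hexdigest()[:15]


def current_username(binding_id):
    # Hypothetical new scheme: username derived from a SHA256 digest.
    return "u" + hashlib.sha256(binding_id.encode("utf-8")).hexdigest()[:15]


def unbind(existing_users, binding_id):
    """Remove the database user for a binding.

    existing_users is a set standing in for the users present in the
    database. The new scheme is tried first, then the legacy one, so
    bindings created before the change can still be removed.
    """
    for derive in (current_username, legacy_username):
        name = derive(binding_id)
        if name in existing_users:
            existing_users.remove(name)
            return name
    raise LookupError("no database user found for binding " + binding_id)


if __name__ == "__main__":
    binding = "example-binding-id"  # illustrative value only
    users = {legacy_username(binding)}  # a binding created before the change
    print("removed", unbind(users, binding))
```

Without the fallback loop, an unbind that only computes current_username would look for a user that was never created, which is the failure described above.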

Why wasn’t this caught before it reached prod?
Checking for this was not part of our acceptance tests, and there are no straightforward automated tests for errors of this kind. The error was reported by a user.

How users were affected

We were alerted by a ticket from a user that there was a problem with deleting an app.

This did not affect the availability of applications hosted on GOV.UK PaaS.
Tenants who tried to unbind most existing Postgres or MySQL databases would have seen errors. Bindings created after the change, and the unbinding of those bindings, were not affected.

We put updates about this issue on Statuspage, but realised that users weren’t receiving notifications.

How we’ll prevent this from happening again and improve our response

  • We’ll improve our processes for accepting small changes into the platform, to reduce the risk of errors.
  • We’ll document the statefulness of the rds-broker to ensure that any changes made in this area take it into account.
  • We’ll add metrics for any rds-broker managed database instances that cannot be connected to with the master password.
  • We’ll fix our Statuspage notifications so that anyone signed up to receive notifications of issues actually receives them.
Posted May 31, 2018 - 17:25 BST

Resolved
Hello,

We’ve resolved the issue that affected GOV.UK PaaS today, and full service has been restored to development and production applications running on the platform.

Some users were unable to unbind Postgres or MySQL databases from apps.
We have applied a patch to our rds-broker which means that users are no longer affected by this issue.

We’ll now start looking into why and how this happened. In the coming days, we’ll publish an incident report describing the timelines of the event, root cause of the problem, lessons we’ve learned and actions we’ll take to ensure it doesn’t happen again.

I’m sorry for the impact that this has had on your users and the service you provide, and the problems this has caused for your team.

The quickest way to get help using the platform is to email us via gov-uk-paas-support@digital.cabinet-office.gov.uk. To let us know how this incident has affected you, please contact the GOV.UK PaaS product managers: ben.andrews@digital.cabinet-office.gov.uk and jessica.o’leary@digital.cabinet-office.gov.uk.

Regards,

Urmi Shah
GOV.UK PaaS team
Posted May 17, 2018 - 12:59 BST
Identified
Hello,

We are actively working on the issue with unbinding existing Postgres or MySQL databases.

We have established that this will affect the unbinding of most existing Postgres or MySQL databases, but will not affect new bindings (and unbindings which follow this). We are working on a fix and will let you know as soon as we have applied it.

- this should not affect the availability of your service
- this means that until the fix is applied, you may not be able to unbind Postgres or MySQL databases from apps

Fixing this issue is our priority - we know that this may have impacted on the service you provide and we’re doing everything we can to resolve it as quickly as possible.

We’ll continue to update you as we know more, and we’ll let you know as soon as the problem has been resolved.

We’re sorry for the inconvenience that this has caused to your users and your team.

If you need to contact us for help or anything else, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk

Regards,

Urmi Shah
GOV.UK PaaS team
Posted May 17, 2018 - 10:59 BST
Update
Hello,

We are aware of and investigating a problem with GOV.UK PaaS.

Some users are reporting issues with unbinding MySQL and Postgres databases. We think this might also affect binding databases to apps.

We’re looking into this as a matter of urgency and will update you as soon as we know more.

Regards,

Urmi Shah
GOV.UK PaaS team
Posted May 17, 2018 - 10:41 BST
Update
We are continuing to investigate this issue.
Posted May 17, 2018 - 10:25 BST
Investigating
Hello,

We’re investigating a possible problem with GOV.UK PaaS.

A problem has been reported and we’re investigating whether it’s an issue with the platform or with certain services/apps running on it, and whether or not end users are affected.

If your service is experiencing problems and you think it might be related to this issue, please email us via gov-uk-paas-support@digital.cabinet-office.gov.uk.

We will update you as soon as we know more.

Regards,

Urmi Shah
GOV.UK PaaS team
Posted May 17, 2018 - 10:21 BST