Problems with MySQL Disaster Recovery

I recently had a call from a company that had suffered several days of downtime due to a MySQL database failure. The main server had a hardware fault which prevented it from booting. Their infrastructure consisted of a MySQL Master/Slave replication set-up, which they considered adequate for disaster recovery (D/R). However, as was plain for them to see, it was not. This got me thinking about the complexities of recovering from a MySQL Slave, and how it's not a D/R solution at all.

MySQL Master/Slave is only a data replication solution, not D/R.

MySQL Master/Slave is relatively easy to set up in a few small steps, and hey presto, you have a second database server holding a copy of your live database from which you can read data. However, one common mistake is to think that all transactions are copied from the Master and that the replication is robust. This is not the case. The Slave can quickly lose replication, and it does not generate any kind of warning when it does. Even if replication is working fine, let's walk through a scenario of a Master failure.
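Because the Slave won't warn you when replication breaks, it's worth scripting a basic health check yourself. Here's a minimal sketch: the `Slave_IO_Running`, `Slave_SQL_Running` and `Seconds_Behind_Master` fields are the real ones reported by MySQL's `SHOW SLAVE STATUS\G`, but the parsing script around them is illustrative (it assumes you've captured that command's raw text output, e.g. via the mysql CLI):

```python
def parse_slave_status(raw: str) -> dict:
    """Parse the 'Key: Value' lines of SHOW SLAVE STATUS\\G output."""
    status = {}
    for line in raw.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status


def replication_healthy(status: dict, max_lag_seconds: int = 60) -> bool:
    """Healthy only if both replication threads run and lag is bounded."""
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master", "NULL")
    # Seconds_Behind_Master reads "NULL" when the SQL thread is broken.
    return lag != "NULL" and int(lag) <= max_lag_seconds
```

Run something like this from cron and alert the moment it returns `False` — it's crude, but it's far better than discovering days later that the Slave silently stopped applying transactions.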

Without warning the Master server fails, through either a hardware or software fault, and some home-brew script alerts you or even switches the Slave to be the Master. The first thing to consider is how your application handles the change. If you're using some kind of virtual IP, you may have scripted the move to the Slave (and remembered to flush the ARP cache on the switch?). Otherwise, you've changed the MySQL Master IP on each of the application servers.
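That "change the Master IP on each application server" step is exactly the kind of thing that ends up in a home-brew script. A sketch of the idea, assuming a made-up key=value config file with a `db_host` entry — substitute whatever configuration mechanism your application actually uses:

```python
def repoint_db_host(config_text: str, new_master: str) -> str:
    """Rewrite the (hypothetical) db_host line to point at the promoted Slave."""
    lines = []
    for line in config_text.splitlines():
        if line.startswith("db_host="):
            line = f"db_host={new_master}"
        lines.append(line)
    return "\n".join(lines)
```

The point of the sketch is the fragility: every application server must be touched, the change must be atomic from the application's point of view, and a server that was down during failover will happily come back pointing at the dead Master.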

The worst case is that replication broke some time ago and you've only just noticed. Now you have to recover from the last dump, which may or may not be up to date. In this case you've got to get the old Master back up ASAP, run the application from an old data set, and then try to merge the data back into the old Master once it comes online. A nightmare situation.

So your old Slave is now the new Master, but your site is running slowly because the Slave was only ever acting as a read-only server, and you can't take any dumps because that would lock the tables and the application would grind to a halt. This is the best-case scenario.

The question now is what to do with the old Master. If it's the same specification as the Slave, then it makes sense to use the old Master as the new Slave. If you do this, remember that your home-brew failover scripts may need amending and your application connections changing. However, in order to make the old Master the new Slave, you need to set up replication from scratch again. It gets even more complicated if you have more than one Slave, as there's no way of making an existing Slave a Slave of the new Master without re-dumping the database and setting replication up again.
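"Setting up replication from scratch" means restoring a fresh dump on the new Slave and then issuing MySQL's `CHANGE MASTER TO` statement with the binlog coordinates recorded at dump time. The statement and its options are real MySQL syntax; the helper that assembles it, and the host/user/coordinate values, are placeholder examples:

```python
def change_master_statement(host: str, user: str, password: str,
                            log_file: str, log_pos: int) -> str:
    """Build the CHANGE MASTER TO statement that re-seeds a Slave."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', "
        f"MASTER_USER='{user}', "
        f"MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos};"
    )
```

Every extra Slave means repeating this dance: dump, copy, restore, `CHANGE MASTER TO`, `START SLAVE` — all while the new Master is carrying the full production load.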

With Continuent Tungsten, all of this is alleviated: failover is completely automatic, introducing new Slaves is automatic, promoting Slaves to Master is automatic, and there are no changes to your application code.

This entry was posted in Managed Hosting.