MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It
Updated: September 16th, 2012
Slave delay can be a nightmare. I battle it every day and know plenty of people who curse the serialization problem of replication. For those who are not familiar with it, replication on MySQL slaves runs commands in series – one by one, while the master may run them in parallel. This fact usually causes bottlenecks. Consider these 2 examples:
- Between 1 and 100 UPDATE queries are constantly running on the master in parallel. If the slave IO is only fast enough to handle 50 of them without lagging, then as soon as 51 start running, the slave starts to lag.
- A more common problem is when one query takes an hour to run (let's say, it's an UPDATE with a big WHERE clause that doesn't use an index). In this case, the query runs on the master for an hour, which isn't a big problem because it doesn't block other queries. However, when the query moves over to the slaves, all of them start to lag because it plugs up the single replication thread.
If the master can crunch more INSERTs/UPDATEs after an upgrade to a better machine, then replication will fall behind even faster.
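To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes constant write rates on both sides, which real workloads never have, and the function name and parameters are made up for illustration. If the master produces events faster than the single replication thread can apply them, the lag grows linearly with time:

```python
def estimated_lag_seconds(elapsed_s, master_rate, slave_rate):
    """Approximate replication lag after elapsed_s seconds of sustained load.

    master_rate: write events per second produced by the master (assumed constant)
    slave_rate:  write events per second the single slave thread can apply
    """
    if master_rate <= 0 or slave_rate <= 0:
        raise ValueError("rates must be positive")
    if slave_rate >= master_rate:
        return 0.0  # the slave keeps up, so no backlog builds
    # After elapsed_s seconds the slave has applied slave_rate * elapsed_s
    # events. The master produced that many events during its first
    # (slave_rate / master_rate) * elapsed_s seconds, so the slave is
    # behind by the remainder of the elapsed time.
    return elapsed_s * (1.0 - slave_rate / master_rate)


if __name__ == "__main__":
    # One hour of 51 events/s on the master against 50 events/s on the slave.
    print(round(estimated_lag_seconds(3600, 51, 50), 1))
```

Under these assumptions, a slave that is only 2% too slow ends up more than a minute behind after an hour, and a faster master widens the gap.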
There is nothing you can do right now to fix the way MySQL handles replication. If the replication threads could run in parallel, I'm guessing horrible things would happen to data integrity due to race conditions, canceled queries, slave restarts, differences in query execution times due to server load and configuration, etc. Replication is already asynchronous and prone to getting out of sync (hint: use the Maatkit tools by Baron Schwartz, specifically mk-table-checksum and mk-table-sync, to sync up your slaves).
To see if a slave is lagging, execute the SHOW SLAVE STATUS command and look at the Seconds_Behind_Master value. The way this value is calculated can be ambiguous, so I'll explain. It is simply the difference between 2 timestamps – the time of the last received (and queued up in the relay log) query that already executed on the master and the time of the currently executing query on the slave. Thus this value does not tell you how long catching up will actually take (it is possible to catch up to the master much faster); it's an approximation, or a special metric if you will, that helps point out problems.
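If you want to watch this value from a script, one option is to capture the vertical output of the command (for example from `mysql -e "SHOW SLAVE STATUS\G"`) and parse it. The sketch below is a minimal illustration, not a monitoring tool: the function names and the sample text are made up, and it assumes the usual `Field: value` layout of the vertical output.

```python
def parse_slave_status(text):
    """Parse the vertical output of SHOW SLAVE STATUS into a dict of strings."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            status[key.strip()] = value.strip()
    return status


def seconds_behind_master(text):
    """Return the lag as an int, or None if the field is missing or NULL."""
    raw = parse_slave_status(text).get("Seconds_Behind_Master")
    if raw is None or raw.upper() == "NULL":
        # MySQL reports NULL when the slave SQL thread is not running.
        return None
    return int(raw)


# Hypothetical, heavily trimmed sample output used only for illustration.
SAMPLE = """\
*************************** 1. row ***************************
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
    Seconds_Behind_Master: 42
"""

if __name__ == "__main__":
    print(seconds_behind_master(SAMPLE))
```

Treating NULL as "unknown" rather than zero matters here, because a stopped replication thread should not look like a slave that is fully caught up.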
So what can you do if you start hitting replication lag? This is the ultimate question, and the answer depends on your application. Here are the things I came up with after dealing with MySQL for a few years (there are undoubtedly other techniques, but these all come from my own experience):
- Normalize your data, if it is not already. Non-normalized tables lead to repetition, which is generally considered bad practice. More data means more IO in most cases. There can be cases, however, where you can normalize too much. Having JOINs is much slower than not having them, and it can hurt your queries if you JOIN a lot. Finally, the extreme case is mentioned at highscalability.com: How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale. "You–pause for dramatic effect–duplicate data instead of normalize it. *shudder*". Flickr is provided as an example.
- Shard (meaning, slice) your data, horizontally and vertically. For example, you can horizontally partition by some sort of key, hash, username, or other properties. You can also partition vertically by moving some table columns out into other tables or databases. As an example, if you had a database of videos, storing view counts, number of favorites, etc. alongside the rest of the video data is OK, but if these fields receive frequent updates, you are bound to have slave lag. Instead, separate them into dedicated stats table(s). You don't have to shard all of your data – even sharding the most active bits helps immensely (for example, you can choose to shard your stats tables and leave the main one alone).
- Upgrade machines running MySQL (first slaves, then master, for the reasons given above). 99% of the time, disk IO is the bottleneck, CPU being the other 1%. Move to RAIDed setups (RAID10 or RAID0) with 6-10 15K RPM SCSI or SSD drives. Add a lot of RAM. Make sure you're running a 64-bit OS if you have more than 3GB of RAM, so that the mysqld process can use more of it. My search for the best MySQL server under $10K may be of some help here.
- Separate your applications onto different MySQL instances. If you are running separate applications A, B, and C that don't depend on each other, consider giving them their own machines; otherwise, a single long-running UPDATE or INSERT query in application A will delay all writes by applications B and C. This is actually quite common – even though the server may not appear to be loaded, the annoying slave delay will still show its cowardly tail. I want to highlight this again: the replication thread is shared between all databases on the server.
- Another solution to the problem in the previous item is running multiple MySQL instances on one machine, provided that the machine isn't already overloaded. In that case, installing more than one mysqld daemon would separate the replication threads and allow running multiple applications, like A, B, and C, on one machine without affecting each other. MySQL Sandbox achieves just that – it is my preferred solution.
- Split up longer running queries into shorter ones. This should be pretty straightforward – a single query on 10 million rows may run for a few hours. Splitting it into batches of 50,000 rows, for example, will give other queries a chance to run in between. Of course, you should take care of data integrity and generally double check what you are doing.
- Don't overload the same slave by sending all queries to it, as that will just make matters worse. You can spread the queries using round-robin DNS (eww), round-robin within the application logic (better), smarter application logic, like checking slave load and status from time to time, or my personal favorite – using MySQL Proxy and having it pick the least lagging slave for you. An official solution utilizing MySQL Proxy, called MySQL load balancer, is apparently in the works (I was promised beta access but haven't got it so far).
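Two of the items above, splitting a long query into batches and picking the least lagging slave in application logic, are easy to sketch in code. The following is a rough illustration under stated assumptions, not a drop-in solution: the table name `video_stats`, the integer `id` column, the slave names, and all function names are hypothetical, and the SQL is built from trusted values only to keep the example short.

```python
def batch_ranges(min_id, max_id, batch_size):
    """Yield inclusive (start, end) id ranges that cover min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def batched_updates(table, set_clause, min_id, max_id, batch_size):
    """Build one short UPDATE per id range instead of one huge UPDATE.

    Assumes the table has an indexed integer `id` column and that the
    identifiers passed in are trusted (this does no escaping).
    """
    return [
        f"UPDATE {table} SET {set_clause} WHERE id BETWEEN {start} AND {end}"
        for start, end in batch_ranges(min_id, max_id, batch_size)
    ]


def least_lagging(lag_by_slave):
    """Return the slave with the smallest known lag, or None if none is known.

    lag_by_slave maps a slave name to its lag in seconds, or to None when
    the lag is unknown (for example, replication is stopped on that slave).
    """
    known = {name: lag for name, lag in lag_by_slave.items() if lag is not None}
    if not known:
        return None
    return min(known, key=known.get)


if __name__ == "__main__":
    for sql in batched_updates("video_stats", "views = 0", 1, 120000, 50000):
        print(sql)
    print(least_lagging({"slave1": 12, "slave2": 3, "slave3": None}))
```

Each short statement passes through the replication thread quickly, so other writes get a turn between batches. Slaves with unknown lag are skipped rather than treated as caught up.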
As a bonus, I wanted to throw in this idea of helping minimize a certain corner case cause of slave delay and feed it to the hungry MySQL minds. I'm not sure if it is mentioned anywhere else, as I have not Googled it. If it's a widely known fact, then I will consider this post as just adding my vote to the usefulness of the technique.
The reasoning is simple: INSERTs propagate to all the slaves and have to run on a master. SELECTs can run on any slave and don't propagate anywhere, so if only 0.01% of the queries result in new rows, this technique will get rid of a lot of unnecessary slave query traffic.
Well, there you have it. Comments are open, so feel free to share your own replication strategies and thoughts about mine.