13

MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It


Posted by Artem Russakovskii on September 5th, 2008 in Databases, Programming

Updated: September 16th, 2012

http://forge.mysql.com/w/images/1/1e/Dolphin_Laptop_cropped-386x222.jpgSlave delay can be a nightmare. I battle it every day and know plenty of people who curse the serialization problem of replication. For those who are not familiar with it, replication on MySQL slaves runs commands in series – one by one, while the master may run them in parallel. This fact usually causes bottlenecks. Consider these 2 examples:

  • Between 1 and 100 UPDATE queries are constantly running on the master in parallel. If the slave IO is only fast enough to handle 50 of them without lagging, as soon as 51 start running, the slaves starts to lag.
  • A more common problem is when one query takes an hour to run (let's say, it's an UPDATE with a big WHERE clause that doesn't use an index). In this case, the query runs on the master for an hour, which isn't a big problem because it doesn't block other queries. However, when the query moves over to the slaves, all of them start to lag because it plugs up the single replication thread.

Sidenote: when I hear an argument that a master has to be the most powerful machine in the group, I cringe at the logic.

If the master can crunch more INSERTs/UPDATEs after an upgrade to a better machine, then replication will fall behind even faster.

There is nothing you can do right now to fix the way MySQL handles replication. If the replication threads could run in parallel, I'm guessing horrible things would happen to the data integrity due to race conditions, canceled queries, slave restarts, differences in query execution times due to server load and configuration, etc. Replication is already an asynchronous, prone to getting out of sync business (hint: use maatkit tools by Baron Schwartz and specifically mk-table-checksum and mk-table-sync to sync up your slaves).

In order to see if a slave is lagging, execute the 'show slave status' command and look for the Seconds_Behind_Master value. The way this value is calculated can be slightly ambiguous and unclear, so I'll explain. It is simply a difference between the 2 timestamps – the time of the last received (and queued up in the relay log) query that already executed on the master and the time of the currently executing query on the slave. Thus this value is not real time (it is possible to catch up to the master much faster); it's an approximation, or special metrics if you will, that helps point out problems.

So what can you do if you start hitting replication lag? This is the ultimate question, and the answer depends on your application. Here are the things I came up with after dealing with MySQL for a few years (there are undoubtedly other techniques, but these all come from my own experience):

  1. Normalize your data, if it is not already. Non-normalized tables lead to repetition and is generally considered bad practice. More data – more IO in most cases. There can be cases, however, where you can normalize too much. Having JOINs is much slower than not having them, and it can hurt your queries if you JOIN a lot. Finally, the extreme case is mentioned at highscalability.com: How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale. "You–pause for dramatic effect–duplicate data instead of normalize it. *shudder*". Flickr is provided as an example.
  2. Shard (meaning, slice) your data, horizontally and vertically. For example, you can horizontally partition by some sort of key, hash, username, or other properties. You can also partition vertically by moving out some table columns into other databases. As an example, if you had a database of videos, storing view counts, number of favorites, etc. is OK but if these fields receive a lot of frequent updates, you are bound to have slave lag. Instead, separate these into a dedicated stats table(s). You don't have to shard all of your data – even sharding the most active bits helps immensely (for example, you can choose to shard your stats tables and leave the main one alone).
  3. Upgrade machines running MySQL (first slaves, then master, for the reasons given above). 99% of the time, disk IO is the bottleneck, CPU being the other 1%. Move to RAIDed setups (RAID10 or RAID0) with 6-10 15K RPM SCSI or SSD drives. Add a lot of RAM. Make sure you're running a 64 bit OS if you have more than 3GB of RAM, so that the mysql process may utilize more of it. My search for the best MySQL server under $10K may be of some help here.
  4. Separate your applications onto different MySQL instances. If you are running separate applications A, B, and C that don't depend on each other, consider giving them their own machines, otherwise a single long-running UPDATE or INSERT query in application A will delay all writes by application B and C. This is actually quite common – even though the server may not appear to be loaded, the annoying slave delay will still show its cowardly tail. I want to highlight this again: the replication thread is shared between all databases on the server.
  5. Another solution to (4) is multiple MySQL instances, granted that the MySQL machine isn't generally overloaded already. In that case, installing more than 1 mysql daemons would separate replication threads and allow running multiple applications, like A, B, and C on one machine, without affecting each other. MySQL sandbox achieves just that – it is my preferred solution.
  6. Split up longer running queries into shorter ones. This should be pretty straightforward – a single query on 10 million rows may run a few hours. Splitting it into batches of 50,000, for example, will give other queries a chance to run in between. Of course, you should take care of data integrity and generally double check what you are doing.
  7. Don't overload the same slave by sending all queries to it, as it will just make the matter worse. You can round-robin the queries using either round-robin DNS (eww), round-robin within the application logic (better), smarter application logic, like checking slave load and status from time to time, or my personal favorite – using MySQL proxy and having it pick the least lagging slave for you. An official solution utilizing mysql proxy, called MySQL load balancer, is apparently in the works (I was promised beta access but haven't got it so far).

As a bonus, I wanted to throw in this idea of helping minimize a certain corner case cause of slave delay and feed it to the hungry MySQL minds. I'm not sure if it is mentioned anywhere else, as I have not Googled it. If it's a widely known fact, then I will consider this post as just adding my vote to the usefulness of the technique.

Tip: If you have replication setups that use a lot of INSERT commands and you expect that most of such INSERTs would dupe with existing data (and you are using INSERT IGNORE, not REPLACE), consider replacing such queries with SELECTs, followed by only necessary INSERTs.

The reasoning is simple: INSERTs propagate to all the slaves and have to run on a master. SELECTs can run on any slave and don't propagate anywhere, so if only 0.01% of the queries result in new rows, this technique will get rid of a lot of unnecessary slave query traffic.

Well, there you have it. Comments are open, so feel free to share your own replication strategies and thoughts about mine.

● ● ●

Artem Russakovskii is a San Francisco programmer, blogger, and future millionaire (that last part is in the works). Follow Artem on Twitter (@ArtemR) or subscribe to the RSS feed.

In the meantime, if you found this article useful, feel free to buy me a cup of coffee below.



Share
  • http://patg.net CaptTofu

    One more thing to get rid of slave lag:

    Get rid of FULLTEXT and use Sphinx.

  • http://blog.onefreevoice.com Gregory Haase

    This is assuming that your slave lag occurs in the slave's SQL thread. It's a good idea to check your IO thread and make sure you don't have any lag there. Otherwise you can be misled by the fact that Seconds_Behind_Master will sometimes show as 0 in some instances where the SQL thread has caught up to the IO thread and the IO thread is behind (SQL thread only knows about statements that are in the relay log, so if relay log is fully processed and waiting for I/O, as far as replication knows, it's all caught up).

    On the master issue: SHOW MASTER STATUS
    On the slave issue: SHOW SLAVE STATUS

    Compare "File" on Master Status to "Master_Log_File" on Slave Status; "Position" on Master Status to "Read_Master_Log_Pos" on Slave Status.

  • http://beerpla.net Artem Russakovskii

    @Gregory
    good point, though such cases are mostly rare unless the network is really, really congested.

    @CaptTofu
    FULLTEXT will slow down indexing, excellent point as well if you're using it.

  • http://dailyvim.blogspot.com Travis Whitton

    Be cautious using MySQL Proxy. I evaluated the current release as a load balancing solution here at work, and it would randomly lockup after a few days of running. I tried it on multiple machines and experienced the same issues. It's a neat piece of software, but be aware that it's not even beta quality yet. Nice article though.

  • Sean

    Gregory – that is a necessary check to have as I've experienced that problem when network related issues happen (iptables restart, bugs in load balancer failovers, etc). A Nagios script exists to handle what you're talking about:
    http://openquery.com.au/resources/tools/nagios_scripts

  • http://www.sheeri.com Sheeri Cabral

    These are all very good points….to make it simpler, there are only a few things you can do to write scale your system:

    1) beef up the system(s) so it can accept more writes — your #3.

    2) accept the limits of your systems, and don't try to overload them with more writes than they can handle — your #2, 4, 5 and 7

    3) make reads and writes faster, so your system has better throughput — your #1 and 6.

    Your "bonus" case is sadly an example of people not thinking, and being lazy in a bad way. How many times does an application do bulk data inserts with a large percentage of duplicate data? Those queries should be written smarter to begin with…

  • http://beerpla.net Artem Russakovskii

    @Sheeri
    I've seen a lot of cases where people preferred to have the database handle insert duplication (due to unique key or other constraints), rather than checking for existing data and then inserting. It is simpler (for them) to do 1 query rather than 2.

  • http://beerpla.net Artem Russakovskii

    @Travis, @Sean
    thanks, I just dug your comments out of Spam Karma spam harvest. I'm going to look into the settings because it's the 4th recent comment wrongly marked as spam.

  • http://www.nomadicnation.com/ nomadhacker

    I've been in the process of moving my site over to a vps and administering it myself. In so doing, I've been scouring the internet for info to help improve mysql performance and stumbled on your blog. I've still got to go through all your db posts but I've found another bookmark. Thanks.

  • http://beerpla.net Artem Russakovskii

    @nomadhacker
    you're welcome.

  • donny

    I'm new to PHP and recently setup my local machine with PHP and MySQL for doing development. I was sort of stuck when I needed to post my work for the user to test and review. After looking around a bit I found a site that hosts PHP and MySQL apps. I was surprised that it was free – it seems they're offering the service at no cost until 2012. At that point they'll change over to a fee-based service. However, in the meantime, it's a great place to do anything from demo and sandbox right up to posting sites for real.

    Their pitch is as follows:

    "This is absolutely free, there is no catch. You get 350 MB of disk space and 100 GB bandwidth. They also have cPanel control panel which is amazing and easy to use website builder. Moreover, there is no any kind of advertising on your pages."

    Check it out using this link:

    http://www.000webhost.com/83188.html

    Important: There's one catch in that you must make sure you visit the account every 14 days – otherwise the account is marked 'Inactive' and the files are deleted!!!

    Thanks and good luck!

  • Robert Cross

    Very informative, but keep in mind that you may also suffer from faulty replication lag, just due to clocks being off. It's described in greater detail here: http://www.memfill.com/tech/mysql-replication-lag/

  • james ransom

    You can always do a "show slave status" to see how far your machine is behind. You can also use php / perl to detect how far the machine is behind and send yourself a text message. You can expand on this and detect which queries caused the lag. Here is the code:

    http://www.jamesransom.net/?p=196