Introduction

A question I recently saw on Stack Overflow titled Faster way to delete matching [database] rows? prompted me to organize my thoughts and observations on the subject and quickly jot them down here.

Here is the brief description of the task: say, you have 2 MySQL tables a and b. The tables contain the same type of data, for example log entries. Now you want to delete all or a subset of the entries in table a that exist in table b.

Solutions Suggested By Others

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);
DELETE a FROM a INNER JOIN b on a.id=b.id;
DELETE FROM a WHERE id IN (SELECT id FROM b)

The Problem With Suggested Solutions

Solutions above are all fine if the tables are quite small and the SELECT/JOIN is fast. However, in large scale situations with replication, these queries could potentially lock up the tables from writes and severely aggravate slave lag because, as I mentioned in the MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It post, replication is single-threaded.

Thus, if a single UPDATE/DELETE query takes a considerable amount of time, when it propagates to the slaves, they will be stuck executing it and doing nothing else, lagging behind more and more.

My Thoughts And Solution

I have personally dealt with having to delete many rows from one table that exist in another and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. This technique most importantly will improve replication slave lag.

So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time). This will achieve the following:

  • each one of the delete statements will not lock the table for too long, thus not letting replication lag get out of control. It is especially important if you rely on your replication to provide you relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can adjust it to be smaller without touching any DB structures.
  • another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't for whatever reason use the best DB indexes. If the SELECT is inner to a DELETE, when the whole statement migrates to the slaves, it will have to do the SELECT all over again, potentially lagging the slaves because of how long that SELECT will take. If you use a separate SELECT query, this problem goes away, as all you're passing to the DELETE query is a list of IDs.

Do you have another opinion or see a fault with my logic? Feel free to share in the comments.

P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.

● ● ●
Artem Russakovskii is a San Francisco programmer and blogger. Follow Artem on Twitter (@ArtemR) or subscribe to the RSS feed.

In the meantime, if you found this article useful, feel free to buy me a cup of coffee below.