
Updated: June 1st, 2008

Recently I ran into major problems using GNU diff. It would crash with "diff: memory exhausted" after only a few minutes of trying to process the differences between a couple of 4.5GB files. Even a beefy box with 9GB of RAM would run out of memory in minutes.

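There is a different solution, however, that does not depend on file size: rdiff. It is built on librsync, an independent implementation of the rsync rolling-checksum algorithm (it shares no code with rsync itself). You can read about it here: http://en.wikipedia.org/wiki/Rsync (search for rdiff).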

The upsides of rdiff are:

  • with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It has never crashed to date.
  • it is also MUCH faster than diff.
  • rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program

The downsides of rdiff are:

  • it's not part of a standard Linux/UNIX distribution; you have to install the librsync package.
  • delta files rdiff produces have a slightly different format than diff's.
  • delta files are slightly larger (but not significantly enough to care).
  • rdiff takes a slightly different approach to generating a delta, which is both good and bad: 2 steps are required. The first step produces a special signature file. The second step creates the delta from that signature with another rdiff call (all shown below). While the 2-step process may seem annoying, it has the benefit of producing deltas faster than diff does. In fact, you can pipe the first step into the second without any trouble, which is what I ended up doing.

Usage:

$ rdiff signature ORIGINAL.txt SIGNATURE.sig
 
$ ls -lh SIGNATURE.sig
-rw-r--r-- 1 user users 25M 2008-04-23 22:32 SIGNATURE.sig
 
$ rdiff delta SIGNATURE.sig MODIFIED.txt DELTA.rdiff
 
$ ls -lh DELTA.rdiff
-rw-r--r-- 1 user users 82M 2008-04-23 22:36 DELTA.rdiff

And here's what you would do to reassemble MODIFIED.txt:

$ rdiff patch ORIGINAL.txt DELTA.rdiff MODIFIED_REASSEMBLED.txt
 
$ ls -l *.txt
-rw-r--r-- 1 user users 4,471,493,588 2008-04-23 20:24 MODIFIED.txt
-rw-r--r-- 1 user users 4,471,493,588 2008-04-23 22:44 MODIFIED_REASSEMBLED.txt
-rw-r--r-- 1 user users 4,403,302,981 2008-04-23 20:20 ORIGINAL.txt

Just as expected, everything matches: MODIFIED_REASSEMBLED.txt is exactly the same size as MODIFIED.txt.
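
Matching sizes are a good sign, but they do not prove the contents are identical. A minimal sketch of a stricter check, assuming the file names from the listing above and a standard cmp utility (same_file is a hypothetical helper name, not part of rdiff):

```shell
# Hypothetical helper: succeeds only if two files are byte-for-byte identical.
same_file() {
    # cmp -s prints nothing and exits 0 only when the files match exactly
    cmp -s -- "$1" "$2"
}

# Only run the check when both files from the listing above are present.
if [ -f MODIFIED.txt ] && [ -f MODIFIED_REASSEMBLED.txt ]; then
    if same_file MODIFIED.txt MODIFIED_REASSEMBLED.txt; then
        echo "identical"
    else
        echo "files differ"
    fi
fi
```

cmp reads both files as streams, so it should not hit the memory problem that diff does on files of this size.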

Now, all of this could have been done in one go like this:

rdiff signature ORIGINAL.txt | rdiff delta -- - MODIFIED.txt DELTA.rdiff

As for my own use of rdiff, I was doing CSV dumps of certain fields from a MySQL database, like so:

SELECT * FROM table WHERE some_condition='1' ORDER BY id DESC INTO OUTFILE '/home/dump/dump.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';

and then applying rdiff to get the [quite small] daily deltas.
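
A rough sketch of how such a daily job could look, using the same rdiff signature and rdiff delta calls as above. The function name daily_delta, the directory layout, and the file names previous.csv, previous.sig and delta-DATE.rdiff are all made up for illustration. The rm calls are there because, as far as I know, newer rdiff releases refuse to overwrite an existing output file unless forced.

```shell
# Hypothetical daily job; names and layout are illustrative only.
# Usage: daily_delta DUMP_DIR TODAY_DUMP
daily_delta() {
    dir=$1
    today=$2
    prev="$dir/previous.csv"                      # yesterday's dump (baseline)
    sig="$dir/previous.sig"                       # temporary signature file
    delta="$dir/delta-$(date +%Y-%m-%d).rdiff"    # today's delta

    if [ -f "$prev" ]; then
        # Clear stale outputs first, in case rdiff will not overwrite them
        rm -f -- "$sig" "$delta"
        # Step 1: signature of the old dump; step 2: delta against the new one
        rdiff signature "$prev" "$sig" || return 1
        rdiff delta "$sig" "$today" "$delta" || return 1
        rm -f -- "$sig"
    fi

    # Today's dump becomes the baseline for tomorrow's delta
    cp -- "$today" "$prev"
}
```

On the first run there is no baseline yet, so the function only stores the dump. From the second run on, each call leaves one dated delta file that rdiff patch can apply to the previous day's dump.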

That's all folks!

● ● ●

Artem Russakovskii is a San Francisco programmer, blogger, and future millionaire (that last part is in the works). Follow Artem on Twitter (@ArtemR) or subscribe to the RSS feed.


17 Responses to “A Better diff Or What To Do When GNU diff Runs Out Of Memory ("diff: memory exhausted")”

  1. Thank you for commenting on the continuation of the Lazyest Gallery project.

  2. nailbiter says:

    Just a nitpick: rdiff and rsync don't share any code. rdiff's backend, librsync, just happens to be another implementation of the rsync running-checksum algorithm.

    The delta files generated by "rsync --write-batch" are also very different to the ones rdiff produces.

  3. @nailbiter
    Thanks for the correction – I just assumed that rsync and librsync are related for obvious reasons.

  4. Ofer says:

    Does anyone know how to view the Delta output "diff"-style? or any other style?

  5. Fred says:

    I also noticed that this problem is specific to GNU diff. For instance on AIX 5.3 the System V version of diff doesn't have the same issue on the same pair of files.

  6. Harry says:

    Your post was very informative, thanks.

  7. Michael Anderson says:

    I get lots of crap when I'm comparing two large database dumps that are 30 seconds apart:

    – Table structure for tO6`^@^@'¨^@N6½^@^@^H^@O6½^P^@^@^A^@N6¾8^@h^@N6¾8^@X^@O6¿`^@^@^G^H^@N6ÆP^@^H^@O6Æp^@^@^_è^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6æ0^@(^@N6çè^@È^@N6è¨^@^H^@N6è¨^@^H^@O6èÀ^@^@^BP^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@N6ë^H^@^H^@O6ë`^@^@^F8^@N6ñP^@^P^@O6ñ¨^@^@^K^@N6ý^@^@ ^@O6ýX^@^@^Bp^@N6þ^@^@^P^@N6ÿØ^@^P^@N6ÿÀ^@^H^@N6þ^@^@^H^@O6ÿø^@^@^C¸^@N7^C^H^@¨^@N7^C^H^@¨^@N7^C^H^@^P^@O7^E^P^@^Aÿà^@N.@^@^H^@O9^Dø^@^@^T`^@N9^Q^@^P^@N9^Yh^@^@N9^R0^@^P^@N9^Z^P^@^H^@N9^RH^@^P^@O9^Z(^@^@*À^@N^AJp^@^H^@O9Dð^@^@^L^X^@N9'^P^@^H^@O9Q^P^@^@Ah^@N9M(^@^H^@O9^@^@bP^@N-ÑØ^@^H^@O9ôØ^@^@^V^@N.U^@^H^@O:^K`^@^@^M^@N:^Wè^@^H^@O:^Xð^@^@ ¸^@N9Ap^@^H^@O:"°^@^@^K^@N9Hø^@^H^@O:.8^@^@^Q¨^@N:>È^@^H^@O:?è^@^@^Nh^@N9_È^@^H^@O:NX^@^@^B^P^@N:-0^@^H^@N:Pp^@h^@N9D^@^H^@O:Pà^@^@^CÐ^@N:)^@^H^@O:T¸^@^@ ^H^@N:r^H^@^H^@N:tÈ^@À^@N:50^@^H^@O:u^@^@^G^P^@N9Ø^@^H^@N:|¨^@`^@N:^Xp^@^H^@O:}^P^@^@^A8^@N9LÐ^@^H^@N:~P^@^X^@N:^UØ^@^H^@O:~p^@^@^A^@^@N.À^@^H^@N:^?x^@P^@N9ø^@^H^@O:^?Ð^@^@^A`^@N9^P^@^H^@N:8^@¨^@N:~x^@^H^@O:è^@^@^Q^P^@N:?0^@^H^@N:^@^@^H^@N^Aê ^@^H^@O:^P^@^@^MP^@N-Ð^@^H^@O: h^@^@^SÐ^@N9¾(^@^H^@O:´@^@^@^F0^@N:h^@^H^@O:ºx^@^@^Z ^@N:p^@^P^@O:Ô¨^@^@^Gè^@N9´Ø^@^P^@O:Ü ^@^@^C^@N:Ð^@^H^@O:à(^@^@^Cp^@N9^¨^@^H^@O:ã ^@^@|`^@N:ø8^@^P^@O;`^P^@^@R ^@N:b ^@^H^@O;²¸^@^\ô^@B^GRT INTO `url_alias` VALUES (221693,'node/207969','connect/companies/lee-associates','und');

    • Michael Anderson says:

      Just to make it clear:

      # head /tmp/db1.sql
      -- MySQL dump 10.13 Distrib 5.5.22, for Linux (x86_64)

      # head /tmp/db2.sql
      -- MySQL dump 10.13 Distrib 5.5.22, for Linux (x86_64)

      ..
