A Better diff Or What To Do When GNU diff Runs Out Of Memory ("diff: memory exhausted")
Monday, May 12th, 2008
Recently I ran into major problems using GNU diff. It would crash with "diff: memory exhausted" after only a few minutes trying to process the differences between a couple 4.5GB files. Even a beefy box with 9GB of RAM would run out of it in minutes.
There is a different solution, however, that is not dependent on file sizes. Enter rdiff - rsync's backbone. You can read about it here: http://en.wikipedia.org/wiki/Rsync (search for rdiff).
The upsides of rdiff are:
- with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It never crashed to date.
- it is also MUCH faster than diff.
- rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
- it's not part of standard Linux/UNIX distribution - you have to install the librsync package.
- delta files rdiff produces have a slightly different format than diff's.
- delta files are slightly larger (but not significantly enough to care).
- a slightly different approach is used when generating a delta with rdiff, which is both good and bad - 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (all shown below). While the 2-step process may seem annoying, it has the benefits of providing faster deltas than when using diff. In fact, you can pipe the first step into the second one without any trouble if you want, which is what I ended up doing).
Usage:
1 2 3 4 5 6 7 8 9 | $ rdiff signature ORIGINAL.txt SIGNATURE.sig $ l -h SIGNATURE.sig -rw-r--r-- 1 archon810 users 25M 2008-04-23 22:32 SIGNATURE.sig $ rdiff delta SIGNATURE.sig MODIFIED.txt DELTA.rdiff $ l -h DELTA.rdiff -rw-r--r-- 1 archon810 users 82M 2008-04-23 22:36 DELTA.rdiff |
And here's what you would do to reassemble MODIFIED.txt:
1 2 3 4 5 6 | $ rdiff patch ORIGINAL.txt DELTA.rdiff MODIFIED_REASSEMBLED.txt $ l *.txt -rw-r--r-- 1 archon810 users 4,471,493,588 2008-04-23 20:24 MODIFIED.txt -rw-r--r-- 1 archon810 users 4,471,493,588 2008-04-23 22:44 MODIFIED_REASSEMBLED.txt -rw-r--r-- 1 archon810 users 4,403,302,981 2008-04-23 20:20 ORIGINAL.txt |
Just as expected - everything matches.
Now, all of this could have been done in one go like this:
1 | rdiff signature ORIGINAL.txt | rdiff delta -- - MODIFIED.txt DELTA.rdiff |
As far as my usage of such a useful diff program, I was doing CSV dumps of certain fields from a MySQL database, like so:
1 | SELECT * FROM table WHERE some_condition='1' ORDER BY id DESC INTO OUTFILE '/home/dump/dump.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'; |
and then applying rdiff to get the [quite small] daily deltas.
That's all folks!
Sun Definitely Developing A Phone This Year
Monday, April 21st, 2008
One thing that still springs to mind when I think of the MySQL User Conference last week is Sun's opening keynote. While talking about Sun's market penetration with open source software, Jonathan Schwartz, Sun's CEO, slipped in a short mention of the mobile market saying something along the lines of "Sun is going to be entering the mobile market later on this year". He didn't spend more than 5 seconds talking about it, moving on to the acquisition of MySQL.
Last year, Sun already made an announcement of JavaFX, a Java-based mobile platform but didn't provide any concrete timelines, so I was excited to hear the more on the subject. With Apple iPhone's advent last year and Google entering the same space later on this year with Android, Sun's addition to the game definitely won't hurt consumers. After all, competition usually leads to better products.
So when is the phone coming, Jonathan? My guess is you're going to try as hard as possible to compete with Google's second half of 2008 timeline but are you going to manage to beat it? And will it blow us away? I guess time will show.
MySQL Conference Liveblogging: Optimizing MySQL For High Volume Data Logging Applications (Thursday 2:50PM)
Thursday, April 17th, 2008
- http://en.oreilly.com/mysql2008/public/schedule/detail/874
- presented by Charles Lee of Hyperic
- Hyperic has the best performance with MySQL out of MySQL, Oracle, and Postgres in their application
- I suddenly remember hyperic was highly recommended above nagios in MySQL Conference Liveblogging: Monitoring Tools (Wednesday 5:15PM)
- performance bottleneck
- the database
- CPU
- memory
- IO
- disk latency
- network latency
- slow queries
- media size deployment example
- 300 platforms (300 remote agents collecting data)
- 2,100 servers
- 21,000 services (10 services per server), sounds feasible
- 468,000 metrics (20 metrics per service)
- 28,800,000 metric data rows per day
- larger deployments have a lot more of these (sounds crazy)
- data
- measurement_id
- timestamp
- value
- primary key (timestamp, measurement_id)
- data flow
- agent collects data and sends reports to server with multiple data points
- server batch inserts metric data points
- if network connection fails, agent continues to collect but server "backfills" unavailable
- when agent reconnects, spooled data overwrite backfilled data points (why not use REPLACE for all inserts?)
- things are very basic so far
- batch insert
- INSERT INTO TABLE (a,b,c) VALUES (0,0,0), (1,1,1),…
- using MySQL batch insert statements vs prepared statements with multiple queries in other databases seems to improve overall performance by 30%
- batch inserts are limited by 'max_allowed_packet'
- other options for increasing insert speed
- set unique_checks=0, insert, set unique_checks=1 (definitely need to make sure data is valid first)
- set foreign_key_checks=0, insert, set foreign_key_checks=1 (same concerns as above)
- Hyperic doesn't use the 2 above
- INSERT … ON DUPLICATE KEY UPDATE
- when regular INSERT fails, retry batch with INSERT ON DUPLICATE KEY syntax
- it's much slower but it allows
- this is all basic, where are the performance tweaks?!
- batch aggregate inserter
- queue metric data from separate agent reports
- minimize number of inserts, connections, CPU load
- maximize workload efficiency
- optimal configuration for 700 agents
- 3 workers
- 2000 batch size seems to work best
- queue size of 4,000,000
- this seems to peak at 2.2mil metric data inserts per minute
- data consolidation
- inspired by rrdtool
- lower resolution tables track min, avg, and max
- data compression runs hourly
- size limit 2 days
- every hour, data is rolled up into another table that holds hourly aggregated values with size limit 14 days, then that one gets rolled up into a monthly table, etc
- this is is a good approach if you don't care about each data point
- I'm overwhelmed by the amount of "you know"s from the speaker. Parasite words, ahh! Sorry Charles
- software partitioning
- measurement data split into 18 tables, representing 9 days (2 per day)
- they didn't want to do more than 2 SELECTs to get data per day, hence such sharding
- oddly, Charles didn't actually use the word 'shard' once
- tables truncated, rather than deleting rows => huge performance boost
- truncation vs deletion
- deletion causes contention on rows
- truncation doesn't produce fragmentation
- truncation just drops and recreates the table - single DDL operation
- indexes
- every InnoDB table has a special index called the clustered index (based on primary key) where the physical data for the rows is stored
- advantages
- selects faster - row data is on the same page where the index search leads
- inserts in (timestamp) order - avoid page splits and fragmentation
- shows comparison between non-clustered index and clustered index (see slides)
- still no mention of configuration tweaks
- UNION ALL works better than inner SELECTS because the optimizer didn't optimize them enough (at least in the version these guys are using, not sure which)
- recommended server options are on the very last slide, I was waiting for those the most! I guess I'll look up the slides after
MySQL Conference Liveblogging: MySQL Hidden Treasures (Thursday 11:55PM)
Thursday, April 17th, 2008
- Damien Seguy of Nexen Services presents
- easiest session of all (phew, that's a relief)
- clever SQL recipes
- tweaking SQL queries
- shows an example where SELECT is ORDERED by a column that is actually an enum.
- an enum is both a string and a number
- sorted by number
- displayed as string
- can be sorted by string if it's cast as string
- compact column
- compacts storage
- faster to search
- if (var)char is turned into enum, some space can be saved, shows example
- random order
- order by rand(1) - obviously
- the integer parameter is actually a seed
- really slow, also obviously, especially for larger tables because it has to order first, then apply rand() to the list
- another solution is to add an extra column, put random values into it, and add index, then
- UPDATE tbl SET chaos=RAND();
- random extraction
- SELECT id, col FROM tbl JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id) FROM tbl)) AS r)
AS r2 ON id=r; - for multiple rows, Damien offers a solution involving an integer table (see his own slides). This solution is supposed to be really fast.
- integer table
- generate a table with only 10 integers
- do cross joins to get 100, 10,000, etc integers
- 10mil rows takes only 13 seconds (on his laptop), then either store it in another table or just use the result set
- yet again, code is shows, so see slides
- generating alphabet
- shows a handy query
- generating text
- offers a long query that uses ELT(), CONCAT(), and RAND() (see slides)
- a SELECT that builds words uses GROUP_CONCAT()
- looks pretty useful
- ASCII art
- SELECT version, REPEAT('*', percentage * 2) AS bars FROM php_versions;
- shows a nifty result set with stars as fillers in percentage bars
- shows another result set, similar to the one above, with spaces as fillers and stars at the end
- shows another example that looks like some sort of a spaceship, from mysql forge
- GROUP_CONCAT (can I get a woot? I love this function)
- CONCAT() and CONCAT_WS() now for groups
- ORDER BY
- SEPARATOR
- limited to 1kb by default, change group_concat_max_len
- shows example (see slides)
- how to get a second to last value?
- uses GROUP_CONCAT, looks horribly long and inefficient, there has to be a better way
- somebody suggests using ORDER BY with LIMIT
- rankings
- uses MySQL vars
- not as many people as I thought seem to know about it, judging by the low amount of raised hands
- SET @var:=3
- SELECT @var;
- variables are not shared between connections, destroyed at the end of the connection
- uses LEAST()
- see slides
- agile loading
- SET @i:=0
- LOAD DATA INFILE '/tmp/stats.txt'
INTO TABLE statPHPload
(@date, @php, @country, @continent)
SET
id = 0,
period = date(STR_TO_DATE(@date, '%d-%b-%y')),
rank = (@i := @i + 1),
php = CAST(REPLACE(@php, ',', '.') AS DECIMAL),
country = @country;
MySQL Conference Liveblogging: Monitoring Tools (Wednesday 5:15PM)
Wednesday, April 16th, 2008
Updated: April 18th, 2008
- Tom Hanlon of MySQL presents
- monitoring tool basics
- SHOW FULL PROCESSLIST
- SHOW GLOBAL STATUS
- SHOW GLOBAL VARIABLES
- basic tools
- mysqladmin is provided with the server
- mysqladmin -i 10 extended status: will repeat the same command every 10 seconds. Pipe through grep "and smoke it" (bad pun, hah hah)
- -r: show only changed values
- MySQL Administrator
- mysqladmin is provided with the server
- cacti
- rrdtool based network graphing tool
- uses snmp
- PHP apache and MySQL based solution
- MySQL plugins, download and install
- "poller" gathers data and populates the graphs
- someone offers munin as an alternative
- not snmp based, its own agent is used
- pros
- cacti is fairly easy to configure
- cons
- could be CPU intensive with lots of machines (Perl polling seems to be the problem)
- zenoss
- complete network monitoring tool
- AJAX, integrated with Google Maps (if you don't know where the servers are, everyone laughs at this comment)
- innotop
- monitoring innodb, complex output
- developed by Baron Schwartz
- mytop
- similar to Linux's top
- filter connections on database/host being used
- queries per sec
- mtstat-mysql
- nagios
- nagios is more of a notification tool, although it can do graphs
- pain to set up, text config files, but still the most widely used tool, should be better in upcoming versions
- NagiosQL is a front end tool for nagios administration (somebody from the audience pointed this out, that's completely new to me!)
- I offer zabbix as an alternative
- hyperic is another, apparently very easy to set up, with service autodiscovery
- MySQL Enterprise Monitor (not free)
- graphs, monitoring, GUI
- built by MySQL folks themselves
- agent based
- well organized
- webyog, sqlyog, monyog, mon, heartbeat, moodss, fiveruns are added by the audience members

(No Ratings Yet)
beer planet is Artem Russakovskii's blog. Artem is a software engineer at