					MySQL Conference Liveblogging: Optimizing MySQL For High Volume Data Logging Applications (Thursday 2:50PM)
Posted by Artem Russakovskii on April 17th, 2008 in Databases
- http://en.oreilly.com/mysql2008/public/schedule/detail/874
 - presented by Charles Lee of Hyperic
 - of MySQL, Oracle, and Postgres, Hyperic gets the best performance from MySQL in their application
 - I suddenly remember Hyperic was highly recommended over Nagios in MySQL Conference Liveblogging: Monitoring Tools (Wednesday 5:15PM)
 - performance bottleneck: the database
   - CPU
   - memory
   - IO
   - disk latency
   - network latency
   - slow queries
 - medium-size deployment example
   - 300 platforms (300 remote agents collecting data)
   - 2,100 servers
   - 21,000 services (10 services per server), sounds feasible
   - 468,000 metrics (20 metrics per service)
   - 28,800,000 metric data rows per day
   - larger deployments have a lot more of these (sounds crazy)
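That daily volume is easier to reason about as a sustained insert rate; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check: the per-minute insert rate implied by the
# 28,800,000 metric data rows per day quoted above.
rows_per_day = 28_800_000
minutes_per_day = 24 * 60          # 1,440
rows_per_minute = rows_per_day // minutes_per_day
print(rows_per_minute)             # 20000 rows per minute, sustained
```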
 - data
   - measurement_id
   - timestamp
   - value
   - primary key (timestamp, measurement_id)
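A minimal sketch of such a table, using SQLite as a stand-in for MySQL; the column types and table name are my assumptions, since the talk only lists the columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Metric data table as described above; types are assumed.
conn.execute("""
    CREATE TABLE measurement_data (
        timestamp      INTEGER NOT NULL,   -- collection time (epoch seconds)
        measurement_id INTEGER NOT NULL,
        value          REAL,
        PRIMARY KEY (timestamp, measurement_id)
    )
""")
conn.execute("INSERT INTO measurement_data VALUES (1000, 42, 0.5)")
# The composite primary key rejects a second row for the same metric at
# the same point in time:
try:
    conn.execute("INSERT INTO measurement_data VALUES (1000, 42, 0.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```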
 - data flow
   - agent collects data and sends reports to the server with multiple data points
   - server batch-inserts the metric data points
   - if the network connection fails, the agent continues to collect while the server "backfills" the missing points as unavailable
   - when the agent reconnects, the spooled data overwrites the backfilled data points (why not use REPLACE for all inserts?)
 - things are very basic so far
 - batch insert
   - INSERT INTO tbl (a,b,c) VALUES (0,0,0), (1,1,1), …
   - MySQL's multi-row batch inserts, versus prepared statements run once per row (as in other databases), seem to improve overall performance by 30%
   - batch inserts are limited by 'max_allowed_packet'
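A sketch of chunked multi-row inserts, again with SQLite standing in for MySQL; in MySQL the real ceiling is the statement's byte size versus max_allowed_packet, not a row count, so the chunk size below is arbitrary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")

rows = [(i, i, i) for i in range(10_000)]
BATCH = 300  # arbitrary; in MySQL, size batches so each statement
             # stays under max_allowed_packet

for start in range(0, len(rows), BATCH):
    chunk = rows[start:start + BATCH]
    # build one multi-row statement: INSERT INTO t (a,b,c) VALUES (?,?,?),(?,?,?),...
    placeholders = ",".join("(?,?,?)" for _ in chunk)
    params = [v for row in chunk for v in row]
    conn.execute(f"INSERT INTO t (a,b,c) VALUES {placeholders}", params)

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 10000
```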
 - other options for increasing insert speed
   - SET unique_checks=0, insert, SET unique_checks=1 (definitely need to make sure the data is valid first)
   - SET foreign_key_checks=0, insert, SET foreign_key_checks=1 (same concerns as above)
   - Hyperic doesn't use the two above
   - INSERT … ON DUPLICATE KEY UPDATE
     - when a regular INSERT fails, retry the batch with the INSERT … ON DUPLICATE KEY UPDATE syntax
     - it's much slower, but it lets the spooled data overwrite the backfilled points
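A sketch of that retry pattern. SQLite spells the upsert ON CONFLICT … DO UPDATE; MySQL's equivalent, INSERT … ON DUPLICATE KEY UPDATE, is noted in a comment. Table and column names are assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE md (
    timestamp INTEGER, measurement_id INTEGER, value REAL,
    PRIMARY KEY (timestamp, measurement_id))""")

# backfilled (unavailable) points inserted while the agent was offline
conn.executemany("INSERT INTO md VALUES (?,?,?)",
                 [(1000, 1, 0.5), (1000, 2, 0.7)])

# agent reconnects: its spooled data must overwrite the backfilled rows;
# try the fast plain INSERT first, fall back to the slower upsert on failure
spooled = [(1000, 1, 0.9), (1000, 3, 0.2)]
try:
    conn.executemany("INSERT INTO md VALUES (?,?,?)", spooled)
except sqlite3.IntegrityError:
    conn.executemany(
        # MySQL: INSERT ... ON DUPLICATE KEY UPDATE value = VALUES(value)
        """INSERT INTO md VALUES (?,?,?)
           ON CONFLICT(timestamp, measurement_id)
           DO UPDATE SET value = excluded.value""",
        spooled,
    )
print(conn.execute(
    "SELECT value FROM md WHERE measurement_id = 1").fetchone()[0])  # 0.9
```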
 - this is all basic, where are the performance tweaks?!
 - batch aggregate inserter
   - queues metric data from separate agent reports
   - minimizes the number of inserts, connections, and CPU load
   - maximizes workload efficiency
   - optimal configuration for 700 agents
     - 3 workers
     - a batch size of 2,000 seems to work best
     - queue size of 4,000,000
   - this seems to peak at 2.2 million metric data inserts per minute
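A minimal, single-threaded sketch of the queue-and-drain idea; the real inserter presumably runs its workers concurrently, and the function names here are mine:

```python
from collections import deque

BATCH_SIZE = 2_000      # batch size the talk found to work best
MAX_QUEUE = 4_000_000   # queue bound from the talk

queue = deque()

def enqueue(points):
    """Agent reports land here; points beyond the bound are dropped."""
    for p in points:
        if len(queue) < MAX_QUEUE:
            queue.append(p)

def drain_one_batch(insert_fn):
    """One worker pass: pull up to BATCH_SIZE points and batch-insert them."""
    batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
    if batch:
        insert_fn(batch)
    return len(batch)

# demo: 5,000 queued points drain as batches of 2,000 + 2,000 + 1,000
enqueue(range(5_000))
batches = []
while queue:
    drain_one_batch(batches.append)
print([len(b) for b in batches])  # [2000, 2000, 1000]
```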
 - data consolidation
   - inspired by rrdtool
   - lower-resolution tables track min, avg, and max
   - data compression runs hourly
   - the raw table has a size limit of 2 days
   - every hour, data is rolled up into another table that holds hourly aggregated values with a size limit of 14 days; that one, in turn, gets rolled up into a monthly table, etc.
   - this is a good approach if you don't care about each individual data point
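The hourly rollup can be sketched as a single aggregating INSERT … SELECT, with SQLite standing in for MySQL and the table/column names assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE md_raw (timestamp INTEGER, measurement_id INTEGER, value REAL);
    CREATE TABLE md_1h  (hour INTEGER, measurement_id INTEGER,
                         minval REAL, avgval REAL, maxval REAL);
""")
conn.executemany("INSERT INTO md_raw VALUES (?,?,?)",
                 [(10, 1, 1.0), (1800, 1, 3.0), (3500, 1, 2.0)])

# the hourly compression pass: collapse raw points into min/avg/max per hour
conn.execute("""
    INSERT INTO md_1h
    SELECT timestamp / 3600, measurement_id, MIN(value), AVG(value), MAX(value)
    FROM md_raw
    GROUP BY timestamp / 3600, measurement_id
""")
print(conn.execute("SELECT minval, avgval, maxval FROM md_1h").fetchone())
# (1.0, 2.0, 3.0)
```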
 - I'm overwhelmed by the number of "you know"s from the speaker. Filler words, ahh! Sorry Charles 🙂
 - software partitioning
   - measurement data is split into 18 tables, representing 9 days (2 tables per day)
   - they didn't want more than 2 SELECTs per day to fetch data, hence such sharding
   - oddly, Charles never actually used the word 'shard'
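One plausible reading of that scheme in code: a timestamp maps to one of 18 rotating half-day tables. The table name and the exact rotation are my assumptions:

```python
NUM_TABLES = 18             # 2 tables per day x 9 days
SLOT_SECONDS = 12 * 3600    # each table covers half a day

def table_for(ts: int) -> str:
    """Map an epoch-seconds timestamp to its half-day measurement table."""
    return f"measurement_data_{ts // SLOT_SECONDS % NUM_TABLES}"

print(table_for(0))                  # measurement_data_0
print(table_for(13 * 3600))          # measurement_data_1
print(table_for(9 * 24 * 3600))      # measurement_data_0  (wraps after 9 days)
```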
   - tables are truncated rather than having rows deleted => huge performance boost
   - truncation vs deletion
     - deletion causes contention on rows
     - truncation doesn't produce fragmentation
     - truncation just drops and recreates the table – a single DDL operation
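MySQL's TRUNCATE TABLE is effectively the drop-and-recreate below, sketched in SQLite (which has no TRUNCATE statement) with an assumed table name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_slot_0 (timestamp INTEGER, measurement_id INTEGER, value REAL)")
conn.executemany("INSERT INTO md_slot_0 VALUES (?,?,?)",
                 [(i, 1, 0.0) for i in range(1000)])

# expire a whole half-day slot: drop and recreate the table in one shot
# (what MySQL's TRUNCATE does) instead of a DELETE that touches every row
conn.executescript("""
    DROP TABLE md_slot_0;
    CREATE TABLE md_slot_0 (timestamp INTEGER, measurement_id INTEGER, value REAL);
""")
print(conn.execute("SELECT COUNT(*) FROM md_slot_0").fetchone()[0])  # 0
```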
 - indexes
   - every InnoDB table has a special index called the clustered index (based on the primary key), where the physical row data is stored
   - advantages
     - SELECTs are faster – the row data is on the same page the index search leads to
     - inserts in (timestamp) order avoid page splits and fragmentation
   - shows a comparison between a non-clustered index and a clustered index (see slides)
 - still no mention of configuration tweaks
 - UNION ALL works better than inner SELECTs because the optimizer didn't optimize them enough (at least in the version these guys are using; not sure which)
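A sketch of what that looks like against the half-day tables: one SELECT per table glued with UNION ALL, rather than a nested subquery (table names assumed, SQLite standing in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE md_slot_0 (timestamp INTEGER, measurement_id INTEGER, value REAL);
    CREATE TABLE md_slot_1 (timestamp INTEGER, measurement_id INTEGER, value REAL);
    INSERT INTO md_slot_0 VALUES (100, 1, 0.5);
    INSERT INTO md_slot_1 VALUES (50000, 1, 0.7);
""")

# a time range spanning two half-day tables: UNION ALL the per-table SELECTs
rows = conn.execute("""
    SELECT timestamp, value FROM md_slot_0 WHERE measurement_id = 1
    UNION ALL
    SELECT timestamp, value FROM md_slot_1 WHERE measurement_id = 1
    ORDER BY timestamp
""").fetchall()
print(rows)  # [(100, 0.5), (50000, 0.7)]
```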
 - the recommended server options are on the very last slide; I was waiting for those the most! I guess I'll look up the slides afterwards
 
