MySQL Conference Liveblogging: Introduction To The BLOB Streaming Project (Wednesday 3:00PM)
Wednesday, April 16th, 2008
- Paul McCullagh presents
- BLOB
- invented by Jim Starkey
- Basic Large OBject
- Binary Large OBject
- photos, films, mp4 files, pdfs, etc
- how MySQL handles BLOBs
- mysql client send buffer -> receive buffer on the server (max_allowed_packet)
- streaming a BLOB
- continuous data stream
- stream BLOB data directly in and out of the database
- store BLOBs of any size (>4GB) in the database
- create a scalable back-end that can handle any throughput and storage requirements. Wouldn't need to know in advance how big the database will get
- provide an open system that can be used by all engines
- provide extensions for BLOB streaming to existing MySQL clients
- why put BLOBs in the database?
- referential integrity (no invalid references), can take a lot of time to sort out missing data
- all data in one place - structured and unstructured. Good for testing
- small BLOBs are handled better by databases
- backups are consistent
- BLOBs can be replicated
- why "not to BLOB" (currently)?
- a BLOB column makes a table slow
- big rows in memory
- sequential scans are not possible
- database becomes too big
- cannot be copied easily
- backups become slow
- space not freed on delete
- database doesn't scale well
- solution to these problems
- a separate BLOB repository, outside of database rows
- references are stored in the table
- allows for incremental backups
- automatic defrag and compaction
- BLOBs not written to binlogs
- repository can be scaled-out
- Paul shows BLOB streaming architecture
- BLOB storage engine (MyBS, heh heh, Baron especially likes this acronym)
- PBXT - streaming enabled engine or other storage engines
- HTTP-based API extension
- BLOB streaming engine
- does not provide conventional table storage (CREATE TABLE syntax only for system tables)
- built-in HTTP server (port 8080 by default)
- BLOB repository
- provides a server-side API
- Temp BLOB timeout
- BLOBs that are not referenced are deleted from the repository
- initially all BLOBs don't have references to them yet
- timeout is used (mysql_temp_blob_timeout)
- Paul shows a very interesting demo
- creates a table of type PBXT
- uploads an image into the BLOB engine directly (using curl)
- gets back a reference id for the newly inserted BLOB
- accesses the image right in the browser using a GET request with the reference id from above
- Paul does a SELECT on the BLOB engine and gets back the earlier inserted BLOB with such things as BLOB size, content type (MIME), and EXIF data
- now Paul shows what happens if you insert a BLOB and don't access it. It then gets automatically purged from the BLOB engine after the BLOB timeout (mentioned above)
- I like what I'm seeing so far. Very interesting work, Paul. Definitely worth checking out the actual slides
- client-side extensions
- JDBC already works get/setBinaryStream(), get/setBlob()
- Google Summer of Code 2008 to add connector to PHP
- still to be done
- mysqlclient library - simple extension to do HTTP GET and PUT
- other languages (Perl, Ruby, etc)
- backup
- mysqldump will dump repository names and references
- future plans
- distributed repository
- someone asks a question about BLOB repository's performance and benchmarks
- Paul has a convenient benchmark of a BLOB engine compared to xfs
- on the chart, creating and reading BLOBs stays practically the same no matter how many objects are present (in the engine or on the file system), while the file system starts high but quickly degrades almost to 0
- Why does the file system degrade to 0? Did you put all the files in the same directory? This looks very odd
- the next slide shows that when adding threads, the BLOB engine yet again performs the same and scales well while the file system starts degrading after 4-8 threads
- why is the performance better than, let's say, lighttpd pulling from a file system?
- http://www.blobstreaming.org
- http://sourceforge.net/projects/mybs
- http://pbxt.blogspot.com
MySQL Conference Liveblogging: MySQL Performance Under A Microscope: The Tobias And Jay Show (Wednesday 2:00PM)
Wednesday, April 16th, 2008
- Jay Pipes, Tobias Asplund
- Finding out the number of rows that would have been returned (MyISAM and InnoDB)
- SQL_CALC_FOUND_ROWS and FOUND_ROWS()
- COUNT(*)
- MEMORY table
- if query cache is on, then it makes no difference
- if it's off
- Memory MyISAM is fastest
- FOUND_ROWS() is slightly slower than count(*)
- more in the slides that I'll add later
- quite a lot of humor, these guys are fun
- query union vs index_merge union
- SELECT … WHERE a UNION SELECT … WHERE b
vs
SELECT … WHERE a AND b - index_merge wins
- composite index vs index merge
- composite index is faster
- of course, multiple indexes are more flexible than composite index
- sort union vs composite index
- …
- unix time (int unsigned) vs datetime
- query cache disabled
- 100k rows
- now and 20 years back
- retrieval range between 50 and 365 days
- all data fits into memory
- UNIX_TIMESTAMP is worst
- DATETIME in the middle
- strtotime() fastest
- is BENCHMARK() a good simulation test?
- NOW()
- MONTH(NOW())
- MONTH('YYYY-MM-DD')
- DATE_FORMAT()
- SUBSTRING()
- all about the same if php benchmark script is used
- BENCHMARK() is very unreliable, the results vary a lot
- X in Y how many times?
- 3 different methods on mysql forge
- 2 UDFs and 1 using native functions
- query cache off
- 1K rows
- all rows have "/"
- count the number of "/"
- Marcelo's solution "COUNT_CHAR_OCCURRENCE": 4th place
- Garrett Hill's solution "countstring" (it's technically slightly wrong, because it doesn't account for multibyte characters): 2nd place
- Roland Bouman's solution using mysql functions: 1st place
- Jay's way: function in php: 3rd place
- MEMORY engine HASH index vs BTREE index
- BTREE is faster
- HASH uses less memory
MySQL Conference Liveblogging: Applied Partitioning And Scaling your (OLTP) Database System (Wednesday 11:55AM)
Wednesday, April 16th, 2008
- Phil Hilderbrand of thePlatform for Media, Inc presents
- classic partitioning
- old school - union in the archive tables
- auto partitioning and partition pruning
- great for data warehousing
- query performance improved
- maintenance is clearly improved
- design issues in applying partitioning to OLTP (On-Line Transaction Processing)
- often id driven access vs date driven access
- 1 big clients could be 80% of the whole database, so there's a difficulty selecting partitioning schemes
- partitioning is only supported starting from MySQL 5.1
- understanding the benefits
- reducing seek and scan set sizes
- improving inserts/updates durations
- making maintenance easier
- shows an EXPLAIN output for SELECTS on non-partitioned and partitioned tables. The results are significantly better for partitions
- OPTIMIZE TABLE on an unpartitioned table takes 1.14s
- ALTER TABLE REBUILD PARTITION p1; on a partitioned table takes 0.03s
- ALTER TABLE REBUILD PARTITION p1, p2, p3, p4, …, p10; takes 0.27s
- design consideration
- table sizes and predicted growth patterns - partition big tables and also partition in advance, if you predict quick growth
- access patterns - select what you want to partition by in a smart way, date, id, etc
- keys and indexes - there are a few restrictions, foreign keys are currently not supported
- availability requirements
- manageability considerations - choosing to partition by hash if there is a TON of data
- reuse / scope considerations - think ahead, think of the usage
- partitioning methods
- range partitioning
- data usually accessed by date
- limited number of primary partitions needed
- ordered intelligent keys
- support sub partitions
- list partitioning
- grouping data in partitions out of order (1,5,7 in partition x)
- limited number of primary partitions needed
- intelligent keys
- supports sub partitions
- hash partitioning
- low maintenance
- works with limited or large number of partitions
- non-intelligent keys (can work in some cases with intelligent keys)
- key partitioning
- non-integer based partitioned keys (MySQL converts to int for you)
- low maintenance
- hash partitioning example
- hash(mod%num_partitions)
- in this example, Phil has stores, employees, and inventory. He decided to partition by store.
- http://dev.mysql.com/doc/refman/5.1/en/partitioning-management-hash-key.html
- 50 stores
- ALTER TABLE my_store PARTITION BY HASH(id) PARTITIONS 50;
- ALTER TABLE my_employee PARTITION BY HASH(store_id) PARTITIONS 50;
- ALTER TABLE my_inventory PARTITION BY HASH(store_id) PARTITIONS 50;
- ALTER obviously takes a long time and blocks (grr)
- adding partitions
- ALTER TABLE my_store ADD PARTITION PARTITIONS 2;
- ALTER TABLE my_employee ADD PARTITION PARTITIONS 2;
- ALTER TABLE my_inventory ADD PARTITION PARTITIONS 2;
- ALTER takes some time again, though less (how come if the partitions are empty?)
- SELECT table_name, partition_name, table_rows FROM information_schema.partitions … shows info on partitions
- remove 4 partitions
- ALTER TABLE my_store COALESCE PARTITION 4;
- ALTER TABLE my_employee COALESCE PARTITION 4;
- ALTER TABLE my_inventory COALESCE PARTITION 4;
MySQL Conference Liveblogging: Portable Scale-out Benchmarks For MySQL (Wednesday 10:50AM)
Wednesday, April 16th, 2008
- Robert Hodges from Continuent presents
- About Continuent
- leading provider of open source database availability and scaling solutions
- solutions
- uni/cluster - multi-master database clustering that replicates data across multiple databases and load balances reads
- uses "database virtualization"
- scale-out design motivation
- protection from db and site failures
- continuous operation during upgrades
- how come not everyone has it already?
- creating identical replicas across different hosts is hard
- Brewer's conjecture
- trade-offs
- DDL support
- inconsistent reads between replicas
- deadlocks
- sequences
- non-deterministic SQL
- therefore many scale-out approaches are non-transparent
- 3 basic scale-out technologies
- data replication
- where are updates processed? master/master vs master/slave
- when are updates replicated? sync vs async
- group communication - coordinates messages between distributed processes
- views - who is active, who is crashed, do we have quorum, etc
- message delivery - ordering and delivery guarantees
- proxying - virtualizes databases and hides database locations from applications
- latency, performance?
- 3 replication algorithms
- master/slave - accept updates at a single master and replicate changes to one or more slaves
- multi-master state machine - deliver a stream of updates in the same order simultaneously to a set of databases
- certification - optimistically execute transactions on one of a number of nodes and then apply to all nodes after confirming serialization. Currently not in MySQL but developed by Continuent (presenter's company)
- performance testing strategy
- run appropriate tests
- mixed load tests to check overall throughput and scaling
- micro-benchmarks to focus on specific issues
- use appropriate workloads
- scale-out use profiles are often read or write intensive
- cover key issues
- read latency through proxies
- read and write scaling
- slave latency for master/slave configurations
- group communication and replication bottlenecks
- aborts and deadlocks
- generate sufficient load in the right places
- many transactions/queries
- large data sets
- data types
- Bristlecone
- http://bristlecone.continuent.org
- open source
- svn checkout svn://forge.continuent.org/bristlecone/trunk/bristlecone bristlecone
- load test
- batch transaction loading
- micro-benchmarks
- Bristlecone Load Testing: Evaluator
- Java tool to generate mixed load on databases
- similar to pgbench but works cross-DBMS (how about sysbench?)
- can easily vary mix of select, insert, update, delete statements
- default select statement designed to "exercise" the db
- can choose lightweight queries as well
- parameters are defined in a simple config file
- can generate reports
- shows sample config file (xml) that generates 500 clients, lasts 600 seconds. Looks quite simple but very proprietary. Examples are included in the download.
- Evaluator Graphical Output
- shows a graph of requests/s and response time, very standard looking, updates live while the test is running, last 10 minutes are visible.
- Bristlecone Micro-Benchmarks: Benchmark
- Java tool to test specific operations while systematically varying parameters
- benchmarks run "scenarios" - specialized Java classes with interfaces similar to JUnit
- shows config file, java properties file this time instead of xml, you can vary a few parameters that will spawn multple variations of the test (cross join between all variations)
- current micro benchmarks
- basic read latency - low db stress
- ReadSimpleScenario
- ReadSimpleLargeScenario
- read scaling - high db stress
- ReadScalingAggregatesScenario
- ReadScalingInvertedKeysScenario
- write latency and scaling - low/high stress
- …
- deadlocks - variable transaction lenghts
- DeadLockScenario
- TPC-B scenario will be added shortly
- shows html output, simple table layout, easy to look at or load into a pivot table in Excel
- Bristlecone Testing Examples
- shows a mixed load query throughput test output graph between a standalone server and a 2-node cluster. cluster is approximately twice as productive as the standalone server
- shows a mixed load query response test output between the same standalone server and a 2-node cluster. The standalone server is visibly choking while the cluster is smooth
- shows a proxy query throughput against MySQL 5.1.23, MySQL Proxy 0.6.1, Myosotis Connector proxy, and uni/cluster proxy. MySQL 5.1.23 is significantly faster than any proxy. MySQL Proxy is the worst performing one, even though it's written in C and the others are in Java. Robert thinks it's due to Java handling multi-threading better than C
- shows a read scaling test output for a query that does SELECT COUNT(*) with 200 rows. MySQL 5.1.23 beats uni/cluster proxy until it passes 4 threads, where the proxy beats it.
- All tests used InnoDB
- shows a MySQL replication master overhead test results comparing a inserts per second on a single master vs a master with a slave. The master with a slave is about 30% slower. Peter Zaitsev raises an interesting question of the differences between just having the binlog turned on vs having it turned on AND a slave replicating. These differences weren't tested by the presenter and he's unsure on the result
- shows a replication latency MySQL vs Postgres test results, in which Postgres actually kicks MySQL's ass. A replica with default InnoDB settings performs very badly compared to tweaked settings (about 70% slower)
MySQL Conference Liveblogging: Disaster Is Inevitable - Are You Prepared? (Tuesday 4:25PM)
Tuesday, April 15th, 2008
- Suicide
- having no backups
- depending on slaves for backup
- keeping backups on same SAN
- having a single DBA - Frank didn't like this one at all
- not keeping binlogs
- Restoring from backup
- how much time?
- uncompressed backup ready to mount?
- separate network for recovery?
- In Fotolog, 1TB of data was severely hit.
- first problem: backup was highly compressed (tar.gz)
- uncompressing took hours
- so keep uncompressed backups (at least last N days)
- it should be mountable, rather than transferable
- Frank going over recovery modes at http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
- Row by row recovery
- row by row recovery (get the range of ids)
- custom scripts
- may not be able to use primary key
- foreign key based retrieval faster
- lose 4 seconds for each crashed record (in Fotolog, for some reason some values were crashing mysqld)
- Lessons
- SANs make sense (in some environments)
- try to replicate the whole SAN (in Fotolog, a SAN actually failed because of a bug in its maintenance program)
- everything will fail at some point
- backup everything (cron jobs, my.cnf, custom scripts)
- have backup in a form ready to restore
- don't count replication a backup
- be worried about 'routine' operations
- Peter Zaitsev of Percona takes the stage to talk about his homegrown tools for InnoDB recovery
- innodb-tools - will recover even if mysqld doesn't start, for example if half of RAID0 fails or somebody deleted some data. innodb-tools will recover using InnoDB tablespaces.
- We're out of time

(No Ratings Yet)
beer planet is Artem Russakovskii's blog. Artem is a software engineer at