MySQL Conference Liveblogging: Introduction To The BLOB Streaming Project (Wednesday 3:00PM) « Artem Russakovskii's programming and technology blog

MySQL Conference Liveblogging: Introduction To The BLOB Streaming Project (Wednesday 3:00PM)

Posted by Artem Russakovskii on April 16th, 2008 in Databases

Paul McCullagh presents
BLOB

invented by Jim Starkey
Basic Large OBject
Binary Large OBject
photos, films, mp4 files, pdfs, etc

how MySQL handles BLOBs

mysql client send buffer -> receive buffer on the server (max_allowed_packet)
streaming a BLOB

continuous data stream
stream BLOB data directly in and out of the database
store BLOBs of any size (>4GB) in the database
create a scalable back-end that can handle any throughput and storage requirements. Wouldn't need to know in advance how big the database will get
provide an open system that can be used by all engines
provide extensions for BLOB streaming to existing MySQL clients

why put BLOBs in the database?

referential integrity (no invalid references), can take a lot of time to sort out missing data
all data in one place – structured and unstructured. Good for testing
small BLOBs are handled better by databases
backups are consistent
BLOBs can be replicated

why "not to BLOB" (currently)?

a BLOB column makes a table slow

big rows in memory
sequential scans are not possible

database becomes too big

cannot be copied easily
backups become slow
space not freed on delete
database doesn't scale well

solution to these problems

a separate BLOB repository, outside of database rows
references are stored in the table
allows for incremental backups
automatic defrag and compaction
BLOBs not written to binlogs
repository can be scaled-out

Paul shows BLOB streaming architecture

BLOB storage engine (MyBS, heh heh, Baron especially likes this acronym)
PBXT – streaming enabled engine or other storage engines
HTTP-based API extension

BLOB streaming engine

does not provide conventional table storage (CREATE TABLE syntax only for system tables)
built-in HTTP server (port 8080 by default)
BLOB repository
provides a server-side API

Temp BLOB timeout

BLOBs that are not referenced are deleted from the repository
initially all BLOBs don't have references to them yet
timeout is used (mysql_temp_blob_timeout)

Paul shows a very interesting demo

creates a table of type PBXT
uploads an image into the BLOB engine directly (using curl)
gets back a reference id for the newly inserted BLOB
accesses the image right in the browser using a GET request with the reference id from above
Paul does a SELECT on the BLOB engine and gets back the earlier inserted BLOB with such things as BLOB size, content type (MIME), and EXIF data
now Paul shows what happens if you insert a BLOB and don't access it. It then gets automatically purged from the BLOB engine after the BLOB timeout (mentioned above)
I like what I'm seeing so far. Very interesting work, Paul. Definitely worth checking out the actual slides

client-side extensions

JDBC already works get/setBinaryStream(), get/setBlob()
Google Summer of Code 2008 to add connector to PHP
still to be done

mysqlclient library – simple extension to do HTTP GET and PUT
other languages (Perl, Ruby, etc)

backup

mysqldump will dump repository names and references

future plans

distributed repository

someone asks a question about BLOB repository's performance and benchmarks

Paul has a convenient benchmark of a BLOB engine compared to xfs
on the chart, creating and reading BLOBs stays practically the same no matter how many objects are present (in the engine or on the file system), while the file system starts high but quickly degrades almost to 0
Why does the file system degrade to 0? Did you put all the files in the same directory? This looks very odd
the next slide shows that when adding threads, the BLOB engine yet again performs the same and scales well while the file system starts degrading after 4-8 threads
why is the performance better than, let's say, lighttpd pulling from a file system?

http://www.blobstreaming.org
http://sourceforge.net/projects/mybs
http://pbxt.blogspot.com

● ● ● Artem Russakovskii is a San Francisco programmer and blogger. Follow Artem on Twitter (@ArtemR) or subscribe to the RSS feed.

In the meantime, if you found this article useful, feel free to buy me a cup of coffee below.

Tags: blob, conference, MySQL, paul mccullagh, pbxt, primebase xt, streaming

Artem Russakovskii's programming and technology blog

MySQL Conference Liveblogging: Introduction To The BLOB Streaming Project (Wednesday 3:00PM)

About Me

Pages

Categories

My Sites

Recent Comments

Artem Russakovskii's programming and technology blog

MySQL Conference Liveblogging: Introduction To The BLOB Streaming Project (Wednesday 3:00PM)

About Me

Pages

Categories

Tag Cloud

My Sites

Recent Comments