• Paul McCullagh presents
  • BLOB
    • invented by Jim Starkey
    • Basic Large OBject
    • Binary Large OBject
    • photos, films, mp4 files, pdfs, etc
  • how MySQL handles BLOBs
    • mysql client send buffer -> receive buffer on the server (max_allowed_packet)
    • streaming a BLOB
      • continuous data stream
      • stream BLOB data directly in and out of the database
      • store BLOBs of any size (>4GB) in the database
      • create a scalable back-end that can handle any throughput and storage requirements. Wouldn't need to know in advance how big the database will get
      • provide an open system that can be used by all engines
      • provide extensions for BLOB streaming to existing MySQL clients
  • why put BLOBs in the database?
    • referential integrity (no invalid references), can take a lot of time to sort out missing data
    • all data in one place – structured and unstructured. Good for testing
    • small BLOBs are handled better by databases
    • backups are consistent
    • BLOBs can be replicated
  • why "not to BLOB" (currently)?
    • a BLOB column makes a table slow
      • big rows in memory
      • sequential scans are not possible
    • database becomes too big
      • cannot be copied easily
      • backups become slow
      • space not freed on delete
      • database doesn't scale well
  • solution to these problems
    • a separate BLOB repository, outside of database rows
    • references are stored in the table
    • allows for incremental backups
    • automatic defrag and compaction
    • BLOBs not written to binlogs
    • repository can be scaled-out
  • Paul shows BLOB streaming architecture
    • BLOB storage engine (MyBS, heh heh, Baron especially likes this acronym)
    • PBXT – streaming enabled engine or other storage engines
    • HTTP-based API extension
  • BLOB streaming engine
    • does not provide conventional table storage (CREATE TABLE syntax only for system tables)
    • built-in HTTP server (port 8080 by default)
    • BLOB repository
    • provides a server-side API
  • Temp BLOB timeout
    • BLOBs that are not referenced are deleted from the repository
    • initially all BLOBs don't have references to them yet
    • timeout is used (mysql_temp_blob_timeout)
  • Paul shows a very interesting demo
    • creates a table of type PBXT
    • uploads an image into the BLOB engine directly (using curl)
    • gets back a reference id for the newly inserted BLOB
    • accesses the image right in the browser using a GET request with the reference id from above
    • Paul does a SELECT on the BLOB engine and gets back the earlier inserted BLOB with such things as BLOB size, content type (MIME), and EXIF data
    • now Paul shows what happens if you insert a BLOB and don't access it. It then gets automatically purged from the BLOB engine after the BLOB timeout (mentioned above)
    • I like what I'm seeing so far. Very interesting work, Paul. Definitely worth checking out the actual slides
  • client-side extensions
    • JDBC already works get/setBinaryStream(), get/setBlob()
    • Google Summer of Code 2008 to add connector to PHP
    • still to be done
      • mysqlclient library – simple extension to do HTTP GET and PUT
      • other languages (Perl, Ruby, etc)
  • backup
    • mysqldump will dump repository names and references
  • future plans
    • distributed repository
  • someone asks a question about BLOB repository's performance and benchmarks
    • Paul has a convenient benchmark of a BLOB engine compared to xfs
    • on the chart, creating and reading BLOBs stays practically the same no matter how many objects are present (in the engine or on the file system), while the file system starts high but quickly degrades almost to 0
    • Why does the file system degrade to 0? Did you put all the files in the same directory? This looks very odd
    • the next slide shows that when adding threads, the BLOB engine yet again performs the same and scales well while the file system starts degrading after 4-8 threads
    • why is the performance better than, let's say, lighttpd pulling from a file system?
  • http://www.blobstreaming.org
  • http://sourceforge.net/projects/mybs
  • http://pbxt.blogspot.com
● ● ●
Artem Russakovskii is a San Francisco programmer and blogger. Follow Artem on Twitter (@ArtemR) or subscribe to the RSS feed.

In the meantime, if you found this article useful, feel free to buy me a cup of coffee below.