Wishlist for data compression formats: fast seeking

We take for granted that plain files give you constant-time seek. You call fseek(), the hard drive controller gets a command telling it to seek its head to a given offset, and you read some bytes.

With files compressed with gzip and the like, there is no such ability. Gzip is essentially a VBR compression scheme, so while you could easily seek to a specific byte in the compressed stream, there is know way to automatically know how to find an offset in the uncompressed stream.

This is no reason to admit defeat, though. A strategy used by the FLAC lossless audio compression format is to build seek tables. A seek table is a simple mapping of (uncompressed stream offset -> compressed stream offset) entries. For example, an entry (1055 -> 400) in the table would mean “byte 1055 in the uncompressed stream is encoded into a block starting at byte 400 in the compressed stream.” So if you want to seek to offset X in the uncompressed stream, you find the smallest seek table entry with uncompressed stream offset <= X, physically seek to the corresponding compressed stream offset, and uncompress that block.

If you have a regularly-occurring seek table entries, this gets you constant-time seek even within your compressed files!

You could probably using the existing zlib API to build an external seek table of this sort (eg. in a different file). But it would be much nicer if this was built into the format.

This entry was posted in Uncategorized. Bookmark the permalink.

One Response to Wishlist for data compression formats: fast seeking

  1. Matt Brubeck says:

    dictzip produces a gzip file with a seek table stored internally. The comments here have some interesting details.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">