[vox-tech] reading a .gz .Z after offset

Jeff Newmiller vox-tech@lists.lugod.org
Thu, 7 Mar 2002 23:25:47 -0800 (PST)


On Thu, 7 Mar 2002, Mark K. Kim wrote:

> On Thu, 7 Mar 2002, Jeff Newmiller wrote:
> 
> > I don't think you can do seeks in a compressed file... you have to read it
> > sequentially.
> >
> > If you have a plan for dividing up the uncompressed data, perhaps you
> > should do that first and store the split data as separate files
> > (recompressed or not) for purposes of computation.
> 
> The zlib library offers a seek function in its utility function API,
> "gzseek(gzFile, z_off_t, int)".  Since the zlib compression uses the
> deflation algorithm that compresses data in blocks of a known size, it can
> find the block you're seeking, inflate just that block, and return the
> data (I'm not sure if that's how gzseek works, but I'm just sayin' it can
> be done.)  I'm sure in all Perl's ingenuity, it can be done in Perl, too.
> 
> Go Eric!  Keep looking! :)

Thanks for the sanity check, Mark...

I took a look at the source for gzseek
(http://www.cs.washington.edu/homes/suciu/XMLTK/xmill/www/XMILL/html/gzio_8c-source.html
line 675), and it just sequentially reads blocks out of the decompression
algorithm, starting from the current position (if the desired offset is
ahead of it) or rewinding to the beginning of the file (if the desired
offset is behind it).  (There is an fseek in there, but it only gets used
when the file turns out not to be compressed at all.)

I looked at Compress::Zlib, and it seems to omit the interface to gzseek
(I don't know why).  However, you could re-implement it fairly easily
using gzread: if your destination offset is d = n*m + r, read and discard
n blocks of m bytes, then r more bytes, and then begin reading your data.
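The same chunked-skip idea is easy to sketch outside Perl.  Here is a
Python version using the stdlib gzip module (the function name and chunk
size are my own choices, not anything from zlib or Compress::Zlib):

```python
import gzip
import io

def gz_read_at(fileobj, offset, length, chunk=8192):
    """Read `length` bytes at uncompressed `offset` from a gzip stream by
    sequentially reading and discarding data up to the offset -- the same
    strategy gzseek uses internally.  With offset = n*chunk + r, we throw
    away n whole chunks, then r more bytes, then read the real data."""
    gz = gzip.GzipFile(fileobj=fileobj)
    n, r = divmod(offset, chunk)
    for _ in range(n):          # discard n whole chunks...
        gz.read(chunk)
    gz.read(r)                  # ...plus the remainder
    return gz.read(length)      # now read the data we actually want

# Demo: compress some known data, then pull a slice out past an offset.
data = bytes(range(256)) * 100
buf = io.BytesIO(gzip.compress(data))
print(gz_read_at(buf, 1000, 16) == data[1000:1016])
```

The equivalent in Perl would be a loop of gzread calls into a throwaway
buffer until the running total reaches the offset.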

Thus it does require a sequential read, but that shouldn't stop you from
extracting any given segment of the file you want.  As long as your actual
data processing is more computationally expensive than the decompression,
the fact that multiple processors are sequentially decompressing and
skipping over earlier portions of the file shouldn't be a big deal.

The basic assumption remains that you know where you want to seek
to... that is, you need fixed length records or a program that finds the
appropriate seek offsets and data lengths to give to each parallel
process.
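For the fixed-length-record case, computing those offsets and lengths is
just arithmetic.  A small Python sketch (the helper name and the split
policy are my own, assuming you already know the record size and count):

```python
def worker_ranges(total_records, record_len, workers):
    """Split a file of fixed-length records into (offset, length) byte
    ranges, one per parallel worker.  Offsets always land on record
    boundaries, and the first few workers absorb any leftover records."""
    per = total_records // workers
    extra = total_records % workers       # first `extra` workers get one more
    ranges, start = [], 0
    for w in range(workers):
        count = per + (1 if w < extra else 0)
        ranges.append((start * record_len, count * record_len))
        start += count
    return ranges

# 10 records of 100 bytes split across 3 workers:
print(worker_ranges(10, 100, 3))   # [(0, 400), (400, 300), (700, 300)]
```

Each worker then does the skip-and-read described above from its own
starting offset.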

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...2k
---------------------------------------------------------------------------