What I was missing is that the codec sets the stream's buffer size to the 
value configured under IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_KEY, so the 
stream and compressor buffer sizes match closely.
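
For anyone following along, here is a rough sketch of the arithmetic as I
understand it. It is a standalone model, not Hadoop code: the class and
method names are made up, and the constants (a 512-byte buffer with 18 bytes
of overhead for the bare stream, and an overhead of bufferSize / 6 + 32 when
the Snappy codec builds the stream) are my recollection of the source and
should be treated as assumptions.

```java
// Standalone model of the buffer-size arithmetic; NOT Hadoop code.
// All names and constants here are assumptions made for illustration.
final class BlockSizeModel {
    // Assumed value of IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT (256 kB).
    static final int SNAPPY_BUFFER_SIZE_DEFAULT = 256 * 1024;

    // Assumed defaults when the stream is built without an explicit size.
    static final int STREAM_BUFFER_SIZE_DEFAULT = 512;
    static final int STREAM_OVERHEAD_DEFAULT = 18;

    // Assumed worst-case expansion allowance reserved by the Snappy codec.
    static int snappyOverhead(int bufferSize) {
        return bufferSize / 6 + 32;
    }

    // Assumed relation: input limit = buffer size minus reserved overhead.
    static int maxInputSize(int bufferSize, int overhead) {
        return bufferSize - overhead;
    }

    public static void main(String[] args) {
        int bare = maxInputSize(STREAM_BUFFER_SIZE_DEFAULT, STREAM_OVERHEAD_DEFAULT);
        int viaCodec = maxInputSize(SNAPPY_BUFFER_SIZE_DEFAULT,
                                    snappyOverhead(SNAPPY_BUFFER_SIZE_DEFAULT));
        System.out.println("bare stream MAX_INPUT_SIZE: " + bare);
        System.out.println("via codec MAX_INPUT_SIZE:   " + viaCodec);
    }
}
```

Under those assumptions the input limit is a few hundred bytes only when the
stream is constructed directly, and a little over 200 kB when the codec
constructs it.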

    - Tim.

________________________________________
From: Tim Broberg
Sent: Thursday, January 26, 2012 12:56 PM
To: common-dev@hadoop.apache.org
Subject: Snappy compression block sizes

I'm confused about the disparity of block sizes between BlockCompressorStream 
and SnappyCompressor.

BlockCompressorStream has a default MAX_INPUT_SIZE on the order of 512 bytes, 
whereas SnappyCompressor has an IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT 
of 256kB.

In BlockCompressorStream.write() (reproduced below), I see no case where we can 
ever write more than MAX_INPUT_SIZE to the compressor before calling 
compressor.finish(), flushing the output, and resetting.

So, if we only ever process 512 bytes at a time, why do we have 256k of buffer 
in the compressor?

Shouldn't we be flushing every 256kB, not every 1/2 kB?

I feel like I must be missing something obvious or this would be getting 
terrible compression since we would have only 256 bytes of compression history 
available on average in Snappy (and lz4).

What am I missing?

TIA,
    - Tim.

    long limlen = compressor.getBytesRead();
    if (len + limlen > MAX_INPUT_SIZE && limlen > 0) {
      // Adding this segment would exceed the maximum size.
      // Flush data if we have it.
      finish();
      compressor.reset();
    }

    if (len > MAX_INPUT_SIZE) {
      // The data we're given exceeds the maximum size. Any data
      // we had has been flushed, so we write out this chunk in segments
      // not exceeding the maximum size until it is exhausted.
      rawWriteInt(len);
      do {
        int bufLen = Math.min(len, MAX_INPUT_SIZE);
        compressor.setInput(b, off, bufLen);
        compressor.finish();
        while (!compressor.finished()) {
          compress();
        }
        compressor.reset();
        off += bufLen;
        len -= bufLen;
      } while (len > 0);
      return;
    }
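
To put numbers on the segmenting loop above, here is a minimal standalone
model of it. It only counts how many compressed blocks a single large write
would produce for a given input limit; it does not call the compressor, and
the class name, method name, and the sample limit of 494 bytes are
assumptions chosen for illustration.

```java
// Standalone model of the segmenting loop; NOT Hadoop code.
// Counts blocks only, with hypothetical names and sample values.
final class SegmentCountModel {
    // Number of segments a write of `len` bytes is split into when each
    // segment may carry at most `maxInputSize` bytes.
    static int countSegments(int len, int maxInputSize) {
        if (len <= 0 || maxInputSize <= 0) {
            throw new IllegalArgumentException("len and maxInputSize must be positive");
        }
        int segments = 0;
        do {
            // Mirrors: bufLen = Math.min(len, MAX_INPUT_SIZE)
            int bufLen = Math.min(len, maxInputSize);
            segments++;          // one setInput/finish/reset cycle per segment
            len -= bufLen;
        } while (len > 0);
        return segments;
    }

    public static void main(String[] args) {
        int oneMiB = 1024 * 1024;
        System.out.println("limit 494 B:  " + countSegments(oneMiB, 494) + " blocks");
        System.out.println("limit 256 kB: " + countSegments(oneMiB, 256 * 1024) + " blocks");
    }
}
```

Each segment is compressed independently after a reset, so the segment count
is also the number of times the compression history starts over.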
