Hello,

    I have been looking into the RCFile format, especially when used with
compression. A picture of the file layout, as I understand it in this case,
is attached.

    Some queries/potential issues:

    1. RCFile claims to be SequenceFile-compatible, but the recordLength
field is not the actual on-disk length of the record. As shown in the
picture, it is the uncompressed key length plus the compressed value length.
Similarly, the next field, key length, is not the on-disk length of the
compressed key.

    2. The record length is also used for seeking on the input stream; see
Reader.seekToNextKeyBuffer(). Since the record length is overstated for
compressed records, this can result in incorrect positioning.

    3. Thread safety: Is the RCFile.Reader class meant to be thread-safe?
Some public methods are marked synchronized, which gives that appearance, but
I think there are a few thread-safety issues.

        3.1. Other public methods that operate on the same data structures,
such as Reader.nextBlock(), are not synchronized.

        3.2. Callbacks such as LazyDecompressionCallbackImpl.decompress
operate on the value buffer currentValue, which can be modified
simultaneously by the public methods on the Reader.
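
    To make points 1 and 2 concrete, here is a small sketch of the arithmetic
as I understand it. It is not RCFile code: the class name, method names, and
byte counts are all made up for illustration, and it assumes that my reading
of the layout in the picture is right.

```java
// Hypothetical illustration only; none of these names exist in RCFile.
public class RecordLengthSketch {

    // Record length as described in point 1:
    // uncompressed key length + compressed value length.
    static int statedRecordLength(int uncompressedKeyLen, int compressedValueLen) {
        return uncompressedKeyLen + compressedValueLen;
    }

    // Bytes the key and value actually occupy on disk when both are compressed.
    static int onDiskRecordLength(int compressedKeyLen, int compressedValueLen) {
        return compressedKeyLen + compressedValueLen;
    }

    // How far a seek based on the stated length would overshoot (point 2).
    static int overshoot(int uncompressedKeyLen, int compressedKeyLen,
                         int compressedValueLen) {
        return statedRecordLength(uncompressedKeyLen, compressedValueLen)
                - onDiskRecordLength(compressedKeyLen, compressedValueLen);
    }

    public static void main(String[] args) {
        // Made-up sizes in bytes.
        int uncompressedKeyLen = 400;
        int compressedKeyLen = 120;
        int compressedValueLen = 9000;

        System.out.println("stated record length:  "
                + statedRecordLength(uncompressedKeyLen, compressedValueLen));
        System.out.println("on-disk record length: "
                + onDiskRecordLength(compressedKeyLen, compressedValueLen));
        System.out.println("overshoot in bytes:    "
                + overshoot(uncompressedKeyLen, compressedKeyLen, compressedValueLen));
    }
}
```

    With these numbers the stated length is 9400 bytes while the record
occupies 9120 bytes on disk, so a reader that skips by the stated length
would land 280 bytes too far, which is the difference between the
uncompressed and compressed key sizes.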

Cheers,
 Krishna
