Hello, I was looking into the RCFile format, especially when used with compression. A picture of the file layout, as I understand it in this case, is attached.
Some queries/potential issues:

1. RCFile claims to be sequence file compatible, but the recordLength is not the actual on-disk length of the record. As shown in the picture, it is the uncompressed key length plus the compressed value length. Similarly, the next field, key length, is not the on-disk length of the compressed key.

2. Record length is also used for seeking on the input stream. See Reader.seekToNextKeyBuffer(). Since record length is overstated for compressed records, this can result in incorrect positioning.

3. Thread-safety: Is the RCFile.Reader class meant to be thread-safe? Some public methods are marked synchronized, which gives that appearance, but I think there are a few thread-safety issues.

3.1. Other public methods that operate on the same data structures, such as Reader.nextBlock(), are not synchronized.

3.2. Callbacks such as LazyDecompressionCallbackImpl.decompress operate on the value buffer currentValue, which can be simultaneously modified by the public methods on the Reader.

Cheers,
Krishna
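
P.S. To make points 1 and 2 concrete, here is a toy sketch of the arithmetic as I understand it. This is not RCFile code: the class RecordLengthSketch and its methods are hypothetical, and java.util.zip.Deflater merely stands in for whatever codec is configured. It only shows that a stated length of "uncompressed key + compressed value" exceeds the bytes actually written when the key is also compressed, so a skip based on the stated length would overshoot.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical illustration only; none of these names exist in RCFile.
public class RecordLengthSketch {

    // Compress a byte array with DEFLATE (stand-in for the real codec).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns {statedRecordLength, onDiskPayloadLength} under the assumption
    // described above: stated = uncompressed key + compressed value,
    // on disk = compressed key + compressed value.
    static int[] lengths(byte[] rawKey, byte[] rawValue) {
        int compressedKeyLen = deflate(rawKey).length;
        int compressedValueLen = deflate(rawValue).length;
        int stated = rawKey.length + compressedValueLen;
        int onDisk = compressedKeyLen + compressedValueLen;
        return new int[] {stated, onDisk};
    }

    public static void main(String[] args) {
        // Zero-filled buffers compress very well, which exaggerates the gap.
        int[] l = lengths(new byte[4096], new byte[4096]);
        System.out.println("stated record length = " + l[0]);
        System.out.println("on-disk payload      = " + l[1]);
        System.out.println("overshoot if skipped = " + (l[0] - l[1]));
    }
}
```

The overshoot equals the uncompressed key length minus the compressed key length, so it grows with how well the key section compresses.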