Hi yongqiang he,

I have created a bug, https://issues.apache.org/jira/browse/HIVE-2065, to carry on the discussion. I have attached the picture there too: https://issues.apache.org/jira/secure/attachment/12474055/Slide1.png (it looks like attachments are stripped from posts here?).
Please comment there.

Cheers,
Krishna

On 3/18/11 11:47 PM, "yongqiang he" <heyongqiang...@gmail.com> wrote:

>> but the recordLength is not the actual on-disk length of the record.
It is the actual on-disk length. It is the compressed key length plus the compressed value length.

>> Similarly, the next field - key length - is not the on-disk length of the
>> compressed key.
There are two key lengths: one is the compressed key length, the other is the uncompressed key length.

For 2, it won't be a problem. The record length is the compressed length.

>> Thread-Safety.
It is not thread-safe. Applications should handle that themselves. It was initially designed for Hive. Thread safety was there at first and was later removed, because Hive does not need it and 'synchronized' may add extra overhead.

>> 3.1
Reader.nextBlock() was added later for file merge, so the normal reader should not use this method.

>> 3.2.
True.

On Fri, Mar 18, 2011 at 8:30 AM, Krishna Kumar <krish...@yahoo-inc.com> wrote:
> Hello,
>
> I was looking into the RCFile format, esp. when used with compression; a
> picture of the file layout as I understand it in this case is attached.
>
> Some queries/potential issues:
>
> 1. RCFile makes a claim of being sequence file compatible, but the
> recordLength is not the actual on-disk length of the record. As shown in the
> picture, it is the uncompressed key length plus the compressed value length.
> Similarly, the next field - key length - is not the on-disk length of the
> compressed key.
>
> 2. Record length is also used for seeking on the input stream. See
> Reader.seekToNextKeyBuffer(). Since record length is overstated for
> compressed records, this can result in incorrect positioning.
>
> 3. Thread-Safety: Is the RCFile.Reader class meant to be thread-safe?
> Some public methods are marked synchronized, which gives that appearance, but
> there are a few thread-safety issues, I think.
>
> 3.1 Other public methods that operate on the same data structures, such as
> Reader.nextBlock(), are not synchronized.
>
> 3.2. Callbacks such as LazyDecompressionCallbackImpl.decompress
> operate on the value buffer currentValue, which can be simultaneously
> modified by the public methods on the Reader.
>
> Cheers,
> Krishna
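
As an aside on points 1 and 2 of the thread, the arithmetic behind the two readings of recordLength can be sketched as a toy model. This is only an illustration: the class, the function names, and the byte counts are all hypothetical and are not taken from the RCFile source, and the sketch does not settle which reading matches the actual writer. It only shows how far a skip based on the record length would land past the end of the record if the uncompressed key length were used.

```python
from dataclasses import dataclass


@dataclass
class RecordSizes:
    """Byte counts for one compressed record (hypothetical values only)."""
    uncompressed_key_len: int
    compressed_key_len: int
    compressed_value_len: int


def record_length_question_reading(s: RecordSizes) -> int:
    # Reading described in the original question:
    # uncompressed key length plus compressed value length.
    return s.uncompressed_key_len + s.compressed_value_len


def record_length_reply_reading(s: RecordSizes) -> int:
    # Reading described in the reply:
    # compressed key length plus compressed value length (the on-disk size).
    return s.compressed_key_len + s.compressed_value_len


def seek_overshoot(s: RecordSizes) -> int:
    # Bytes by which a skip of the first reading's length would pass
    # the on-disk end of the record given by the second reading.
    return record_length_question_reading(s) - record_length_reply_reading(s)


# Made-up sizes, chosen only to make the difference visible.
sizes = RecordSizes(uncompressed_key_len=400,
                    compressed_key_len=120,
                    compressed_value_len=5000)

print(record_length_question_reading(sizes))  # 5400
print(record_length_reply_reading(sizes))     # 5120
print(seek_overshoot(sizes))                  # 280
```

Under these assumed numbers the two readings differ by exactly the amount the key shrank under compression, which is the overshoot point 2 is concerned about. The length fields themselves and any sync markers are left out of the model.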