[ https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014065#comment-13014065 ]
Krishna Kumar commented on HIVE-2065:
-------------------------------------

The RCFile layout seems to have been designed initially to be compatible with SequenceFile, but over time (especially due to the key compression enhancement?) it seems to have drifted away. The compatibility intent goes as far as keeping some boolean values (blockCompression) always false, but a couple of bugs were introduced later, so that the record length is no longer the on-disk record length and the key length field is no longer the on-disk key length.

Once I started writing a unit test to ensure that the RCFile layout stays in sync with the SequenceFile layout, I also found that the classes designated as the keyclass/valueclass are no longer able to read themselves in or write themselves out, even if properly 'primed'. That is the primary aim of the changes due to #3.

[PS. The reason I am looking into this now is to experiment with column-specific compression ('use this codec for this sorted, numeric column') or type-specific compression ('use this codec for all enumeration types of this table'). Presumably, if successful, this information will be put into metadata, as I am doing with the generic codec in the changes above.]

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per
> yongqiang he, the class is not meant to be thread-safe (and it is not). Might
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression
> happens after we have written the record length.
> {code}
>   int keyLength = key.getSize();
>   if (keyLength < 0) {
>     throw new IOException("negative length keys not allowed: " + key);
>   }
>   out.writeInt(keyLength + valueLength); // total record length
>   out.writeInt(keyLength); // key portion length
>   if (!isCompressed()) {
>     out.writeInt(keyLength);
>     key.write(out); // key
>   } else {
>     keyCompressionBuffer.reset();
>     keyDeflateFilter.resetState();
>     key.write(keyDeflateOut);
>     keyDeflateOut.flush();
>     keyDeflateFilter.finish();
>     int compressedKeyLen = keyCompressionBuffer.getLength();
>     out.writeInt(compressedKeyLen);
>     out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>   }
> {code}
> 3. For sequence file compatibility, the field immediately after the record
> length should be the compressed key length, not the uncompressed key length.
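
Points 2 and 3 come down to ordering: the header fields can only state on-disk sizes if the key is compressed before the header is written. The following is a minimal sketch of that ordering, not the attached patch. The class `RecordHeaderSketch` and its methods are hypothetical, `java.util.zip.Deflater` stands in for whatever codec is configured, and the layout is simplified to a SequenceFile-style header (record length, key length, key bytes, value bytes), leaving out the extra length field that RCFile writes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration only; none of these names exist in Hive.
class RecordHeaderSketch {

    // Compress the key bytes with DEFLATE (assumed stand-in for the real codec).
    public static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            buf.write(chunk, 0, n);
        }
        deflater.end();
        return buf.toByteArray();
    }

    // Serialize one record so that both header fields describe stored bytes.
    public static byte[] writeRecord(byte[] key, byte[] value, boolean compressed) {
        // Compress first: the stored key size must be known before the header.
        byte[] onDiskKey = compressed ? compress(key) : key;

        ByteBuffer out = ByteBuffer.allocate(8 + onDiskKey.length + value.length);
        out.putInt(onDiskKey.length + value.length); // record length as stored
        out.putInt(onDiskKey.length);                // key length as stored
        out.put(onDiskKey);                          // key bytes
        out.put(value);                              // value bytes
        return out.array();
    }

    public static void main(String[] args) {
        byte[] key = new byte[1000]; // highly compressible placeholder key
        byte[] value = "some value".getBytes(StandardCharsets.UTF_8);

        ByteBuffer in = ByteBuffer.wrap(writeRecord(key, value, true));
        System.out.println("record length field: " + in.getInt());
        System.out.println("key length field:    " + in.getInt());
        System.out.println("bytes after header:  " + in.remaining());
    }
}
```

With this ordering, the record length field equals the number of bytes that follow the two header integers, which is the property a generic SequenceFile-style reader would depend on when skipping over a record.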