[ 
https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014065#comment-13014065
 ] 

Krishna Kumar commented on HIVE-2065:
-------------------------------------

The RCFile layout seems to have been designed initially to be compatible with 
SequenceFile, but over time (esp. due to the key compression enhancement?) it 
seems to have drifted away. The compatibility intent goes as far as keeping 
boolean values that are always false (e.g. blockCompression), but a couple of 
bugs were introduced later whereby the record length is no longer the on-disk 
record length, and the key length field is no longer the on-disk key length. 
Once I started writing a unit test to ensure that the RCFile layout stays in 
sync with the SequenceFile layout, I also found that the classes designated as 
the keyclass/valueclass are no longer able to read themselves in or write 
themselves out, even if properly 'primed'. Restoring this compatibility is the 
primary aim of the changes for #3. 
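
To make the length-field point concrete, here is a minimal sketch of a 
SequenceFile-style record write in which both length fields describe the bytes 
that actually land on disk. It is not the RCFile writer and not the attached 
patch: the class RecordHeaderSketch and its methods are hypothetical, plain 
java.util.zip deflate stands in for whatever codec is configured, and the 
simplified layout (record length, key length, key bytes, value bytes) is an 
assumption that leaves out any separate uncompressed-length field.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch, not the RCFile writer: names and layout are illustrative.
final class RecordHeaderSketch {

  /** Deflates the serialized key so its on-disk size is known up front. */
  static byte[] compressKey(byte[] rawKey) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    // Closing the deflater stream finishes the compressed data.
    try (DeflaterOutputStream deflate = new DeflaterOutputStream(buffer)) {
      deflate.write(rawKey);
    }
    return buffer.toByteArray();
  }

  /** Writes one record whose length fields match the bytes that follow them. */
  static void writeRecord(DataOutputStream out, byte[] rawKey, byte[] value,
      boolean compressed) throws IOException {
    // Compress BEFORE writing any length, so the lengths can be on-disk lengths.
    byte[] onDiskKey = compressed ? compressKey(rawKey) : rawKey;
    out.writeInt(onDiskKey.length + value.length); // total on-disk record length
    out.writeInt(onDiskKey.length);                // on-disk key length
    out.write(onDiskKey);                          // key bytes as stored
    out.write(value);                              // value bytes as stored
  }
}
```

The only design point is the ordering: the key is compressed into a buffer 
first, so a reader that trusts the record length can skip exactly one record.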

[PS. The reason I am looking into this now is to experiment with column-specific 
compression ('use this codec for this sorted, numeric column') or type-specific 
compression ('use this codec for all enumeration types of this table'). 
Presumably, if successful, this information will be put into the metadata, as I 
am doing with the generic codec in the changes above.]

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per 
> yongqiang he, the class is not meant to be thread-safe (and it is not). Might 
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression 
> happens after we have written the record length.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For sequence file compatibility, the field immediately after the record 
> length should be the compressed key length, not the uncompressed key length.
