[ 
https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009981#comment-13009981
 ] 

Krishna Kumar commented on HIVE-2065:
-------------------------------------

Hmm. #3 is taking me a bit further than I originally thought. I assume being 
able to read an RCFile as a SequenceFile is required, while being able to write 
an RCFile via the SequenceFile interface is merely desirable.

I have made changes so that the record length is set correctly. For the RCFile 
to be handled correctly as a sequence file, the following changes are also 
required, IIUC.

 - the second field should be the key length (4 + compressed/plain key contents)
 - the key class (KeyBuffer) must be made responsible for reading/writing the 
next field - plain key contents length - as well as compression/decompression 
of the key contents
 - the value class (ValueBuffer) related changes will be trickier. Since the 
value is not compressed as a unit, we cannot use the record-compressed format. 
We need to mark the records as plain records, and move the codec to a metadata 
entry. Then the ValueBuffer class will work correctly with the SequenceFile 
implementation.
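
To make the first two points concrete, here is a rough, self-contained sketch 
of the record layout I have in mind. It is not the RCFile code: the class and 
method names (RecordLayoutSketch, writeRecord, compress) are made up for 
illustration, java.util.zip stands in for the Hadoop codec, and treating the 
"plain key contents length" as the uncompressed key size is my reading of the 
format. The point is only that the key is compressed before any length field 
is written, so that the record length and key length agree with what is 
actually on disk.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical illustration only; not Hive's RCFile writer.
public class RecordLayoutSketch {

    // Compress the key bytes up front so their final size is known
    // before any length field is written.
    static byte[] compress(byte[] plain) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(buf)) {
            deflate.write(plain);
        }
        return buf.toByteArray();
    }

    // Layout: record length, key length, plain key length, key bytes, value bytes.
    static byte[] writeRecord(byte[] plainKey, byte[] value, boolean compressed)
            throws IOException {
        byte[] storedKey = compressed ? compress(plainKey) : plainKey;
        // 4 extra bytes for the plain key length field that the key class owns.
        int keyLength = 4 + storedKey.length;

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(keyLength + value.length); // total record length
        out.writeInt(keyLength);                // key length as a sequence file reader sees it
        out.writeInt(plainKey.length);          // plain (uncompressed) key contents length
        out.write(storedKey);                   // compressed or plain key contents
        out.write(value);                       // value bytes, written as a plain record
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] key = new byte[256];  // highly compressible, hypothetical key contents
        byte[] value = {1, 2, 3, 4};
        byte[] record = writeRecord(key, value, true);
        System.out.println("bytes written: " + record.length);
    }
}
```

With this ordering, the bytes following the two leading ints always add up to 
the stated record length, whether or not the key is compressed.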

Thoughts? Worth it?


> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per 
> yongqiang he, the class is not meant to be thread-safe (and it is not). Might 
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression 
> happens after we have written the record length.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For sequence file compatibility, the field immediately after the record 
> length should be the compressed key length, not the uncompressed key length.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
