[ https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014272#comment-13014272 ]

He Yongqiang commented on HIVE-2065:
------------------------------------

The column-specific compression is very interesting, but it is not directly 
related to making RCFile compatible with SequenceFile. We can still do 
column-specific compression without that compatibility. 

Some input that may be useful to you:
We examined column groups, and sorted the data internally based on one column in 
each column group. (But we did not try different compression across column 
groups.) We tried this with 3-4 tables, and saw ~20% storage savings on one 
table compared to the previous RCFile. The main problem with this approach is 
that it is hard to find the correct/most efficient column group definitions.
For example, table tbl_1 has 20 columns, and the user can define:

col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

This will put col_1, col_2, col_11, and col_13 into one column group and reorder 
that group by sorting on col_1 (0 refers to the first column in the group); put 
col_3, col_4, col_15, and col_16 into another column group and reorder it by 
sorting on col_4 (index 1 in that group); and finally put all other columns into 
the default column group in their original order. It should also be easy to 
allow a different compression codec for each column group.
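
To make the semantics concrete, here is a minimal parsing sketch for that 
definition string; the class and method names (ColumnGroupDef, parse) are 
hypothetical and not part of any existing Hive code:

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: parses a column group definition string of the form
// "col_a,col_b:<sortIndex>;..." where the integer after ':' is the position
// (within the group) of the column that drives the sort.
public class ColumnGroupDef {
  final List<String> columns;   // columns in this group, in declared order
  final int sortColumnIndex;    // which column in the group drives the sort

  ColumnGroupDef(List<String> columns, int sortColumnIndex) {
    this.columns = columns;
    this.sortColumnIndex = sortColumnIndex;
  }

  static List<ColumnGroupDef> parse(String spec) {
    List<ColumnGroupDef> groups = new ArrayList<ColumnGroupDef>();
    for (String groupSpec : spec.split(";")) {
      if (groupSpec.isEmpty()) {
        continue;               // skip empty tokens defensively
      }
      int colon = groupSpec.lastIndexOf(':');
      List<String> cols = Arrays.asList(groupSpec.substring(0, colon).split(","));
      int sortIdx = Integer.parseInt(groupSpec.substring(colon + 1));
      groups.add(new ColumnGroupDef(cols, sortIdx));
    }
    return groups;
  }

  public static void main(String[] args) {
    String spec = "col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;";
    for (ColumnGroupDef g : parse(spec)) {
      System.out.println(g.columns + " sorted on " + g.columns.get(g.sortColumnIndex));
    }
  }
}
{code}

Running this on the example above prints the two explicit groups with col_1 and 
col_4 as their sort columns; any columns not listed would fall into the default 
group.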

The main blocking issue for this approach is building a full set of utilities to 
find the best column group definition.

Instead of doing that in the existing RCFile, do you think it would be better to 
explore it in the new one I just mentioned? If you find it interesting, we can 
share the existing code we have for the things I mentioned, and you could work 
on the compression codec based on the new one and provide a utility tool to find 
the best column group definition.

What do you think?

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per 
> Yongqiang He, the class is not meant to be thread-safe (and it is not). Might 
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record length is overstated for compressed files. IIUC, the key compression 
> happens after we have written the record length.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For SequenceFile compatibility, the compressed key length should be the 
> field immediately following the record length, not the uncompressed key length.
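
For reference, here is a hedged sketch (not the committed patch) of a write 
order consistent with points 2 and 3 above: the key is compressed first, so the 
record length and the length field that follows it reflect the stored 
(compressed) key size. Variable names follow the snippet quoted above.

{code}
int keyLength = key.getSize();
if (keyLength < 0) {
  throw new IOException("negative length keys not allowed: " + key);
}
int storedKeyLen = keyLength;
if (isCompressed()) {
  // compress the key up front so the lengths written below are the stored sizes
  keyCompressionBuffer.reset();
  keyDeflateFilter.resetState();
  key.write(keyDeflateOut);
  keyDeflateOut.flush();
  keyDeflateFilter.finish();
  storedKeyLen = keyCompressionBuffer.getLength();
}
out.writeInt(storedKeyLen + valueLength); // total record length, from stored sizes
out.writeInt(storedKeyLen);               // key portion length = bytes actually written
if (!isCompressed()) {
  key.write(out);                         // key
} else {
  out.write(keyCompressionBuffer.getData(), 0, storedKeyLen);
}
// If a reader still needs the uncompressed key length, it could be carried
// inside the key portion itself rather than in the record header.
{code}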

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
