[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

Owen O'Malley (JIRA) Thu, 10 Jan 2013 08:42:13 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549774#comment-13549774
 ]


Owen O'Malley commented on HIVE-3874:
-------------------------------------

He Yongqiang, the APIs to the two formats are significantly different. It would 
be possible to extend the RCFile reader to recognize an ORC file and to have it 
delegate to the ORC file reader.
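
As a rough illustration of that delegation, here is a minimal sketch that picks a reader by inspecting the first bytes of a file. It assumes the two formats can be told apart by a leading magic string; the values "ORC" and "RCF", the class FormatSniffer, and both method names are hypothetical and are not taken from the Hive code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch: decide which reader should handle a file. */
final class FormatSniffer {
    enum Format { ORC, RCFILE, UNKNOWN }

    // Assumed magic values, used here for illustration only.
    private static final byte[] ORC_MAGIC =
        "ORC".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] RCFILE_MAGIC =
        "RCF".getBytes(StandardCharsets.US_ASCII);

    /** Classifies a file from the first bytes of its header. */
    static Format detect(byte[] header) {
        if (header == null || header.length < 3) {
            return Format.UNKNOWN;
        }
        byte[] magic = Arrays.copyOf(header, 3);
        if (Arrays.equals(magic, ORC_MAGIC)) {
            return Format.ORC;
        }
        if (Arrays.equals(magic, RCFILE_MAGIC)) {
            return Format.RCFILE;
        }
        return Format.UNKNOWN;
    }

    /** Describes what an RCFile-side entry point would do with the file. */
    static String chooseReader(byte[] header) {
        switch (detect(header)) {
            case ORC:
                return "delegate to ORC reader";
            case RCFILE:
                return "use native RCFile reader";
            default:
                throw new IllegalArgumentException("unrecognized file header");
        }
    }
}
```

The check only decides where to route the file; it does not address the difference in what the two readers return.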

The other direction (having the ORC file reader parse an RCFile) isn't 
possible, because ORC provides operations that would be very expensive or 
impossible to implement in RCFile.

One concern with making the RCFile reader delegate to the ORC file reader is 
that RCFile returns binary values that are interpreted by the serde, while in 
ORC deserialization happens in the reader. Therefore, the adaptor would either 
need to re-serialize the data or require changes in the serde as well.
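
To make the re-serialization option concrete, here is a minimal sketch of what such an adaptor might do: take a row that a reader has already deserialized into objects and turn each column back into bytes for a serde to interpret. The class ReserializingAdaptor, the UTF-8 text encoding, and the handling of nulls are all assumptions chosen for illustration; a real adaptor would have to emit whatever encoding the configured serde expects.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: turn typed column values back into binary blobs. */
final class ReserializingAdaptor {

    /**
     * Re-serializes one already-deserialized row, column by column.
     * Each value is written as its UTF-8 text form (an assumed encoding).
     * A null value becomes a zero-length array, an arbitrary choice here.
     */
    static byte[][] reserialize(Object[] row) {
        byte[][] columns = new byte[row.length][];
        for (int i = 0; i < row.length; i++) {
            if (row[i] == null) {
                columns[i] = new byte[0];
            } else {
                columns[i] =
                    String.valueOf(row[i]).getBytes(StandardCharsets.UTF_8);
            }
        }
        return columns;
    }
}
```

The sketch also shows the cost: every value is decoded once by the reader, encoded again by the adaptor, and then decoded a second time by the serde.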
                
> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file
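
As an illustration of the light weight index item in the list above, here is a minimal sketch of how per-row-group minimum and maximum values could let an equality filter skip whole row groups. The classes RowGroupSkipper and Stats and the choice of min/max statistics are assumptions for this example and do not describe the actual index layout.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: skip row groups using min/max statistics. */
final class RowGroupSkipper {

    /** Assumed per-row-group statistics for a single long column. */
    static final class Stats {
        final long min;
        final long max;

        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns the indexes of row groups that may contain a row where the
     * column equals target. All other row groups can be skipped unread.
     */
    static List<Integer> candidates(List<Stats> groups, long target) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            Stats s = groups.get(i);
            if (target >= s.min && target <= s.max) {
                result.add(i);
            }
        }
        return result;
    }
}
```

A filter evaluated this way never produces false negatives: a row group is skipped only when its value range excludes the target.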

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
