[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557341#comment-13557341 ]
Owen O'Malley commented on HIVE-3874:
-------------------------------------

Joydeep, I've used a two-level strategy:
* large stripes (default 250MB) to enable large, efficient reads
* relatively frequent row index entries (default every 10k rows) to enable skipping within a stripe

The row index entries record, for each column, the positions needed to seek to the right compression block and to the right byte within the decompressed block (see the sketch at the end of this message).

I obviously did consider HFile, although from a practical point of view it is fairly embedded within HBase. Additionally, since it treats each column as opaque bytes, it can't apply any type-specific encodings or compression and can't interpret the column values, which is critical for performance.

Once you have the ability to skip large sets of rows based on the filter predicates, you can sort the table on secondary keys and achieve a large speedup. For example, if your primary partition is transaction date, you might want to sort the table on state, zip, and last name. Then, if you are looking for just the records in CA, the reader won't need to read the records for the other states.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is
> required for external indexes
> * there is no mechanism for storing lightweight indexes within the file to
> enable push-down filters to skip entire row groups
> * the types of the rows aren't stored in the file
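To make the two-level skip strategy from the comment above concrete, here is a minimal sketch of a per-column row index entry and the lookup a reader might perform when seeking to a row number. All names here (RowIndexEntry, seekEntryFor, and so on) are hypothetical illustrations of the idea, not the actual ORC reader API.

{code:java}
import java.util.List;

/**
 * Minimal sketch of the two-level skip strategy: one index entry is
 * recorded per column every ROW_INDEX_STRIDE rows, holding the positions
 * needed to resume decoding mid-stripe. Hypothetical names throughout.
 */
class RowIndexSketch {
    /** Default row index frequency from the comment above: every 10k rows. */
    static final int ROW_INDEX_STRIDE = 10_000;

    /** Positions needed to resume decoding one column mid-stripe. */
    static class RowIndexEntry {
        final long firstRow;              // first row covered by this entry
        final long compressedBlockStart;  // byte offset of the compression block in the stream
        final long uncompressedOffset;    // byte offset within the decompressed block

        RowIndexEntry(long firstRow, long compressedBlockStart, long uncompressedOffset) {
            this.firstRow = firstRow;
            this.compressedBlockStart = compressedBlockStart;
            this.uncompressedOffset = uncompressedOffset;
        }
    }

    /**
     * Find the last index entry at or before targetRow. The reader would then
     * (1) seek the file to entry.compressedBlockStart,
     * (2) decompress that single block,
     * (3) skip entry.uncompressedOffset bytes into the decompressed data, and
     * (4) decode and discard (targetRow - entry.firstRow) values,
     * so at most one compression block is decompressed and fewer than
     * ROW_INDEX_STRIDE rows are decoded unnecessarily.
     */
    static RowIndexEntry seekEntryFor(List<RowIndexEntry> index, long targetRow) {
        int lo = 0, hi = index.size() - 1, best = 0;
        while (lo <= hi) {                        // binary search on firstRow
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstRow <= targetRow) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return index.get(best);
    }
}
{code}

Recording both the compressed block offset and the offset within the decompressed data is what lets a reader land in the middle of a compression block without decompressing everything that precedes it, which is the mechanism that makes predicate-driven skipping within a stripe cheap.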