[ 
https://issues.apache.org/jira/browse/HIVE-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805650#comment-13805650
 ] 

Prasanth J commented on HIVE-5632:
----------------------------------

[~ehans] Thanks for taking a look at this patch. I will address your review 
comments in the next patch.

Regarding your question above,
ORC already stores hierarchical min/max metadata. At the lowest level, ORC 
stores min/max for every 10,000 rows (called rowgroups). The rowgroup size 
can be configured using the table property "orc.row.index.stride". At 
a higher level, HIVE-5562 adds min/max metadata at the stripe level. File-level 
min/max values are also stored in the file footer.
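
To illustrate the three levels described above, here is a minimal sketch in 
Python. It is not ORC's actual code or file layout: the names MinMax, 
rowgroup_stats, merge and build_hierarchy are hypothetical, and a single 
integer column is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Rowgroup size mentioned in the comment (table property orc.row.index.stride).
ROW_INDEX_STRIDE = 10_000


@dataclass(frozen=True)
class MinMax:
    """Min/max statistics for one integer column (hypothetical type)."""
    minimum: int
    maximum: int


def merge(stats: List[MinMax]) -> MinMax:
    """Combine lower-level statistics into one higher-level entry."""
    return MinMax(min(s.minimum for s in stats), max(s.maximum for s in stats))


def rowgroup_stats(values: List[int], stride: int = ROW_INDEX_STRIDE) -> List[MinMax]:
    """One MinMax entry per `stride` rows of a stripe."""
    groups = [values[i:i + stride] for i in range(0, len(values), stride)]
    return [MinMax(min(g), max(g)) for g in groups]


def build_hierarchy(
    stripes: List[List[int]], stride: int = ROW_INDEX_STRIDE
) -> Tuple[List[List[MinMax]], List[MinMax], MinMax]:
    """Return (rowgroup-level, stripe-level, file-level) statistics."""
    per_rowgroup = [rowgroup_stats(s, stride) for s in stripes]
    per_stripe = [merge(rg) for rg in per_rowgroup]
    file_level = merge(per_stripe)
    return per_rowgroup, per_stripe, file_level
```

Each higher level is just the merge of the level below it, which is why the 
stripe-level and file-level entries are cheap to keep in the footer.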

Stripe-level stats are stored in the file footer, so stripes that don't satisfy 
the predicates can be skipped while computing the splits. But to skip at the 
rowgroup level, each stripe has to be read and kept in memory. Since we read 
the entire stripe into memory, I am not sure that adding an additional level of 
min/max metadata (1 million rows) would be beneficial, as the skips happen 
in memory. 
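
As a rough sketch of the stripe-skipping idea (not the code in the patch), the 
check below keeps only the stripes whose min/max range can overlap a range 
predicate. StripeStats, may_match and select_stripes are hypothetical names, 
and the predicate is assumed to be a simple "lo <= col <= hi" on one integer 
column.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StripeStats:
    """Footer statistics for one stripe (hypothetical type)."""
    index: int
    minimum: int
    maximum: int


def may_match(stats: StripeStats, lo: int, hi: int) -> bool:
    """True if some row in the stripe could satisfy lo <= col <= hi.

    Min/max can only rule a stripe out; a True result does not guarantee
    that a matching row actually exists.
    """
    return stats.maximum >= lo and stats.minimum <= hi


def select_stripes(stripes: List[StripeStats], lo: int, hi: int) -> List[int]:
    """Indexes of the stripes that still have to be read."""
    return [s.index for s in stripes if may_match(s, lo, hi)]
```

Only footer statistics are consulted here, so this check can run while the 
splits are computed, before any stripe data is read.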

Both rowgroup elimination and stripe elimination will be turned on with the 
Hive config "SET hive.optimize.index.filter=true;".

> Eliminate splits based on SARGs using stripe statistics in ORC
> --------------------------------------------------------------
>
>                 Key: HIVE-5632
>                 URL: https://issues.apache.org/jira/browse/HIVE-5632
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-5632.1.patch.txt, HIVE-5632.2.patch.txt, 
> orc_split_elim.orc
>
>
> HIVE-5562 provides stripe-level statistics in ORC. Stripe-level statistics 
> combined with predicate pushdown in ORC (HIVE-4246) can be used to eliminate 
> the stripes (and thereby the splits) that don't satisfy the predicate 
> condition. This can greatly reduce unnecessary reads.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
