[ 
https://issues.apache.org/jira/browse/HIVE-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809810#comment-13809810
 ] 

Prasanth J commented on HIVE-5632:
----------------------------------

[~ehans] Row-group-level skipping (10,000 rows per row group) is already 
implemented as part of PPD. This patch adds stripe-level skipping. With this 
patch, a stripe will NOT be read if its min/max metadata shows that it cannot 
satisfy the predicate. 

To clarify: OrcInputFormat creates input splits based on the MapReduce configs 
mapred.min.split.size and mapred.max.split.size. The default 
mapred.min.split.size is 16MB and the default mapred.max.split.size is 256MB. 
If an ORC stripe is smaller than mapred.max.split.size, it is merged with the 
adjacent ORC stripe, and stripes keep being merged until mapred.max.split.size 
is reached, so a split can contain more than one ORC stripe. With this patch, 
before a stripe is merged into a split, its min/max statistics are checked 
against the predicate. If the stripe can satisfy the predicate, it is merged 
into the current split; otherwise the stripe is eliminated and a new split is 
started. The final list of input splits is then submitted for execution, which 
ensures that byte ranges (essentially ORC stripes) that are not required are 
not read.
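
The merging logic described above can be sketched roughly as follows. This is 
an illustration only, not the code in the patch: the Stripe class, may_match 
and generate_splits are hypothetical names, the predicate is simplified to a 
single-column range [lo, hi], stripes are assumed to be contiguous in the 
file, and mapred.min.split.size is ignored for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical stand-in for ORC stripe metadata (not a Hive class)."""
    offset: int   # byte offset of the stripe in the file
    length: int   # stripe length in bytes
    min_val: int  # column minimum from the stripe statistics
    max_val: int  # column maximum from the stripe statistics


def may_match(stripe: Stripe, lo: int, hi: int) -> bool:
    """True if the stripe's [min, max] range overlaps the predicate range [lo, hi]."""
    return stripe.max_val >= lo and stripe.min_val <= hi


def generate_splits(stripes: List[Stripe], lo: int, hi: int,
                    max_split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges covering only stripes that may match."""
    splits = []
    start, size = None, 0
    for s in stripes:
        if not may_match(s, lo, hi):
            # Eliminated stripe: close the current split so that the
            # stripe's byte range is never part of any split.
            if start is not None:
                splits.append((start, size))
                start, size = None, 0
            continue
        if start is None:
            start = s.offset  # this stripe begins a new split
        size += s.length
        if size >= max_split_size:
            # Split is full: emit it and begin a new one at the next stripe.
            splits.append((start, size))
            start, size = None, 0
    if start is not None:
        splits.append((start, size))
    return splits


MB = 1024 * 1024
stripes = [
    Stripe(0 * MB, 64 * MB, 0, 99),
    Stripe(64 * MB, 64 * MB, 100, 199),
    Stripe(128 * MB, 64 * MB, 200, 299),
    Stripe(192 * MB, 64 * MB, 300, 399),
]
# Predicate "250 <= col <= 350": the first two stripes are eliminated and
# the last two are merged into one split of 128MB starting at offset 128MB.
print(generate_splits(stripes, 250, 350, 256 * MB))
```

In this sketch an eliminated stripe simply ends the split being built, so the 
skipped byte range never appears in any split handed to a task.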

> Eliminate splits based on SARGs using stripe statistics in ORC
> --------------------------------------------------------------
>
>                 Key: HIVE-5632
>                 URL: https://issues.apache.org/jira/browse/HIVE-5632
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-5632.1.patch.txt, HIVE-5632.2.patch.txt, 
> orc_split_elim.orc
>
>
> HIVE-5562 provides stripe-level statistics in ORC. Stripe-level statistics 
> combined with predicate pushdown in ORC (HIVE-4246) can be used to eliminate 
> the stripes (and thereby splits) that do not satisfy the predicate condition. 
> This can greatly reduce unnecessary reads.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
