[ https://issues.apache.org/jira/browse/HIVE-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809810#comment-13809810 ]
Prasanth J commented on HIVE-5632:
----------------------------------

[~ehans] Row-group-level skipping (10,000 rows per row group) is already implemented as part of PPD (predicate pushdown). This patch adds stripe-level skipping: with this patch, a stripe will NOT be read if its min/max metadata prunes it.

To make this clearer: OrcInputFormat creates input splits based on the MapReduce configs mapred.min.split.size and mapred.max.split.size. The default mapred.min.split.size is 16MB and the default mapred.max.split.size is 256MB. If an ORC stripe is smaller than mapred.max.split.size, it is merged with the adjacent ORC stripe. Multiple ORC stripes are merged until mapred.max.split.size is reached, so a split can contain more than one ORC stripe.

Now, before merging stripes into a split, this patch checks whether each stripe's min/max statistics satisfy the predicate. If they do, the stripe is merged into the current split; otherwise the stripe is eliminated and a new split is started. The final list of input splits is then submitted for execution, which ensures that byte ranges (essentially ORC stripes) that are not required are not read.

> Eliminate splits based on SARGs using stripe statistics in ORC
> --------------------------------------------------------------
>
>                 Key: HIVE-5632
>                 URL: https://issues.apache.org/jira/browse/HIVE-5632
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-5632.1.patch.txt, HIVE-5632.2.patch.txt, orc_split_elim.orc
>
>
> HIVE-5562 provides stripe level statistics in ORC. Stripe level statistics
> combined with predicate pushdown in ORC (HIVE-4246) can be used to eliminate
> the stripes (thereby splits) that don't satisfy the predicate condition.
> This can greatly reduce unnecessary reads.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
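
For readers who want to see the split-building step from the comment in concrete form, here is a minimal sketch. It is not the code from the patch: the `Stripe` class, `may_match`, and `build_splits` are hypothetical names, the predicate is assumed to be a simple range `lo <= col <= hi` on one integer column, and stripes are assumed to be laid out contiguously in the file.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Stand-in for mapred.max.split.size (256MB default, per the comment).
MAX_SPLIT_SIZE = 256 * 1024 * 1024


@dataclass
class Stripe:
    """Hypothetical view of one stripe: byte range plus min/max of one column."""
    offset: int
    length: int
    col_min: int
    col_max: int


def may_match(stripe: Stripe, lo: int, hi: int) -> bool:
    """True if the stripe's [min, max] range overlaps the predicate lo <= col <= hi."""
    return stripe.col_max >= lo and stripe.col_min <= hi


def build_splits(stripes: List[Stripe], lo: int, hi: int,
                 max_split_size: int = MAX_SPLIT_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) splits, skipping stripes the predicate rules out."""
    splits: List[Tuple[int, int]] = []
    start = None  # offset of the split being built, or None if no split is open
    length = 0
    for s in stripes:
        if not may_match(s, lo, hi):
            # Stripe is eliminated: close the open split so its bytes are never covered.
            if start is not None:
                splits.append((start, length))
                start, length = None, 0
            continue
        if start is None:
            start, length = s.offset, s.length
        else:
            # Assumes stripes are contiguous, so the split extends to this stripe's end.
            length = s.offset + s.length - start
        if length >= max_split_size:
            splits.append((start, length))
            start, length = None, 0
    if start is not None:
        splits.append((start, length))
    return splits


# Three 100-byte stripes; the middle one cannot contain values in [0, 20].
example = [Stripe(0, 100, 1, 10), Stripe(100, 100, 100, 200), Stripe(200, 100, 5, 15)]
print(build_splits(example, lo=0, hi=20))  # [(0, 100), (200, 100)]
```

In the example, the eliminated middle stripe forces two separate splits, so its byte range is not covered by any split and is never read.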