[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268632#comment-14268632
 ] 

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

[~owen.omalley] The current patch has bloom filters at all 3 levels (row 
group, stripe, and file), and the filter size is kept constant across the 
levels. As a result, the fpp (false positive probability) for a stripe will 
be >0.05 (assuming >10k unique items), and for the file it will be much worse. 
With this we will get good row group elimination and reasonably good stripe 
elimination. I can drop the file level bloom filter, which we don't use for 
any purpose.
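
To make the effect of a constant filter size concrete, here is a rough sketch 
using the standard bloom filter approximation fpp = (1 - e^(-k*n/m))^k. It is 
not taken from the patch. The 0.05 target and the 10k items come from the 
comment above; the stripe and file distinct-value counts, and all function 
names, are assumptions chosen only for illustration.

```python
import math

def optimal_bits(n_items, fpp):
    """Bits needed to hold n_items at the target false positive probability."""
    return math.ceil(-n_items * math.log(fpp) / (math.log(2) ** 2))

def optimal_hashes(n_bits, n_items):
    """Number of hash functions that minimizes fpp for this size."""
    return max(1, round(n_bits / n_items * math.log(2)))

def expected_fpp(n_bits, n_hashes, n_items):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes

# Size the filter once, for 10k unique values at fpp 0.05 (row group level).
BITS = optimal_bits(10_000, 0.05)
HASHES = optimal_hashes(BITS, 10_000)

# Distinct-value counts per level are illustrative assumptions, not measurements.
LEVELS = {"row group": 10_000, "stripe": 20_000, "file": 200_000}
FPP_BY_LEVEL = {name: expected_fpp(BITS, HASHES, n) for name, n in LEVELS.items()}

for name, fpp in FPP_BY_LEVEL.items():
    print(f"{name:>9}: fpp ~ {fpp:.3f}")
```

Under these assumed counts the row group estimate stays close to the 0.05 
target, the stripe estimate is several times higher, and the file estimate is 
close to 1, which is why a same-sized file level filter eliminates almost 
nothing.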

The merging of disk ranges happens after we pick the row groups that satisfy 
the SARG (readPartialDataStreams() happens after pickRowGroups()). But we need 
the bloom filters before that, in order to eliminate row groups.
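
The ordering constraint can be sketched as two steps. This is a simplified 
illustration, not the ORC reader code: the function names only echo 
pickRowGroups() and the later range merging, a plain set stands in for each 
bloom filter (so there are no false positives here), and the byte ranges are 
made up.

```python
def pick_row_groups(row_group_filters, wanted):
    """Return one keep/skip flag per row group for the predicate col = wanted.

    Each filter is any object supporting `in`; a set is a stand-in for a
    bloom filter in this sketch.
    """
    return [wanted in f for f in row_group_filters]

def merge_disk_ranges(row_group_ranges, selected):
    """Coalesce the byte ranges of selected row groups that are adjacent."""
    merged = []
    for (start, end), keep in zip(row_group_ranges, selected):
        if not keep:
            continue
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # extend the previous range
        else:
            merged.append((start, end))
    return merged

# Hypothetical per-row-group filters and data byte ranges.
filters = [{1, 2}, {7, 9}, {7, 3}, {4}]
ranges = [(0, 100), (100, 200), (200, 300), (300, 400)]

selected = pick_row_groups(filters, wanted=7)  # step 1: needs the filters
to_read = merge_disk_ranges(ranges, selected)  # step 2: plans the data reads
```

Because step 2 consumes the output of step 1, the filters have to be 
available before any data ranges are merged and read.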

> BloomFilter in ORC row group index
> ----------------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
> HIVE-9188.4.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set 
> membership checking. We can use bloom filters in the ORC index for better 
> row group pruning. Currently, the ORC row group index uses min/max 
> statistics to eliminate row groups (and stripes as well) that do not satisfy 
> the predicate condition specified in the query. But in some cases, min/max 
> based elimination is not effective (unsorted columns with a wide range of 
> entries). Bloom filters can be an effective and efficient alternative for 
> row group/split elimination for point queries or queries with an IN clause.
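
As a small illustration of the case described in the issue, the sketch below 
compares min/max pruning with bloom filter pruning on an unsorted column. The 
TinyBloom class, its sizes, and the sample values are all hypothetical and 
unrelated to the ORC implementation; it only shows the general technique.

```python
import hashlib

class TinyBloom:
    """Minimal bloom filter: k hashed bit positions per value in an int bitset."""

    def __init__(self, n_bits=1024, n_hashes=4):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, value):
        # Derive k positions by hashing the value with k different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# Unsorted column: every row group spans nearly the whole value range.
row_groups = [[5, 900, 42], [3, 870, 611], [12, 999, 250]]
target = 611  # point query: col = 611

# Min/max statistics cannot rule out a row group whose range covers the target.
minmax_keep = [min(rg) <= target <= max(rg) for rg in row_groups]

blooms = []
for rg in row_groups:
    bloom = TinyBloom()
    for value in rg:
        bloom.add(value)
    blooms.append(bloom)
bloom_keep = [bloom.might_contain(target) for bloom in blooms]
```

Here min/max keeps every row group because 611 lies inside each range, while 
the bloom filters can skip a row group unless the value was added to it or a 
false positive occurs.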



