[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922154#action_12922154
 ] 

Joydeep Sen Sarma commented on HIVE-1721:
-----------------------------------------

i am not so sure about this.

consider a hash table which has a very large number of buckets (relative to the 
size of the elements in the hashtable).  a lookup inside the hashtable stops as 
soon as we hit an empty bucket. this requires us to only compute the 
hashcode(). if #buckets >> #elements - then for a miss - the likely average 
cost of the miss should only be the cost of the hashcode routine.

now consider a bloom filter. here we have to compute multiple hash codes (or at 
least one). on top of that - with the added bloom filters - there's an added 
cost for each positive (many hashcode computations). 

It's very clear from this reasoning that Bloom filters would be more expensive, 
not less, for small table joins. note that the hashtables in java do allow 
specification of number of buckets - so the strategy outlined here (of 
deliberately constructing a sparse hash table) is a feasible one.

Stepping back - this makes sense - because Bloom filters are designed for large 
data sets (or at least data sets that don't easily fit in memory) - not small 
ones (that fit easily in memory).

---

It would be more interesting to consider Bloom filters to cover join scenarios 
that cannot be performed with map join. for example - if the small table had 1M 
keys and map-join is not able to handle that large a hash table - then one can 
use bloom filters:

- filter (probabilistically) large table against medium sized table by looking 
up against bloom filter of medium-sized table (map-side bloom filter). (Note - 
this is not a join - just a filter)
- take filtered output and do sort-merge join against medium sized table (by 
now the data size should be greatly reduced and the cost of sorting would go 
down tremendously).

there's lots of literature around this - it's a pretty well known technique. 
it's quite different from what's proposed in this jira.

> use bloom filters to improve the performance of map joins
> ---------------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
>
> In case of map-joins, it is likely that the big table will not find many 
> matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which 
> can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in 
> the small table.
> Each element from the big table is first searched in the bloom filter, and 
> only in case of a positive match,
> the small table hash table is explored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to