[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddhartha Gunda updated HIVE-1721:
-----------------------------------

    Attachment: hive-1721.patch.txt

I created UDF and UDAF functions that can be used to build bloom filters and to query them.

Sample usage:

STEP 1 : Register the UDAF that builds the bloom filter.

    CREATE TEMPORARY FUNCTION bloom AS 'org.apache.hadoop.hive.contrib.genericudaf.GenericUDAFBuildBloom';

STEP 2 : Register the UDF that tests rows against the bloom filter.

    CREATE TEMPORARY FUNCTION bloom_filter AS 'org.apache.hadoop.hive.contrib.genericudf.GenericUDFBloomFilter';

STEP 3 : Build the bloom filter and store it in a table.

    CREATE TABLE 'NameOfBloomFilterTable' as SELECT bloom('HashType', 'NumElements', 'ProbabilityOfFalsePositives', column1, column2, ...) FROM 'TableName';

    'NameOfBloomFilterTable' - Name of the table in which the bloom filter is stored.
    'HashType' - Type of hash function used to build the bloom filter. It accepts two values: 'jenkins' and 'murmur'.
    'NumElements' - Number of elements in the table on which the bloom filter is being built.
    'ProbabilityOfFalsePositives' - Acceptable probability of false positives.

    Example :
    CREATE TABLE tblBloom as SELECT bloom('jenkins', '20', '0.1', id, str) FROM tblOne;

STEP 4 : Add the file that backs the bloom filter table.

    ADD FILE 'PathOfBloomFilterTable';

    Example :
    ADD FILE /user/hive/warehouse/tblbloom40/000000_0;

STEP 5 : Sample use cases.

    SELECT *, bloom_filter('jenkins', '20', '0.1', '000000_0', id, str) FROM Table1;

    SELECT * FROM Table1 INNER JOIN Table2 ON Table1.id = Table2.id WHERE bloom_filter('jenkins', '20', '0.1', '000000_0', Table1.id, Table1.str);

> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>              Labels: gsoc, gsoc2012, optimization
>         Attachments: hive-1721.patch.txt
>
>
> In case of map-joins, it is likely that the big table will not find many
> matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which
> can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in
> the small table.
> Each element from the big table is first searched in the bloom filter, and
> only in case of a positive match,
> the small table hash table is explored.
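
The lookup order described in the issue (bloom filter first, hash table only on a positive match) can be sketched roughly as follows. This is a minimal illustration in Python, not the code in hive-1721.patch.txt: the BloomFilter class, the map_join helper, and the `probes` counter are hypothetical names, the sizing uses the textbook formulas (which may differ from what the patch does with 'NumElements' and 'ProbabilityOfFalsePositives'), and SHA-256 double hashing stands in for the 'jenkins' and 'murmur' hash types.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter; illustrative only, not the patch's implementation."""

    def __init__(self, num_elements, false_positive_rate):
        # Textbook sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        self.num_bits = max(1, math.ceil(
            -num_elements * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / num_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Assumption: double hashing over a SHA-256 digest replaces jenkins/murmur.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def map_join(big_rows, small_rows, key, false_positive_rate=0.1):
    """Join on `key`, probing the small-table hash table only on a filter hit."""
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)

    bloom = BloomFilter(max(1, len(hash_table)), false_positive_rate)
    for small_key in hash_table:
        bloom.add(small_key)

    joined, probes = [], 0
    for row in big_rows:
        if not bloom.might_contain(row[key]):
            continue  # filter says the key is not in the small table
        probes += 1  # count hash-table lookups that were actually performed
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})
    return joined, probes
```

Because a bloom filter never reports a false negative, the join result is the same as with a plain hash-map lookup per row; only the number of hash-table probes changes, and false positives cost one wasted probe each.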