[ https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hongbo Xu resolved HADOOP-11829.
--------------------------------
    Resolution: Invalid

> Improve the vector size of Bloom Filter from int to long, and storage from memory to disk
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11829
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11829
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Hongbo Xu
>            Assignee: Hongbo Xu
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int hashType)
> This filter can hold at most about 900 million objects when the false-positive probability is 0.0001, and it needs 2.1 GB of RAM.
> In my project I needed to build a filter with a capacity of 2 billion, which needs 4.7 GB of RAM; the required vector size is 38,340,233,509, outside the range of int. I did not have that much RAM, so I rebuilt a big Bloom filter whose vector size is typed as long, split the bit data into several files on disk, and distributed the files to the worker nodes. The performance is very good.
> I think I can contribute this code to Hadoop Common, along with a 128-bit hash function (MurmurHash).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
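
For context on the sizing above: the standard Bloom-filter formula m = -n * ln(p) / (ln 2)^2 gives, for n = 2 * 10^9 keys at p = 0.0001, m = 38,340,233,509 bits, exactly the figure quoted, and far beyond Integer.MAX_VALUE (2,147,483,647), so an int-typed vectorSize cannot address it.

A minimal sketch of the scheme the reporter describes (a long-indexed bit vector segmented across memory-mapped files on disk, probed with double hashing) might look like the following. The class name DiskBackedBloomFilter, the 1 GiB segment size, and the splitmix64-style mixer are illustrative assumptions, not the reporter's actual code; the actual proposal pairs the filter with a 128-bit MurmurHash, which would supply two independent 64-bit hash halves directly.

{code:java}
// Hypothetical sketch only; not part of the org.apache.hadoop.util.bloom API.
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class DiskBackedBloomFilter {

  // 2^33 bits = 1 GiB per segment file, so each MappedByteBuffer stays
  // well under Java's 2 GiB per-mapping limit.
  private static final long BITS_PER_SEGMENT = 1L << 33;

  private final long vectorSize; // long, so > Integer.MAX_VALUE bits is fine
  private final int nbHash;
  private final MappedByteBuffer[] segments;

  public DiskBackedBloomFilter(File dir, long vectorSize, int nbHash) throws IOException {
    this.vectorSize = vectorSize;
    this.nbHash = nbHash;
    int nSegments = (int) ((vectorSize + BITS_PER_SEGMENT - 1) / BITS_PER_SEGMENT);
    this.segments = new MappedByteBuffer[nSegments];
    for (int i = 0; i < nSegments; i++) {
      long bits = Math.min(BITS_PER_SEGMENT, vectorSize - (long) i * BITS_PER_SEGMENT);
      long bytes = (bits + 7) / 8;
      try (RandomAccessFile raf =
          new RandomAccessFile(new File(dir, "bloom-" + i + ".bits"), "rw")) {
        raf.setLength(bytes);
        // The mapping remains valid after the channel is closed.
        segments[i] = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
      }
    }
  }

  public void add(byte[] key) {
    long h1 = hashBytes(key, 0x9e3779b97f4a7c15L);
    long h2 = hashBytes(key, 0xc2b2ae3d27d4eb4fL) | 1L; // odd stride
    for (int i = 0; i < nbHash; i++) {
      setBit(bitIndex(h1, h2, i));
    }
  }

  public boolean membershipTest(byte[] key) {
    long h1 = hashBytes(key, 0x9e3779b97f4a7c15L);
    long h2 = hashBytes(key, 0xc2b2ae3d27d4eb4fL) | 1L;
    for (int i = 0; i < nbHash; i++) {
      if (!getBit(bitIndex(h1, h2, i))) {
        return false;
      }
    }
    return true;
  }

  // Kirsch-Mitzenmacher double hashing: bit_i = (h1 + i*h2) mod vectorSize.
  private long bitIndex(long h1, long h2, int i) {
    return ((h1 + (long) i * h2) & Long.MAX_VALUE) % vectorSize; // non-negative
  }

  private void setBit(long bit) {
    MappedByteBuffer seg = segments[(int) (bit / BITS_PER_SEGMENT)];
    int byteIdx = (int) ((bit % BITS_PER_SEGMENT) >>> 3);
    seg.put(byteIdx, (byte) (seg.get(byteIdx) | (1 << (int) (bit & 7))));
  }

  private boolean getBit(long bit) {
    MappedByteBuffer seg = segments[(int) (bit / BITS_PER_SEGMENT)];
    int byteIdx = (int) ((bit % BITS_PER_SEGMENT) >>> 3);
    return (seg.get(byteIdx) & (1 << (int) (bit & 7))) != 0;
  }

  // Stand-in 64-bit mixer (splitmix64 finalizer, fed byte by byte); the JIRA
  // proposes a 128-bit MurmurHash here, yielding two 64-bit halves at once.
  private static long hashBytes(byte[] key, long seed) {
    long h = seed;
    for (byte b : key) {
      h = mix(h ^ (b & 0xffL));
    }
    return h;
  }

  private static long mix(long x) {
    x ^= x >>> 30; x *= 0xbf58476d1ce4e5b9L;
    x ^= x >>> 27; x *= 0x94d049bb133111ebL;
    return x ^ (x >>> 31);
  }
}
{code}

Segmenting at 1 GiB keeps every MappedByteBuffer below Java's 2 GiB per-mapping limit, and the per-segment files are exactly the units that could be distributed to worker nodes as the reporter describes.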