On Wed, Sep 11, 2019 at 11:06 AM Claude Warren <cla...@xenei.com> wrote:
> First it is important to remember that Bloom filters tell you where things > are NOT. Second it is important to understand that Bloom filters can give > false positives but never false negatives. Seems kind of pointless I know > but consider the case where you have 10K buckets that may contain the item > you are looking for. If you can reduce the number of buckets you are > searching you can significantly speed up the search. In a case like this a > bloom filter could be used "in front" of each bucket as a gatekeeper. When > ever an object goes in the bucket the objects bloom filter is added to the > bucket bloom filter. If you want to search the 10K buckets for an item > then you build the bloom filter for the item you are looking for and check > the bloom filter on each bucket. If the filter says that the item is not > in the bucket then you can skip that bucket, if the filter says it is in > the bucket you search the bucket to verify that it is not a false > positive. A common use for bloom filters is to determine if an expensive > call should be made. For example many browsers have a bloom filter that > comprises all the known bad URLs (ones that serve malware, etc). When the > URL is entered in the browser it is checked against the bloom filter. If > it is not there the request goes through as normal. If it is there then > the browser makes the expensive lookup call to a server to determine if the > URL really is in the database of bad URLs. > > So a bloom filter is generally used to front a collection to determine if > the collection should be searched. And as has been pointed out it doesn't > make much sense to use it in front of an in-memory hash table. However, > applications like Cassandra and Hadoop use bloom filters for various > reasons. I have recently been made aware of an Apache Incubator project > that wants to implement bloom filters as part of their project. Other uses > for bloom filters include sharding data. There is a measure of difference > between filters called a hamming distance. This is the number of bits that > have to be "flipped" to turn one filter into another, and is very similar > to Hamming measures found in string and other similar comparisons. By > using the hamming value it is possible to distribute data among a set of > buckets by simply putting the value in the bucket that it is "closest" to > in terms of Hamming distance. Searcing takes place as noted above. > However this has some interesting properties. For example you can add new > buckets at any time simply by adding an empty bucket and bloom filter to > the collection of buckets and the system will start filling the bucket as > appropriate. In addition if a bucket/shard becomes "full", where "full" is > an implementation dependent decision (e.g. the index on a DB table reaches > the inflection point where performance degradation begins), you can pull a > bucket out of consideration for inserts but still search it without > significant stress or change to the system. > > Internally Bloom filters are bit vectors. The length of the vector being > determined by the number of items that are to be placed in the bucket and > the acceptable hash collision rate. There is a function that will > calculate the length of the vector and the number of functions to use to > turn on the bits.[1] In general you build a bloom filter by creating a > hash and using the modulus of that to determine which bit in the vector to > turn on. You then furn a second hash, usually the same hash function with > a different seed to determine the next bit and so on until the number of > functions has been executed. Importantly, there are comments tin the > Cassandra code that describe a new and much faster way of doing this using > 128-bit hashes and splitting them into a pair of longs. To check > membership in a bloom filter you buid the bloom-filter for the target (T - > the thing we are looking for) and get the filter for the candidate (C - the > bucket) and evaluate T&C = T > if it evaluates as true there is a match if it not then T is guaranteed not > to be in the bucket. > > There are several of us at Apache that work on bloom filters and we have > been unable to locate an open source library that is under the Apache or > similar license. I have done work on a concept call a proto-bloom filter > that does the hashing early and then makes it faster to generated concrete > bloom filters of various sizes, thus enabling a more efficient layering of > filters. > > Several of us have done research into ways to index filters so that if you > have a collection of filters you can quickly locate the candidates. This > is not as simple as it sounds due to the way in which filters are checked > and the issues with over filled filters yielding high false positive > counts, in addition the check is so fast that the over head for most > indexing eats up any increase in speed. My research has shown that for > filter collections of less than 1000 it is always faster to do a linear > search through an array than any other means. Above 1000 entries there are > techniques that can yield faster evaluation in some cases. > > The long and short of this is that there is no good unencumbered open > source library available at the current time. Myself and several others, > in conversation here at ApacheCon, have expressed interest in creating such > a library. We have fairly mature code that we are willing to contribute > along with code that embodies new thinking in the bloom filter arena (like > proto-bloom filters). We just need a space within the Apache family to > host it. The code base seems to small to be a separate project and so we > come to Apache Commons seeking a home. > > Claude > Hi Claude, Thank you for the explainer :-) quite helpful. Gary > > > > > [1] https://hur.st/bloomfilter/ > > On Wed, Sep 11, 2019 at 12:36 PM Gary Gregory <garydgreg...@gmail.com> > wrote: > > > I would like to know more. I am curious since looking up whether an > element > > is in a set is done via a hash code. How do you do better than that? > > > > Gary > > > > On Tue, Sep 10, 2019, 16:51 Bruno P. Kinoshita <ki...@apache.org> wrote: > > > > > +1 Collections sounds like a good place for a bloom filter. > > > > > > Bruno > > > > > > On Wednesday, 11 September 2019, 8:00:45 am NZST, Jochen Wiedmann < > > > jochen.wiedm...@gmail.com> wrote: > > > > > > Hi, Claude, > > > > > > having read, what a bloom filter is, a subproject sounds unnecessary > > > to me. I'd recommend, that you contribute your code to Commons > > > Collections, which seems to me to be a logical target. > > > > > > Jochen > > > > > > On Tue, Sep 10, 2019 at 8:45 PM Claude Warren <cla...@xenei.com> > wrote: > > > > > > > > Having spoken with several people at ApacheCon, I would like to see a > > > > bloomfilter sub project. I have code that is already under Apache > > > License > > > > that I am willing to contribute as the basis The goal of the > > sub-project > > > > would be to produce a reference implementation that could be used by > > > other > > > > projects that desire to have use bloom filters and bloom filter based > > > > collections. > > > > > > > > Is there any objection to doing this? Other than asking here, what > is > > > the > > > > proper path to get a sub-project created, What does the Commons PMC > > > > require? > > > > > > > > Any assistance and comments would be apprecieated. > > > > Claude > > > > > > > > -- > > > > I like: Like Like - The likeliest place on the web > > > > <http://like-like.xenei.com> > > > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > -- > I like: Like Like - The likeliest place on the web > <http://like-like.xenei.com> > LinkedIn: http://www.linkedin.com/in/claudewarren >