A Bloom filter is a set of bits: "The hash area is considered as N individual addressable bits, with addresses 0 through N - 1. It is assumed that all bits in the hash area are first set to 0. Next, each mes- sage in the set to be stored is hash coded into a number of distinct bit addresses, say al, a2, ..., ad. Finally, all d bits addressed by al through ad are set to 1." -- Burton Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors"[1]
Claude [1] http://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf On Sat, Oct 19, 2019 at 11:40 PM Claude Warren <cla...@xenei.com> wrote: > As an experiment I attempted to implement a complete BloomFilter (see > BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728) > using only IntStream and merge( BloomFilter ). It is possible, though > probably not the fastest implementation. > > We could define something like: > ByteBuffer getBits() > > or a method that returns a stream of Boolean (this will be very long in > processing). > or a method that returns a stream of ByteBuffers (the concatenation of > which is the entire set of bits) > or a method that returns an IntStream (the concatenation of which is the > entire set of bits). > > I think that most implementations will have some internal structure that > can generate the above. The wider the item in the stream the faster we can > move the bits, the better the average performance will be. > > > Claude > > > > > On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski <gillese...@gmail.com> > wrote: > >> Hello. >> >> 2019-10-19 16:20 UTC+02:00, Claude Warren <cla...@xenei.com>: >> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <gillese...@gmail.com> >> > wrote: >> > >> >> Hi. >> >> >> >> >>> [...] >> >> > > >> >> > > Maybe I was not clear enough: I'm not saying that we should prefer >> >> > > some representation (of the state) over another; only that the how >> >> > > the state is represented externally need not be the same as the >> >> internal >> >> > > representation. >> >> > > >> >> > > But if the state is fully encapsulated by a "BitSet", then internal >> >> > > and >> >> > > external can be the same, making easier both for developers and for >> >> > > users. So let's have indeed a "getBitSet()" that returns a copy of >> >> > > the >> >> > > (private final) instance field (of type "BitSet") used to represent >> >> > > the >> >> > > state of a "BloomFilter". >> >> > > >> >> > > >> >> > I agree with most of this statement. However, I think the use of >> BitSet >> >> for >> >> > the internal representation is questionable. >> >> >> >> Hence, that settles the issue of "class" versus "interface": >> >> "BloomFilter" >> >> must be an interface. >> >> >> >> *I am still not convinced. How do you then provide consistent >> >> implementations of the distance functions? We could use “default” >> >> implementations but that seems like a misuse of the intent for >> “default”, >> >> though perhaps I misunderstand that.* >> >> What's the issue with having "distance(BloomFilter other)" >> in the interface? >> >> >> >> >> >> >> [snip] >> > >> >> > In addition, there are researchers that do not use >> >> > BitSet because it is too slow or too bloated. >> >> >> >> Important point. >> >> [I thought that "BitSet" was evoked because it perfectly fitted the >> >> bill.] >> >> >> > >> > BitSet fits the bill in terms of required data read functionality. >> >> Yes; but not in terms of performance for certain use cases. >> Hence why it cannot be a common denominator for all >> implementations. >> >> > >> > [snip] >> > >> >> > We can use the >> >> > Implementation in our classes and use the interface where possible >> but >> >> > it >> >> > leaves open the possibility of switching in a version of >> RoaringBitmap >> >> > or >> >> > similar without changing the external API. It is this interface that >> is >> >> the >> >> > “external reference implementation of state”. I think I would like to >> >> call >> >> > this interface BitSetI and will refer to it this way later. I think >> it >> >> > should contain the following methods as defined in BitSet: and(), >> >> andNot(), >> >> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(), >> >> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(), >> >> xor(). >> >> >> >> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and >> >> we must discuss separately whether [Collections] wants to host such an >> >> API. >> >> >> > >> > I think a core question needs to be answered: Is a Bloom filter an >> > extension of functionality on a BitSet (isA) or does it simply use the >> > functionality of a BitSet (hasA). >> >> AFAICT, neither. >> >> > If find myself of two minds on this >> > quesiton at the start, but seem to come down on the isA side after >> analysis >> > of how the Bloom filter is used. To that end BloomFilter should provide >> the >> > data read methods of the BitSet functionality. What I called BitSetI >> > before. >> >> "data read" and "BitSetI" are quite different. >> As said before, the former is implementation-dependent and a >> user's business (as long as the developers provide a suitable >> accessor of course). >> >> >> >> >> A priori, how the classes that implements the "BloomFilter" interface >> >> represents the "set of bits" is an implementation detail. Even that >> >> there >> >> is a "set of bits" is a detail to *users* of the interface. >> >> Of course, a practical question is whether the representation should be >> >> "pluggable". If not, we could change to using one or another from >> those >> >> classes that you mentioned (and change it from version to version). >> >> In any case, I think that this would be a feature to be left for later >> >> (and >> >> that could be added in a compatible way, if nothing of the "state" is >> >> leaked >> >> into the public API). >> >> >> > >> > To my mind the “set of bits” interface is what I called BitSetI. Again, >> > this defines how the Bloom filter presents itself as a “set of bits” not >> > how it implements it. >> >> IMO, the "BitSetI"/"set of bits" is an implementation *detail*. >> IOW: a "BloomFilter" is *not* a "set of bits" but *can* be represented >> as one. >> >> Under the hood, it of course needs some kind of "set of bits" to perform >> its operations (i.e. those methods that defines the public API), but that >> certainly does not mean that an application developer would be well >> advised to use a "BloomFilter" in order to represent, say, a number >> (just because a number could also be viewed as a "set of bits")... >> >> > >> > [snip] >> > >> >> > The implementation may not want to use the BitSet class because the >> >> filter >> >> > itself can be large. For example in DNA matching it is not uncommon >> for >> >> the >> >> > Bloom filter to be 100s of MB in size. This also means we don’t want >> to >> >> > make copies if we don’t have to. (perhaps I should add DNA matching >> to >> >> the >> >> > Use cases to capture the large number of bits). >> >> >> >> So this use case would entail that the "BloomFilter" interface must >> *not* >> >> mandate a "toBitSet()" method. >> >> >> > >> > Agreed. >> > >> > >> >> > > [...] >> >> > > >> >> > >> >> > I agree with most of this statement. I still don’t like forcing a >> >> standard >> >> > BitSet implementation onto all instances of Bloom filter >> >> >> >> +1 >> >> >> >> > when it is the >> >> > logical construct (BitSetI) that we are looking for. >> >> >> >> -1 >> >> [IIUC; see my comments above.] >> >> >> > >> > Somehow I think we are in violent agreement here, but perhaps are >> talking >> > past each other some how. >> >> Yes, and no. We are indeed converging, but I more and more >> convinced that the concept of "BloomFilter" should not be based >> on how we think serialization should work or how to practically >> implement the API. >> >> We must start from the "outside view" of the API, i.e. define what >> services the "BloomFilter" must provide. >> >> >> > > >> >> > > > There are several other examples such as layered Bloom >> >> > > > filters ( >> >> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters) >> >> > > > and attenuated Bloom filters ( >> >> > > > >> https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters) >> >> > > where >> >> > > > the internal structure deviates from the BitSet model but that >> >> > > > still >> >> need >> >> > > > to present themselves as AbstractBloomFilters do. I think that >> >> Abstract >> >> > > > class is the correct pattern here. >> >> > > >> >> > > Not sure: "abstract" is adequate when several subclasses need to >> >> > > override behaviours; whereas, here the behaviour is the same but >> >> > > the state is expanded. >> >> > > >> >> > > >> >> > I think abstract because the methods to perform the calculations that >> >> > are >> >> > currently in BloomFilterFunctions would be implementable against >> >> > BitSetI. >> >> > Those are the functions that go into the abstract class (along with >> >> BitSetI >> >> > methods). >> >> >> >> Ditto: these are details from the POV of "BloomFilter" users; of >> course, >> >> we need code that will do the work on the whatever represents the "sets >> >> of bits", but at first sight I don't think that we need to advertise a >> >> "BitSetI" >> >> (even a "protected" one) in an abstract class, because it's unlikely >> that >> >> any and all existing external implementations of "set of bits" will >> >> comply >> >> with an interface defined here. >> >> >> > Reading through various implementations of Bloom filters as well as >> > papers >> > describeing new types of Bloom filters and ways to index them and ways >> to >> > extract data from them there are a set of low level bit operations (xor, >> > or, cardinality, and, cardinality of the results of xor, or and and) >> that >> > if supported make all that work possible. I will create a separate >> document >> > that outlines what those are so that we can discuss them in detail. >> >> You are certainly more aware than I am of all the "services" that >> could be necessary in order to care for all the interesting use cases. >> But how about focusing on the *minimal* API first? >> What I suggest is to start from what you initially presented as the >> "BloomFilter" interface ("match", "merge", etc.), and make that >> work based on the layout ideas contained the preliminary Java code >> posted on JIRA (issue COLLECTIONS-728). >> >> > >> > [snip] >> > >> >> >> >> > I don’t know of any such usage, however, nesting Bloom filters is >> done >> >> > in >> >> > Attenuated filters and Layered Bloom filters use a number of Bloom >> >> > filter >> >> > implementations internally. >> >> >> >> What do those use cases require from the "BloomFilter" interface? >> >> >> > >> > What those implementations require are the standard match() and merge() >> as >> > well as some love level bit operations (see aboce). I will add this >> > analysis to the separate document I committed to above. >> >> As per my preceding comment, I think that we should make the >> basics as simple as simple, and show that it delivers the intended >> functionality. >> Then, more advanced use cases might well require addition to the >> API, but it will be clear for everyone why the are needed (based on >> said use-cases). >> In effect, maybe some of the additional requirements could be >> defined in an *extension* of the "BloomFilter" interface; a >> "MutableBloomFilter" interface naturally comes to mind. >> From what you said above, there might also be >> ---CUT--- >> interface BloomFilterBitSetOperations extends BloomFilter { >> BloomFilterBitSetOperations and(BloomFilterBitSetOperations other); >> BloomFilterBitSetOperations xor(BloomFilterBitSetOperations other); >> } >> ---CUT--- >> >> Some other functionalities (mostly the recurrent issue of serialization) >> might be readily solved at the level of specific implementations; e.g. >> the "BitSetBloomFilter" which I referred to before will certainly >> provide "getBitSet()" and "fromBitSet(BitSet)" methods that are >> sufficient for moving instances around. >> >> >> > > > [...] >> >> > > >> >> > Number of bits is not implicitly conveyed by getBitSet() as the >> BitSet >> >> will >> >> > only grow to the size needed to capture the highest bit currently on. >> >> > So >> >> > the number of bits may be 72 but the highest on bit might be 50. >> >> >> >> OK; doesn't matter now since a common "BitSet" state has been >> >> ruled out. >> >> I get that "numberOfBits()" is one of the methods defined by >> >> the "BloomFilter" interface. >> >> >> >> > >> >> > Hash, number of functions (k), number of items (n), and number of >> bits >> >> (m) >> >> > define the system in which the filter is expressed, call this the >> >> “Shape”. >> >> > It is not always necessary to convey the Shape when conveying the >> bloom >> >> > filter. >> >> >> >> Well, if it is *ever* necessary, then "BloomFilter" must declare a >> >> "shape()" >> >> method. Hence, it could be worth defining a "Shape" subinterface (?). >> >> >> >> >> > See other discussion thread on shape. >> >> I've used it to arrive to the "BloomFilter.java" proposal attached >> on JIRA. >> >> > >> > [snip] >> > >> >> > >> >> > > I'm lost here; I don't know what is "the" reference state, since a >> >> > > single structure cannot for example represent both a standard >> >> > > filter and a counting filter... >> >> > > >> >> > > >> >> > Think BitSetI. >> >> >> >> Still: No. >> >> IMHO. ;-) >> >> >> > I meant that the bit set functions defined by BitSetI define the state >> of >> > a standard Bloom filter. Since all Bloom filter types are and >> > implementation of the concept of Bloom filter, all of them must be able >> to >> > provide the data requested from the BitSetI functionality. >> >> I get the rationale, but I'm not able to see that in the textual >> description of use cases. >> >> Please note that I wish to avoid "preemptive" bloating of the API, >> which, after the fact, is nearly impossible to fix in "Commons" >> codes. >> [Among other problems, this ever increasing size of the codebase, >> despite the identified issues, is what led to "Commons Math" >> unfortunate fate: specific case in point in the matrix functionality >> there, whose bloated interface discouraged potential contributions >> targeting at 3D geometry.] >> >> >> IIUC, for a Bloom filter, >> >> "state" = "shape" + "set of bits" >> >> >> >> *See separate shape discussion.* >> > >> > >> >> > > > >> >> > > >> > In addition, the abstract Bloom filter might implement the >> “data >> >> > > >> > read” methods from BitSet directly. This would allow more >> >> efficient >> >> > > >> > implementation of several methods that currently depend upon >> the >> >> > > >> > creation >> >> > > >> > of a BitSet and the manipulation of same. >> >> > > >> >> >> > > >> Sure; if any and all variants of a "BloomFilter" implementation >> >> > > >> can >> >> > > >> be created from a "BitSet", then such a factory method could be >> >> > > >> part of the "Builder" interface: >> >> > > >> >> >> > > >> public interface BloomFilter { >> >> > > >> // Core methods... >> >> > > >> >> >> > > >> public interface Builder { >> >> > > >> public Builder from(BitSet s); // Factory method. >> >> >> >> As above, for large filters, this probably is not a good requirement. >> >> Should it be >> >> public Builder from(BooleanStream s); >> >> ? >> >> ["BooleanStream" is not provided by the JDK...] >> >> >> > >> > BooleanStream is not but IntStream is, in this case we define the >> integer >> > stream to be the indexes of the bits that are on in the filter. I have >> > seen this in some implementations. >> >> Maybe not the most efficient way to convey the state if more >> than 1/32 of the bits are set. >> >> >> > > >> public BloomFilter build(); >> >> > > >> // Other builder methods. >> >> > > >> } >> >> > > >> } >> >> > > >> >> >> >> > >> > [snip] >> > >> > >> >> > One final issue. There was recently a post on the pull request asking >> >> > why >> >> > the Bloom filter is an immutable object. We should probably explore >> why >> >> or >> >> > why not. >> >> >> >> Part of the answer is universal: because it is trivial to ensure a >> >> consistent >> >> state (even in the face of concurrent usage). >> >> The other part of is the justification, which you gave on GitHub, based >> >> on >> >> expected usage. >> >> >> >> > The biggest argument against is that the merge() method would >> requires >> >> that >> >> > the data be duplicated and it makes it hard to do in place updates. >> For >> >> > example when updating a “bucket” filter (as defined in the Use >> cases). >> >> >> >> Now that "BloomFilter" is going to be a interface, some of its >> >> implementations >> >> could be mutable (e.g. if it would be a requirement for its target >> >> usage). >> >> >> >> However, arguments about performance should be backed with >> benchmarks... >> >> >> >> In general, I think that code will be more robust if low-level classes >> >> are >> >> immutable; and higher-level is responsible for managing instances (i.e. >> >> replacing them if outdated). >> >> >> >> > But if it is not immutable then the hashCode value changes making it >> >> > unsuitable for Hash sets. I think that there are classes in Commons >> >> > that >> >> > have similar restrictions, and I think that documenting the issue in >> >> > the >> >> > class javadoc as well as the hashCode() and equals() (assuming we >> >> actually >> >> > override them) would be acceptable. >> >> >> >> Are there use cases where Bloom filters would be stored in a "HashSet"? >> >> >> >> >> > I had a use case in an index implementation but it can be worked around. >> > Having the filter be mutable seem to be a common implementation. I am >> > leaning in this direction now because immutability makes some usage more >> > compex without a strong benefit. >> >> I'd tend to disagree, especially if the simple/basic implementations >> have some uses in multi-threaded scenarios. >> In particular, the copying of small objects will probably be more >> efficient >> than mutating them. >> >> > But I can go either way. >> >> With "BloomFilter" as an interface, we can go both ways: just add a >> new implementation when a use-cases justifies it (through benchmarks). >> >> Best regards, >> Gilles >> >> > >> > All the best, >> > Claude >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >> For additional commands, e-mail: dev-h...@commons.apache.org >> >> > > -- > I like: Like Like - The likeliest place on the web > <http://like-like.xenei.com> > LinkedIn: http://www.linkedin.com/in/claudewarren > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren