I have added a list of Use Cases ( https://github.com/Claudenw/commons-collections/blob/master/src/main/java/org/apache/commons/collections4/bloomfilter/Usage.md) I expect that these will end up in documentation somewhere.
On Wed, Oct 16, 2019 at 2:08 AM Gilles Sadowski <gillese...@gmail.com> wrote: > Hi. > > 2019-10-15 20:05 UTC+02:00, Claude Warren <cla...@xenei.com>: > > On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski <gillese...@gmail.com> > > wrote: > > > >> Hello. > >> > >> > >> > >> > Furthermore, > >> > the other potential users and supporters have not responded to any > >> > communication about this proposal so I am floundering on that front > >> > too. > >> > >> Who are they? > >> > > > > Developers I have worked with or know of that have expertice utilizing > and > > analyzing Bloom filters in various projects and papers. Apache developers > > that are considering adding Bloom filters to their projects. > > For sure, it would clear the path (to a design) if more people > would lay out (usage) requirements. > > > > >> > >> > What I want to do is take a step back. Let’s remove the collection > >> packages > >> > from consideration and work out the base package first. > >> > >> +1 > >> > >> > With that in mind I > >> > would like to present reasoning behind the classes as they now stand > >> > and > >> > suggest a way forward to get the core classes in before adding the > >> > collections. > >> > >> For sure, I appreciate your willingness to provide more explanations. > :-) > >> > >> > As discussed earlier Bloom filters are a concept. In most peoples > >> > mind’s > >> > they are bit vectors with specific named functions (e.g. match => A > and > >> B = > >> > A, hammingValue => cardinality(this)). This idea of the filter as a > bit > >> > vector led me to using a BitSet. Originally the BloomFilter was > >> > serializable but valid arguments against that and toward making the > >> > internal representation visible enought to serialize that lead to > >> > making > >> > the BitSet public. > >> > >> Maybe I'm missing your point, but I don't see a direct connection > >> between persistence of a "BloomFillter" instance and the visibility > >> of "BitSet" (if that is used to represent its internal state). > >> > >> Realistically, when we “persist” a Bloom filter we are preserving the > > internal state. If the Bloom filter itself is not serializable then the > > internal state has to be preserved somehow. You made a comment on Sep 23, > > 2019 when we were discussing the serialization on the “New Sub-project > > Proposal.” topic in which you said “At first sight, I'd say that > > serialization is out-of-scope (we should let application developers deal > > with that using the available accessors).” Perhaps I have misunderstood > but > > when I read that a light went on. What has to happen is that enough of > the > > internal state has to be exposed for the developer to preserve that state > > and be able to reinstantiate it later. As the state is really an bit > array > > I figured the most logical thing would be to expose the BitSet. > > Alternatively we could expose a ByteBuffer or any of a number of > structures > > that can represent a bit array. However, the BitSet is, semantically > > speaking, what we are exposing. > > Maybe I was not clear enough: I'm not saying that we should prefer > some representation (of the state) over another; only that the how > the state is represented externally need not be the same as the internal > representation. > > But if the state is fully encapsulated by a "BitSet", then internal and > external can be the same, making easier both for developers and for > users. So let's have indeed a "getBitSet()" that returns a copy of the > (private final) instance field (of type "BitSet") used to represent the > state of a "BloomFilter". > > >> > In addition to the normal Bloom filter methods > >> > (hammingValue, match, etc.) there are some methods (log, cosine > >> > distance, > >> > jaccard distance, etc.) that are used by researchers and other > >> > attempting > >> > to do some work with indexing and such for large collections of > >> > filters. > >> > >> IIUC, you indeed distinguish between core functionality (i.e. "normal > >> Bloom filter methods") and additional support for (?)... > >> If so, "core" should indeed be implemented first, with maximum > >> encapsulation (even if it would seem to prevent "research" usage). > >> > >> Indeed I distinguish. I distinguish to align various functionality with > > external documentation. For example in Bloom’s paper he does not talk > about > > Hamming values a term and concept that is found in every Bloom filter > > implementation I have seen and which is frequent in the literature. So a > > “Strict” implementation of a Bloom filter might only have a single method > > “match()” and still be considered an implementation of a Bloom filter, > but > > it would not be of much use in modern systems. Realistically, there are a > > number of methods that are commonly found in Bloom filter > implementations. > > I believe that at a minimum they are “match()”, “merge()” and > > “hammingValue()”. > > No problem with that: API should thus contain those instance methods. > That's the kind of information we need from users of the functionality. > > >> > Finally there is the question of “normal” usage. In his paper on > >> > “Compressed Bloom Filters”[1] (an implementation we do not have yet) > >> > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is > >> both > >> > a data structure being used at the proxies, and a message being passed > >> > between them”. He is discussing the issue of using bloom filters in a > >> > distributed cache where the proxies distributed their bloom filter so > >> that > >> > all the proxies know which proxy might have the document. So the Bloom > >> > filter needs both an internal representation and a mechanism to pass > >> > the > >> > data between systems that use it. It seems obvious that the BitSet > >> > works > >> as > >> > the mechanism to pass the data between systems, though an immutable > >> version > >> > would be preferred. > >> > >> So it seems we can have a Bloom filter state represented by a > >> "private" and "final" instance of "BitSet". Is there any reason that > >> the implementation in [Collections] would ever need to use several > >> representations of its state? > >> If not, I'd think that "BloomFilter" can be a concrete class (with, for > >> now, a "BitSet" as its state). > >> > >> The counter examples that come to mind is not within a single Bloom > >> filter > > type. A counting bloom filter tracks the number of times a bit has been > set > > so that it can do deletions (something standard Bloom filters have a hard > > time with). The implementation in the code base simply extends the > > StandardBloomFilter implementation and adds a sparse array of Integers as > > the counts. > > So, in my understanding, it is an extension of the above "BitSet" > state, and thus a valid case for inheriting from "BloomFilter" (as > a concrete, not "abstract", class. > > > However, the counting Bloom filter does not need the BitSet > > from StandardBloomFilter. It can generate one on demand. > > It can, but why would it reimplement the "getBitSet()" method, > (and more) rather than just delegate to its base class (i.e. no > duplication at all in the OO paradigm)? > > > The counting Bloom > > filter needs to be able to determine if an AbstractBloomFilter matches > its > > pattern. It can do this internally without using a BitSet by scanning the > > sparse array (though a bitset might prove faster). Finally the counting > > Bloom filter can be merged into another bloom filter just like any other > > AbstractBloomFilter can. It does that by basically turning on the bits in > > the AbstractBloomFilter that are indexes for the entries in the internal > > sparse array. > > IIUC "match" does not need any override, while "merge" will > handle the sparse array, and then call "super.merge". In a > "CountingBloomFilter", there will be two "merge", and it can be > construed as redundant, but I suppose that "merge" is called > much less often than "match"; hence it could be an acceptable > compromise for not having duplicate implementations (one > with a "BitSet" and one with an array of integers). > A "CountingBloomFilter" would provide an additional accessor: > "getCounts()" that returns a copy of the array of integers (i.e. > the part of the state which it manages). [Again, it's up to the > application developer to devise the strategy for serialization.] > > > There are several other examples such as layered Bloom > > filters ( > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters) > > and attenuated Bloom filters ( > > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters) > where > > the internal structure deviates from the BitSet model but that still need > > to present themselves as AbstractBloomFilters do. I think that Abstract > > class is the correct pattern here. > > Not sure: "abstract" is adequate when several subclasses need to > override behaviours; whereas, here the behaviour is the same but > the state is expanded. > > However, there may be a potential issue if the various variants above > need to be mixed and matched. I.e. will there be > * "LayeredBloomFilter > * "CountingLayeredBloomFilter" > * "AttenuatedLayeredBloomFilter" > * "CountingAttenuatedLayeredBloomFilter" > ? > > >> > So the Bloom filter needs an external reference implementation of its > >> > state, so that it can communicate said state with other Bloom filters. > >> The > >> > quesiton of what is necessary for transmission of that state is still > >> open. > >> > I think we can agree a mechanism to identify which bits are on is > >> requried > >> > as part of the state. I have seen discussions that indicate the > hashing > >> > algorithm and other configuration information should also be included, > >> but > >> > I believe this is dependent upon where the filter is being sent. All > >> Bloom > >> > filters within a system, by necessity, have to use the same hash and > >> > configuration parameters. > >> > >> The last statement is not obvious to me, but it surely looks like > >> a simplifying assumption. > >> What are examples of "system" in that sentence? > >> > > > > In my statement above a “system” is a series of processes that accept > Bloom > > filters as arguments. For example a collections of 10K buckets. If you > > created Bloom filter on each bucket, gathered the buckets in to groups of > > 100 and build Bloom filters for each group you could quickly tell if and > > where an object might be by checking the group bloom filters. But every > > Bloom filter in this system must have the same “shape” (The same hash, > > number of bits, and number of functions) for the comparisons to work. All > > of those are in one “system”. However, a single application may have > > multiple systems. > > > > So IIUC, which "hash" and how many "functions" are also part of the > state. [Number of bits is already implicitly conveyed by "getBitSet()".] > > Should the "hash" be configurable? Or can it be an "implementation > detail" (subject to change at the discretion of the library developers)? > > In the former case, I guess that we should create an "enum" that > provides all the supported algorithms. > > >> > >> > Only when sending outside the system does the > >> > other data need to be made available. In my mind this means those data > >> > structures have to be availabe to be transmitted along with the Bloom > >> > filter or filters. > >> > >> Even if so, this is about serialization/persistence; thus, not a "core" > >> issue. > >> If the above "system" is the application level, all "BloomFilter" > >> instances > >> should be able to pass messages between them using instance methods. > >> > >> > > I believe this is a “core” issue. I used an example of a proxy system in > my > > original post. There is no requirement that each of those proxies be on a > > single machine (logical, virtual or physical), in fact it could be argued > > that they probably would not be on the same machine. So passing Bloom > > filters between machines (I am reticent to say “system” here) is one of > the > > core requirements to making Bloom filter usage efficient in modern > > distributed architectures. > > If both sides of the communication channel know the exact type > of "BloomFilter" instances, the application can handle serialization > and any other required functionality that may need to persist across > the channel (e.g. the "counts"). > > >> > So now to the code in the current contribution. The argument that > >> > BloomFilter should be a class not an interface is a good one. However, > >> > as > >> > the internal representation of the bit structure can vary wildly, > >> > >> Why should it? > >> > > > > See response at concreate/abstract discussion above. > > I need more details about the variants (cf. above indeed). > > >> > >> > I think > >> > it should be an abstract class. My reasoning is as follows: Assuming > we > >> > have an external reference implementation of state many/all of the > >> methods > >> > could be implemented against that definition. > >> > >> I don't follow: OO is about behaviour(s), "state" being an > implementation > >> detail. > >> > >> This goes to the discussion of the BloomFilterFunctions and some methods > > in the BloomFilterConfiguration class that estimate sizes. Those > functions > > can be implemented against the “external reference implementation of > state” > > which could be a BitSet. They all operate on the bit patterns using the > > logical “or”, “and”, “xor” functions as well as cardinality, and > > nextSetBit. > > No problem, except that IMHO they are "BitSet" functions! > [And it might be worth implementing them as first class citizen, > using the functional interfaces in "java.util.function".] > > >> > This is what the current > >> > implementation attempts to do, but not successfully. Any > implementation > >> of > >> > the abstract class could develop an internal representation of state > >> > that > >> > is more efficient for whatever the use case is (e.g. counting Bloom > >> > filters, compressed Bloom filters, attenuated Bloom filters, etc) > >> > >> Are these variants impossible to represent with a "BitSet"? > >> > >> I think in all cases where the specific implementation of BloomFilter > > needs to present itself as a BloomFilter, yes BitSet can represent that. > > But is cases where the internal construct is much more complex it may not > > make sense to try to keep an internal BitSet in sync with other > structures > > but rather generate the BitSet if and when it is necessary. > > That's a crucial data point... > So as I wrote in the previous mail, that would indeed entail that > "BloomFilter" is an interface. And a common base class (using > a "BitSet" for its state) makes no sense. > > >> > and > >> > override the methods that are more efficiently answered within its > >> internal > >> > structure. > >> > >> Unless I'm missing the point, this then seems to mandate an > >> interface (rather than an abstract or concrete class), contrary > >> to what was said previously... > >> > >> > > I think was was missing was the discussion of the Functions as noted in > my > > response to your comments about OO behaviour. I am sorry that I did not > > make that clear in the original post, it probably would have reduced the > > confusion significantly. Just to be clear. I think this is an Abstract > > class that implements an “external reference implementation of state” > > interface. That interface will then allow the abstract Bloom filter class > > to implement the functions that are in the BloomFilterFunctions class > (thus > > removing the utility class). In addition they will permit the > > BloomFilterConfigurtion class to continue to implement the estimated size > > methods agains the AbstractBloomFilter class. I think the implementation > of > > the “external reference implementation of state” ends up being > implemented > > by the concrete Bloom filter classes. I don’t like the term “external > > reference implementation of state” and wish I had a good interface name > to > > apply to it, but I started with it earlier and am sticking with it here > > until it is either discarded or named. > > I'm lost here; I don't know what is "the" reference state, since a > single structure cannot for example represent both a standard > filter and a counting filter... > > > > > > >> > In addition, the abstract Bloom filter might implement the “data > >> > read” methods from BitSet directly. This would allow more efficient > >> > implementation of several methods that currently depend upon the > >> > creation > >> > of a BitSet and the manipulation of same. > >> > >> Sure; if any and all variants of a "BloomFilter" implementation can > >> be created from a "BitSet", then such a factory method could be > >> part of the "Builder" interface: > >> > >> public interface BloomFilter { > >> // Core methods... > >> > >> public interface Builder { > >> public Builder from(BitSet s); // Factory method. > >> public BloomFilter build(); > >> // Other builder methods. > >> } > >> } > >> > >> Indeed, this is a valid construct. Currently the constructors take a > > ProtoBloomFilter and BloomFilterConfiguration, or a BitSet. If the > > “external reference implementation of state” is used then I expect to > > change the second constructor to take an AbstractBloomFilter instead. > > Since I'm stuck with what the "reference state" is, I cannot follow... > > > It is > > the ProtoBloomFilter that has the Builder. The ProtoBloomFilter allows us > > to build the hashes but not apply them to build the specific “shape” > > (definition of “shape” noted above). Thus one ProtoBloomFilter can be the > > progenitor of Bloom filters in multiple “systems” (defintion of “system” > > noted above). This also means that the 10K bucket example I noted above > can > > become much more performant because we can segregate the bucket and group > > Bloom filters into different “Systems” so that we get better collision > > avoidance at the group level than we would otherwise. (I can provide data > > and example if you wish.) > > > > Still lost; examples may help. > My impression is that we could come to that after deciding > whether single inheritance will fit the bill or not. > > >> > > >> > > >> > I think this change yields the following: > >> > > >> > > >> > 1. Creation of an external reference implementation of state. > >> > > >> > 2. Conversion of BloomFilter for interface to AbstractBloomFilter > class > >> > > >> > 3. Merge BloomFilterFunctions into AbstractBloomFilter > >> > > >> > 4. Rename StandardBloomFilter to BloomFilter (should we do this?) > >> > > >> > 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the > >> > AbstractBloomFilter class. > >> > >> It seems that somehow we diverge on the emphasis given to the > >> "state". I think that the questions which I raised above should be > >> answered before jumping to those conclusions. > >> > > > > I agree and hope this response has clarified and answered. > > Not totally. :-} > > >> > >> > On the topic of an external reference implementation of state, could > >> > this > >> > be an interface that AbstractBloomFilter implements? Might it comprise > >> the > >> > “read data” methods from the BitSet class? > >> > >> IIUC, that will trivially be the case if "BitSet" can be the exchange > >> format. And how to actually perform serialization is a user's issue > >> (i.e. whether to rely on "BitSet" being "Serializable", or to implement > >> a proxy...). > >> > >> To be honest, I had not worked through the “external reference > > implementation of state” but initially thought it might be a BitSet. > After > > the above, perhaps not. Again, I haven’t quite thought it all through. If > > “external reference implementation of state” really is an interface then > it > > must be possible to construct a Bloom filter from an instance of it. > > Otherwise, developers would not be able to serialize the data, > deserialize > > into an instance of the interface and recreate the filter. On the other > > hand, the more complext Bloom filters might require other data (e.g. a > > sparse list of counts) to fully reconstruct them. These later cases are > > probably best considered special cases for each filter type and addressed > > within the specific implementation. > > So we came to an similar conclusion that there is definitely an > issue if the "reference" is supposed to be able to represent all > the filter variants. If this is true, then we'd have a single > "MegaBloomFilter" that combines "Standard", "Counting", > "Layered", etc. and whatever additional functionality might be > required later on... But I may be missing many things now... > > I still think that talking about serialization does not help at all. > An example (a unit test maybe) of what should be possible > in the distributed architecture which you evoked would help > fix ideas. > > >> > > >> > [1] > https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf > >> > > >> > > > > As an aside I was reading about "The Humpty Dumpty Principle" and came > > across > > > https://definitionsinsemantics.blogspot.com/2012/03/humpty-dumpty-principle-in-definitions.html > . > > The article makes me wish I have never used the term “external reference > > implementation of state” > > > > Thanks for your patience and for taking the time to read through all of > > this, > > Claude > > To be pragmatic, I wonder whether I should create a branch on > within the Apache repository, so that all (including me) can > contribute and actually test the various alternatives... > > Regards, > Gilles > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren