Just a quick note that, on my end, I might be able to get back to looking at this PR soon-ish but it might not be until the week-end.
Gary On Mon, Oct 14, 2019 at 8:12 AM Claude Warren <cla...@xenei.com> wrote: > Greetings, > > > I feel like I have been beating my head against a wall trying to get this > contribution accepted; this is not a complaint about process or > personalities, just a statement of how I feel. I realize that I have made > progress but I also realize that there is a long way to go. Furthermore, > the other potential users and supporters have not responded to any > communication about this proposal so I am floundering on that front too. > What I want to do is take a step back. Let’s remove the collection packages > from consideration and work out the base package first. With that in mind I > would like to present reasoning behind the classes as they now stand and > suggest a way forward to get the core classes in before adding the > collections. > > > As discussed earlier Bloom filters are a concept. In most peoples mind’s > they are bit vectors with specific named functions (e.g. match => A and B = > A, hammingValue => cardinality(this)). This idea of the filter as a bit > vector led me to using a BitSet. Originally the BloomFilter was > serializable but valid arguments against that and toward making the > internal representation visible enought to serialize that lead to making > the BitSet public. In addition to the normal Bloom filter methods > (hammingValue, match, etc.) there are some methods (log, cosine distance, > jaccard distance, etc.) that are used by researchers and other attempting > to do some work with indexing and such for large collections of filters. > > > Finally there is the question of “normal” usage. In his paper on > “Compressed Bloom Filters”[1] (an implementation we do not have yet) > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is both > a data structure being used at the proxies, and a message being passed > between them”. He is discussing the issue of using bloom filters in a > distributed cache where the proxies distributed their bloom filter so that > all the proxies know which proxy might have the document. So the Bloom > filter needs both an internal representation and a mechanism to pass the > data between systems that use it. It seems obvious that the BitSet works as > the mechanism to pass the data between systems, though an immutable version > would be preferred. > > > So the Bloom filter needs an external reference implementation of its > state, so that it can communicate said state with other Bloom filters. The > quesiton of what is necessary for transmission of that state is still open. > I think we can agree a mechanism to identify which bits are on is requried > as part of the state. I have seen discussions that indicate the hashing > algorithm and other configuration information should also be included, but > I believe this is dependent upon where the filter is being sent. All Bloom > filters within a system, by necessity, have to use the same hash and > configuration parameters. Only when sending outside the system does the > other data need to be made available. In my mind this means those data > structures have to be availabe to be transmitted along with the Bloom > filter or filters. > > > So now to the code in the current contribution. The argument that > BloomFilter should be a class not an interface is a good one. However, as > the internal representation of the bit structure can vary wildly, I think > it should be an abstract class. My reasoning is as follows: Assuming we > have an external reference implementation of state many/all of the methods > could be implemented against that definition. This is what the current > implementation attempts to do, but not successfully. Any implementation of > the abstract class could develop an internal representation of state that > is more efficient for whatever the use case is (e.g. counting Bloom > filters, compressed Bloom filters, attenuated Bloom filters, etc) and > override the methods that are more efficiently answered within its internal > structure. In addition, the abstract Bloom filter might implement the “data > read” methods from BitSet directly. This would allow more efficient > implementation of several methods that currently depend upon the creation > of a BitSet and the manipulation of same. > > > I think this change yields the following: > > > 1. Creation of an external reference implementation of state. > > 2. Conversion of BloomFilter for interface to AbstractBloomFilter class > > 3. Merge BloomFilterFunctions into AbstractBloomFilter > > 4. Rename StandardBloomFilter to BloomFilter (should we do this?) > > 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the > AbstractBloomFilter class. > > > On the topic of an external reference implementation of state, could this > be an interface that AbstractBloomFilter implements? Might it comprise the > “read data” methods from the BitSet class? > > > [1] https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf > > > -- > I like: Like Like - The likeliest place on the web > <http://like-like.xenei.com> > LinkedIn: http://www.linkedin.com/in/claudewarren >