> On 10 Mar 2020, at 01:36, Alex Herbert <alex.d.herb...@gmail.com> wrote: > > I was going to modify the BloomFilter interface as discussed on this thread. > > However before that I have made some changes to the current code to tidy it > up. PR is [1]. The changes target the following:
> > [1] https://github.com/apache/commons-collections/pull/140 > <https://github.com/apache/commons-collections/pull/140> > This PR has updated the validations that the Shape does during construction. Number of bits has changed from >= 8 to > 0. But working on this it seems to me that the Shape is a composite class and should be refactored. I think that the Shape is combining two functionalities and should be separated: 1. Perform the standard computations using some of n, m, k, p to produce optimal values for the other values of n, m, k, p: n = number of items m = number of bits k = number of hash functions p = Probability of false positives (when storing n items) n = ceil(m / (-k / ln(1 - exp(ln(p) / k)))) p = pow(1 - exp(-k / (m / n)), k) m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2)))) k = round((m / n) * ln(2)) 2. Provide the functional shape for a Bloom filter This only requires storing m and k. It also requires storing the hash function identity. I suggest: - Moving all the computations to a ShapeCalculator - Only storing k and m in the Shape. - If you want to know n given p or p given n then you just call the method in the ShapeCalculator. - Change constructors in the Shape to allow construction with: n, m; where n < m; compute k n, m, k; where n < m; validate p n, p; compute k and m m, k, p; compute n k, m; This adds the constructor (k, m) which is all you actually need for a working Bloom filter. Possible we should validate k < m. The constructor using n, m and k only serves to validate p is less than 1 when the filter is saturated since n is only used to compute p. Perhaps this constructor should be removed. My argument is that a Bloom filter does not require anything to function other than the number of bits (m) and a hash functions (k). Typical use case is to construct using (n, p) if you know your number of items and filter efficiency, or a constructor using m if you know your memory limitations. Do we need any other constructors, e.g.? m, p; compute k Note that bundling the HashFunctionIdentity with the Shape ties the Shape to a functional Bloom filter. It could be renamed to BloomFilterShape to make it clear that it is directly linked to the Bloom filter. But that is over verbose. However the name Shape is generic and does not incorporate the fact that it also has a HashFunctionIdentity. There should be something in the Shape javadoc to make it clear the Shape is not just a store for values associated with any Bloom filter (m, k) but is specific to a working Bloom filter as it also defines the hash function to be used with a Bloom filter. Alex