> On 10 Mar 2020, at 01:36, Alex Herbert <alex.d.herb...@gmail.com> wrote:
> 
> I was going to modify the BloomFilter interface as discussed on this thread.
> 
> However before that I have made some changes to the current code to tidy it 
> up. PR is [1]. The changes target the following:

> 
> [1] https://github.com/apache/commons-collections/pull/140 
> <https://github.com/apache/commons-collections/pull/140>
> 

This PR has updated the validations that the Shape does during construction. 
Number of bits has changed from >= 8 to > 0. But working on this it seems to me 
that the Shape is a composite class and should be refactored.

I think that the Shape is combining two functionalities and should be separated:

1. Perform the standard computations using some of n, m, k, p to produce 
optimal values for the other values of n, m, k, p:

n = number of items
m = number of bits
k = number of hash functions
p = Probability of false positives (when storing n items)

n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
k = round((m / n) * ln(2))

2. Provide the functional shape for a Bloom filter

This only requires storing m and k. It also requires storing the hash function 
identity.

I suggest:

- Moving all the computations to a ShapeCalculator

- Only storing k and m in the Shape.

- If you want to know n given p or p given n then you just call the method in 
the ShapeCalculator.

- Change constructors in the Shape to allow construction with:

n, m; where n < m; compute k
n, m, k; where n < m; validate p
n, p; compute k and m
m, k, p; compute n
k, m;

This adds the constructor (k, m) which is all you actually need for a working 
Bloom filter. Possible we should validate k < m.

The constructor using n, m and k only serves to validate p is less than 1 when 
the filter is saturated since n is only used to compute p. Perhaps this 
constructor should be removed.

My argument is that a Bloom filter does not require anything to function other 
than the number of bits (m) and a hash functions (k). Typical use case is to 
construct using (n, p) if you know your number of items and filter efficiency, 
or a constructor using m if you know your memory limitations.

Do we need any other constructors, e.g.?

m, p; compute k

Note that bundling the HashFunctionIdentity with the Shape ties the Shape to a 
functional Bloom filter. It could be renamed to BloomFilterShape to make it 
clear that it is directly linked to the Bloom filter. But that is over verbose.

However the name Shape is generic and does not incorporate the fact that it 
also has a HashFunctionIdentity. There should be something in the Shape javadoc 
to make it clear the Shape is not just a store for values associated with any 
Bloom filter (m, k) but is specific to a working Bloom filter as it also 
defines the hash function to be used with a Bloom filter.

Alex


Reply via email to