Should we not provide an API to compute this in code. Otherwise every
developer who uses the code will have to go to the online Bloom filter
calculator to check. They cannot do this dynamically in a program. If the
equations are standard then I think we should provide a calculator.

And Shape does not need to store the number of items to support a Bloom
filter. Strictly speaking it only needs the number of bits. But bundling
the hash function identity and number of hash functions saves you having to
pass that separately to any Bloom filter and removes the requirement to
specify these separately in the Bloom filter interface.



On Sat, 14 Mar 2020, 09:31 Claude Warren, <cla...@xenei.com> wrote:

> Shape is not intended to "Perform the standard computations using some of
> n, m, k, p to produce optimal values for the other values of n, m, k, p:"
> that is left to the developer to determine possibly with the help of
> https://hur.st/bloomfilter/ as referenced in the class javadoc. However,
> writing the constructors ends up performing the standard computations.
>
> On Sat, Mar 14, 2020 at 8:54 AM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
> >
> >
> > > On 10 Mar 2020, at 01:36, Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> > >
> > > I was going to modify the BloomFilter interface as discussed on this
> > thread.
> > >
> > > However before that I have made some changes to the current code to
> tidy
> > it up. PR is [1]. The changes target the following:
> >
> > >
> > > [1] https://github.com/apache/commons-collections/pull/140 <
> > https://github.com/apache/commons-collections/pull/140>
> > >
> >
> > This PR has updated the validations that the Shape does during
> > construction. Number of bits has changed from >= 8 to > 0. But working on
> > this it seems to me that the Shape is a composite class and should be
> > refactored.
> >
> > I think that the Shape is combining two functionalities and should be
> > separated:
> >
> > 1. Perform the standard computations using some of n, m, k, p to produce
> > optimal values for the other values of n, m, k, p:
> >
> > n = number of items
> > m = number of bits
> > k = number of hash functions
> > p = Probability of false positives (when storing n items)
> >
> > n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
> > p = pow(1 - exp(-k / (m / n)), k)
> > m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
> > k = round((m / n) * ln(2))
> >
> > 2. Provide the functional shape for a Bloom filter
> >
> > This only requires storing m and k. It also requires storing the hash
> > function identity.
> >
> > I suggest:
> >
> > - Moving all the computations to a ShapeCalculator
> >
> > - Only storing k and m in the Shape.
> >
> > - If you want to know n given p or p given n then you just call the
> method
> > in the ShapeCalculator.
> >
> > - Change constructors in the Shape to allow construction with:
> >
> > n, m; where n < m; compute k
> > n, m, k; where n < m; validate p
> > n, p; compute k and m
> > m, k, p; compute n
> > k, m;
> >
> > This adds the constructor (k, m) which is all you actually need for a
> > working Bloom filter. Possible we should validate k < m.
> >
> > The constructor using n, m and k only serves to validate p is less than 1
> > when the filter is saturated since n is only used to compute p. Perhaps
> > this constructor should be removed.
> >
> > My argument is that a Bloom filter does not require anything to function
> > other than the number of bits (m) and a hash functions (k). Typical use
> > case is to construct using (n, p) if you know your number of items and
> > filter efficiency, or a constructor using m if you know your memory
> > limitations.
> >
> > Do we need any other constructors, e.g.?
> >
> > m, p; compute k
> >
> > Note that bundling the HashFunctionIdentity with the Shape ties the Shape
> > to a functional Bloom filter. It could be renamed to BloomFilterShape to
> > make it clear that it is directly linked to the Bloom filter. But that is
> > over verbose.
> >
> > However the name Shape is generic and does not incorporate the fact that
> > it also has a HashFunctionIdentity. There should be something in the
> Shape
> > javadoc to make it clear the Shape is not just a store for values
> > associated with any Bloom filter (m, k) but is specific to a working
> Bloom
> > filter as it also defines the hash function to be used with a Bloom
> filter.
> >
> > Alex
> >
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>

Reply via email to