Re: [collections] Bloom filters

Alex Herbert Sat, 14 Mar 2020 12:07:51 -0700


> On 14 Mar 2020, at 17:02, Claude Warren <cla...@xenei.com> wrote:
> 
> I am confused.
> 
> If we restrict what Shape stores to "m" and "t" we can provide methods to
> estimate "n" and "p" for the estimated "n"
> 
> If the constructors for Shape remains as as they are then developers don't
> need to go online to get numbers to plug into Shape.  They only need to go
> online if they want to understand the interplay between the values.
> 
> I think that most of the time developers will know the number of items and
> a probability and let "k" and "m" default.
> 
> It seems like the API is built into the Shape constructor.  Pass in any
> hashFunctionIdentity and various parameters and you can get the other
> parameters.

I know you can do that. It just is not very nice to have to create a Shape 
object just to see what the probability would be and have to handle exceptions 
being thrown if you try bad values. We can refactor the Shape computations to a 
calculator to allow programatic access. But since this is not essential and it 
does not really matter then I left it for now and merged what we currently 
have. The API can always be added to but it is hard to take away.

I do think we should validate that n < m in the constructor if you use both. It 
is impossible to try and store more items than you have number of bits. I think 
this should be raised as an error. It would prevent use of the constructor with 
n and m mistakenly swapped.

As for storing n or not then I’ll quote you from GitHub so it is all discussed 
here:

"If you want to just store `m` and `k` then I would like to see methods defined 
to estimate `n` and `p`

  `n = ceil( m * ln2 / k )`

  The importance of Shape is that it tells the application what the Bloom 
filters are intended to look like.  The saturation of individual filters may 
vary but it provides a baseline from which to estimate.  If you think about 
building a collections of filters with a gating filter.  You can start the 
shape of the gating filter with the inverse of the probability of the shape of 
the filters that are to be stored.  But to get there you have to have the `n` 
for the shape.”

So we could choose to not store n and derive it from m and k. But that would be 
an “optimal” n and loses the intent of the constructor of the shape.

If you do store one more of either n or p then you have made it clear what the 
intension was when you created the shape, i.e. are you looking for a very low 
false-positive rate (1e-9) or just an acceptable one (1e-3). In this case it 
does make sense to store it in order to keep the intent of the original code 
that constructed the shape captured in the shape along with the Bloom filter. 
So let’s leave this.

I’ve merged the things I’ve been updating and now will look at revising the 
Bloom filter interface to use iterators instead of a Hasher to return the 
indexes.

Other business:

1. Why are the NAME constants in the HashFunction implementations public? I 
think they could be made private. You can always create an instance and use 
getName() to find out what the name is. If public then the javadoc should be 
updated to state the NAME refers not to the entire HashFunction (including 
process type and signedness) but only the underlying method/algorithm used to 
hash unlimited bytes to a defined length of pseudorandom bytes. The name does 
not capture how the defined length of pseudorandom bytes is converted to a 
stream of int values.

2. Should we hard-code the signature? It is currently checked in the unit tests 
that the getSignature method returns the same as the computed value (by 
actually computing it). The signature then becomes like a serialVersionUID. It 
is hard coded but means something. If hard coded then it makes instantiation of 
the HashFunctions free from computational cost. 

Note I did look at using an on-demand holder idiom, e.g. for 
ObjectsHashIterative:

    /**
     * Initialise the signature for this hash function on-demand.
     *
     * @see <a 
href="https://en.wikipedia.org/wiki/Initialization-on-demand_holder_idiom";>
     * Initialisation on demand</a>
     */
    private static class SignatureHolder {
        /** The signature. */
        static final long signature;

        static {
            signature = Signatures.getSignature(new ObjectsHashIterative());
        }
    }

    @Override
    public long getSignature() {
        return SignatureHolder.signature;
    }

But given the getSignature() method is to be used a lot it may be better to 
just hard code it. I did not actually investigate performance verses a static 
final constant, instance final constant or an inner class static final constant.

IMO simplest is best and hard-coding it is appropriate here.

Alex

> 
> On Sat, Mar 14, 2020 at 9:50 AM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> 
>> Should we not provide an API to compute this in code. Otherwise every
>> developer who uses the code will have to go to the online Bloom filter
>> calculator to check. They cannot do this dynamically in a program. If the
>> equations are standard then I think we should provide a calculator.
>> 
>> And Shape does not need to store the number of items to support a Bloom
>> filter. Strictly speaking it only needs the number of bits. But bundling
>> the hash function identity and number of hash functions saves you having to
>> pass that separately to any Bloom filter and removes the requirement to
>> specify these separately in the Bloom filter interface.
>> 
>> 
>> 
>> On Sat, 14 Mar 2020, 09:31 Claude Warren, <cla...@xenei.com> wrote:
>> 
>>> Shape is not intended to "Perform the standard computations using some of
>>> n, m, k, p to produce optimal values for the other values of n, m, k, p:"
>>> that is left to the developer to determine possibly with the help of
>>> https://hur.st/bloomfilter/ as referenced in the class javadoc. However,
>>> writing the constructors ends up performing the standard computations.
>>> 
>>> On Sat, Mar 14, 2020 at 8:54 AM Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>>> On 10 Mar 2020, at 01:36, Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>>>> 
>>>>> I was going to modify the BloomFilter interface as discussed on this
>>>> thread.
>>>>> 
>>>>> However before that I have made some changes to the current code to
>>> tidy
>>>> it up. PR is [1]. The changes target the following:
>>>> 
>>>>> 
>>>>> [1] https://github.com/apache/commons-collections/pull/140 <
>>>> https://github.com/apache/commons-collections/pull/140>
>>>>> 
>>>> 
>>>> This PR has updated the validations that the Shape does during
>>>> construction. Number of bits has changed from >= 8 to > 0. But working
>> on
>>>> this it seems to me that the Shape is a composite class and should be
>>>> refactored.
>>>> 
>>>> I think that the Shape is combining two functionalities and should be
>>>> separated:
>>>> 
>>>> 1. Perform the standard computations using some of n, m, k, p to
>> produce
>>>> optimal values for the other values of n, m, k, p:
>>>> 
>>>> n = number of items
>>>> m = number of bits
>>>> k = number of hash functions
>>>> p = Probability of false positives (when storing n items)
>>>> 
>>>> n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
>>>> p = pow(1 - exp(-k / (m / n)), k)
>>>> m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
>>>> k = round((m / n) * ln(2))
>>>> 
>>>> 2. Provide the functional shape for a Bloom filter
>>>> 
>>>> This only requires storing m and k. It also requires storing the hash
>>>> function identity.
>>>> 
>>>> I suggest:
>>>> 
>>>> - Moving all the computations to a ShapeCalculator
>>>> 
>>>> - Only storing k and m in the Shape.
>>>> 
>>>> - If you want to know n given p or p given n then you just call the
>>> method
>>>> in the ShapeCalculator.
>>>> 
>>>> - Change constructors in the Shape to allow construction with:
>>>> 
>>>> n, m; where n < m; compute k
>>>> n, m, k; where n < m; validate p
>>>> n, p; compute k and m
>>>> m, k, p; compute n
>>>> k, m;
>>>> 
>>>> This adds the constructor (k, m) which is all you actually need for a
>>>> working Bloom filter. Possible we should validate k < m.
>>>> 
>>>> The constructor using n, m and k only serves to validate p is less
>> than 1
>>>> when the filter is saturated since n is only used to compute p. Perhaps
>>>> this constructor should be removed.
>>>> 
>>>> My argument is that a Bloom filter does not require anything to
>> function
>>>> other than the number of bits (m) and a hash functions (k). Typical use
>>>> case is to construct using (n, p) if you know your number of items and
>>>> filter efficiency, or a constructor using m if you know your memory
>>>> limitations.
>>>> 
>>>> Do we need any other constructors, e.g.?
>>>> 
>>>> m, p; compute k
>>>> 
>>>> Note that bundling the HashFunctionIdentity with the Shape ties the
>> Shape
>>>> to a functional Bloom filter. It could be renamed to BloomFilterShape
>> to
>>>> make it clear that it is directly linked to the Bloom filter. But that
>> is
>>>> over verbose.
>>>> 
>>>> However the name Shape is generic and does not incorporate the fact
>> that
>>>> it also has a HashFunctionIdentity. There should be something in the
>>> Shape
>>>> javadoc to make it clear the Shape is not just a store for values
>>>> associated with any Bloom filter (m, k) but is specific to a working
>>> Bloom
>>>> filter as it also defines the hash function to be used with a Bloom
>>> filter.
>>>> 
>>>> Alex
>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>> 
>> 
> 
> 
> -- 
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Reply via email to