On 02/03/2020 09:38, Claude Warren wrote:
It is not too late to update the BloomFilter interface to have the merge
return a boolean. An incorrect Shape would still throw an exception, so
the return value would only come into play if the bits could not be set.

thoughts?

I don't see the harm in it. But what would the return value be for?

For a standard collection it would be if the collection was changed by the operation:

Collection.add/remove return "true if this collection changed as a result of the call"

So here is the equivalent:

return "true if this filter was changed as a result of the call"

This is computationally slow to track. It is also confusing to return false when the filter was successfully merged but no bits changed, since the filter was in fact incorporated. So it would go along the lines we discussed for the counting Bloom filter:

return "true if this filter was successfully merged as a result of the call"

For most cases in the current library it would be true when an exception is not thrown. However the merge of the counting Bloom filter may have reason to return false, e.g. overflow.
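
As a concrete illustration, here is a minimal sketch of a counting merge that reports overflow through the return value (mergeCounts and the int[] backing array are hypothetical, not the library API):

    // Minimal sketch, not the library implementation. An incompatible
    // Shape would still throw before this point; the boolean only
    // reports whether every count could be incremented.
    static boolean mergeCounts(int[] counts, int[] indices) {
        boolean ok = true;
        for (int index : indices) {
            if (counts[index] == Integer.MAX_VALUE) {
                ok = false; // overflow: leave the count saturated
            } else {
                counts[index]++;
            }
        }
        return ok;
    }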

On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote:

For the remove(), add(), and subtract() methods I agree that void is not correct; it should be boolean, the same as the value you would get from calling isValid().

You are correct that getCounts() should return an iterator of some type on int[]; I don't know why I thought long[]. I am happy with a plain Iterator<int[]> as the return.

For the getCounts() method I am still looking for a way around having to create an <index, count> pair for everything in the filter. An alternative to an iterator is to use the consumer idea. Given there is no primitive specialisation of BiConsumer<T, U> in the JDK 8 functional interfaces, we define our own:

interface BitCountConsumer {
    void accept(int index, int count);
}

The CountingBloomFilter then has:

    void forEachCount(BitCountConsumer consumer);
    // Or
    void getCounts(BitCountConsumer consumer);


You can then directly pass the counts from the backing storage to the destination.
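
For example, transferring counts between filters would then need no intermediate objects (a sketch: forEachCount is the method proposed above; add(index, count) is assumed here purely for illustration):

    void addAll(CountingBloomFilter source, CountingBloomFilter destination) {
        // Each <index, count> pair flows straight from the backing
        // storage of 'source' into 'destination'.
        source.forEachCount((index, count) -> destination.add(index, count));
    }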

Advantages:
- No object creation
Disadvantages:
- The counts cannot be streamed

An alternative is to provide an iterator of long values with the index and count packed into a single long, plus static methods to extract them:

   PrimitiveIterator.OfLong getCounts();

   static int getIndex(long indexCountPair) {
        return (int) (indexCountPair >>> 32);
   }

   static int getCount(long indexCountPair) {
        return (int) indexCountPair;
   }
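
For completeness, the producer side would pack each pair as (a sketch):

   static long pack(int index, int count) {
       // Mask the count to avoid sign extension when widening to long.
       return ((long) index << 32) | (count & 0xFFFFFFFFL);
   }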

This will almost certainly be a cause for bugs/issues from users.

I believe that the counts will be used for 2 main use cases:

1. Storage

2. Adding to another counting Bloom filter

Both cases are likely to be done serially and not in parallel. So providing a consumer based API to receive the counts would work.

WDYT?


Claude



On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:


On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:

I think the CountingBloomFilter interface needs to extend BloomFilter.
I said that but did not write it, sorry.

I think I am confused.

I would expect CountingBloomFilter to have

interface CountingBloomFilter extends BloomFilter {
    // these 2 methods are the reverse of merge()
    void remove( BloomFilter );
    void remove( Hasher );
Fine. Same intention but different method names. But why void? It forces
you to check if the remove was valid with a second call. On the plus side
it matches the void merge(…) methods and in most cases a user would not
care to check anyway. If they are controlling the filter then leave it to
them to make sure they do not remove something they did not add.

    // these 2 methods are the addition/subtraction of counts
    void add( CountingBloomFilter )
    void subtract( CountingBloomFilter );
Fine. But same comment as above with the void return.

    // 2 methods to retrieve data
    Stream<long[]> getCounts();
I don’t like this use of long[]. In my previous post I argued that if you
were to ever want to store more than max integer items in a filter then the
Bloom filter would have more bit indices than max integer. So we never have
to support long counts. A filter that exceeds max integer for a count is
highly likely to be saturated and no use as a filter anyway.

For most backing implementations the object type of the stream will be
different so you will have to write a Spliterator<T> implementation or wrap
some iterator anyway. So why not return the Spliterator:

    Spliterator<int[]> getCounts();

Since the backing implementation will likely not store int[] pairs then
this will have a lot of object creation and garbage collection overhead to
go through the counts. This does not seem to be a big concern here if the
purpose is the same as for the BloomFilter for long[] getBits(), i.e. to
get a canonical representation for storage.

Note: The Spliterator may not have a known size (number of non zero bits)
at creation, for example if the counts are stored in a fixed size array.
Thus efficient parallel traversal by binary tree splitting is limited by
how evenly the counts are distributed. For a backing implementation using a
collection then the size should be known. In this case a Spliterator would
be of more use than a plain Iterator. You can convert one to the other
using:

java.util.Spliterators:

public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
                                             long size,
                                             int characteristics)
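
So, for example, a backing store that knows its number of non-zero counts could expose a sized Spliterator by wrapping its iterator (a sketch using the java.util.Spliterators factory above):

    // A known size enables binary-tree splitting for parallel traversal.
    static Spliterator<int[]> countSpliterator(Iterator<int[]> pairs, long size) {
        return Spliterators.spliterator(pairs, size,
            Spliterator.ORDERED | Spliterator.DISTINCT | Spliterator.NONNULL);
    }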

So which to choose for the API?

The Hasher currently uses an Iterator:

PrimitiveIterator.OfInt getBits(Shape shape);

In the case of a StaticHasher this could return a spliterator. But the
DynamicHasher needs a reworking of the internal Iterator class. It could be
a Spliterator to use the new IntConsumer API but in most (all) cases
splitting would not be possible for dynamic hashing as the parts are
produced in order. It is likely that they will be consumed sequentially too.

I would suggest that Spliterator is the more modern implementation, despite not always being applicable to parallelisation in a stream. Currently the iterator from the Hasher is used in forEachRemaining() and in while loops in approximately equal measure. The while loops are for a fast exit and would be uglier if rewritten for the Spliterator.tryAdvance(IntConsumer) syntax.
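
To illustrate the fast-exit point (a sketch; isSet(int) is a hypothetical bit test on the filter):

    // Iterator style: a clean early return.
    boolean containsAll(PrimitiveIterator.OfInt iter) {
        while (iter.hasNext()) {
            if (!isSet(iter.nextInt())) {
                return false; // fast exit on the first missing bit
            }
        }
        return true;
    }

    // Spliterator style: tryAdvance pushes the value into a consumer, so
    // the exit condition must live in mutable state outside the lambda.
    boolean containsAll(Spliterator.OfInt spliterator) {
        boolean[] missing = {false};
        while (!missing[0]
                && spliterator.tryAdvance((int i) -> missing[0] = !isSet(i))) {
            // continue until exhausted or a missing bit is found
        }
        return !missing[0];
    }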

There is a use of the IteratorChain in HasherBloomFilter that would need
a rethink for spliterators.

The path of least resistance is to use Iterator<int[]> for the API of
CountingBloomFilter to be consistent with Hasher’s use of Iterator.

WDYT?


    boolean isValid()
Fine. Allows some level of feedback that the counts are corrupt.

}

Claude


On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:


On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:

The idea of a backing array is fine and the only problem I see with it is in very large filters (on the order of 10^8 bits and larger), but document the size calculation and let the developer worry about it.
Let us look at the use case where we max out the array. Using the Bloom
filter calculator:

n = 149,363,281
p = 0.001000025 (1 in 1000)
m = 2147483647 (256MiB)
k = 10

n = 74,681,641
p = 0.000001 (1 in 999950)
m = 2147483647 (256MiB)
k = 20

n = 49,787,761
p = 0.000000001 (1 in 999924899)
m = 2147483647 (256MiB)
k = 30

So you will be able to put somewhere in the order of 10^7 to 10^8 items into the filter. I would say that anyone putting more than that into the filter has an unusual use case. The CountingBloomFilter can throw an exception if m is too large and will throw an OutOfMemoryError if you cannot allocate an array large enough.

One clear point here is that you cannot use a short as a 16-bit count would easily overflow. So you have to use an integer array for the counts. A maximum length int[] is roughly 8GB.

What would another implementation cost in terms of memory? The TreeMap<Integer, Integer> was the most space efficient. In the previous e-mail the saturation of a Bloom filter's bits was approximately 50% when at the intended capacity. So we have to estimate the size of a TreeMap containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the TreeMap memory scales linearly with entries:

 32768 /  65536 (0.500) : TreeMap<Integer, Integer>      =  1834061 bytes
 65536 / 131072 (0.500) : TreeMap<Integer, Integer>      =  3669080 bytes
131072 / 262144 (0.500) : TreeMap<Integer, Integer>      =  7339090 bytes
So what is the memory for a TreeMap with 2^30 indices? I make it about:

(2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB

I would say that this amount of RAM is unusual. It is definitely not as efficient as using an array. So very large counting Bloom filters are going to require some thought as to the hardware they run on. This may not be the case in 10 years time.

I would say that we try an int[] backing array for the storage implementation and document its limitations. A different implementation could be provided in future if required.

This could be done by making CountingBloomFilter an interface that extends BloomFilter with the methods:

subtract(BloomFilter filter)
subtract(Hasher filter)

These will negate the effect of the corresponding merge(BloomFilter) operation.

Do we also need access to the counts and add/subtract of another
CountingBloomFilter?:

add(CountingBloomFilter filter);
subtract(CountingBloomFilter filter);

Iterator<int[]> getCounts();
int getSize(); // Number of items added

The CountingBloomFilter is then an interface that defines how to reverse the merge of some bits into the filter.

My concern is the inefficiency of the creation of objects in any method that provides access to the counts (e.g. above using an iterator as for Hasher.getBits()). I presume this method would be to allow some type of storage/serialisation of the filter akin to the long[] getBits() method of BloomFilter. So it may be better done using a method:

int getCount(int index);

The caller can then use long[] getBits() to get the indices set in the filter and then for each non-zero bit index call getCount(index). Or just not include the method as the counts are only of concern when storing the filter. This functionality is cleaner pushed into an implementation.
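
A sketch of that caller-side pattern (getBits() exists on BloomFilter; getCount(int) is only proposed and store() is a hypothetical callback):

    long[] bits = filter.getBits();
    for (int i = 0; i < bits.length; i++) {
        long word = bits[i];
        while (word != 0) {
            // Index of the lowest set bit in this 64-bit word.
            int index = i * Long.SIZE + Long.numberOfTrailingZeros(word);
            store(index, filter.getCount(index));
            word &= word - 1; // clear the lowest set bit
        }
    }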

In a previous post we discussed whether to throw an exception on overflow/underflow or raise an invalid flag. Using the invalid flag idea the interface would be:

interface CountingBloomFilter {
    int add(CountingBloomFilter filter);
    int subtract(BloomFilter filter);
    int subtract(Hasher filter);
    int subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The status would be negative if any operation overflowed/underflowed, or zero if OK. The current status is returned by the add/subtract methods.

However I note that overflow may not be a concern. The number of items to add to a filter to create overflow would be using a filter with a number of bits that is unrealistic to store in memory:

n = 2147483647
p = 0.001000025 (1 in 1000)
m = 30875634182 (3.59GiB)
k = 10

If you want to add 2 billion items (and overflow an integer count) then your filter would be so big it would break the rest of the API that uses a 32-bit int for the bit index.

Thus only underflow is a realistic concern. This could be documented as handled in an implementation specific manner (i.e. throw or ignore). The API is then simplified to:

interface CountingBloomFilter {
    boolean add(CountingBloomFilter filter);
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    boolean subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The boolean is used to state that add/subtract did not over/underflow.
Implementations can throw if they require it.

The question then becomes what getSize() represents if an add/subtract did not execute cleanly. Under this scheme it would be the number of (add - subtract) operations. The status flag would be used to indicate if the size is valid, or any of the counts from getCount(). The simpler API is to not allow access to counts/size or adding/subtracting counts:

interface CountingBloomFilter {
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    int getStatus();
    // Or something like ...
    boolean isValid();
}

This filter is then only concerned with reversing the merge of Bloom
filters with a valid status flag to indicate that the current state is
consistent (i.e. all filters have been cleanly merged/subtracted).

WDYT?

As for the merge question: merge is a standard bloom filter operation. It is well defined in the literature. Merging a bloom filter into a counting bloom filter means incrementing the bit counts. I think that merge/remove should continue to operate as though the parameter were a standard bloom filter.

OK. So the count is to represent the number of filters that had a bit set at that index. That makes it clearer.

We had spoken of implementing an add/delete method pair that would operate on CountingBloomFilters and would add/subtract the counts (e.g. add(CountingBloomFilter) and subtract(CountingBloomFilter)).

I disagree with your proposal for the merge(Hasher) implementation, and I am not certain that an add(Hasher) makes sense. First consider that the Hasher returns the bits that are to be enabled in the Bloom filter, so collisions are expected. In the normal course of events a Hasher is used to create a normal Bloom filter where all the duplicates are removed. That filter is then merged into a CountingBloomFilter. So in some sense the Hasher and the normal Bloom filter are the same, and I would expect the merge of the Hasher and the merge of the normal Bloom filter created from that hasher into a CountingBloomFilter to yield the same result. If you wanted to add an add(Hasher)/delete(Hasher) pair of functions to a CountingBloomFilter you could implement it with duplicate counting, but I am not certain of the validity of such a count and I fear that it muddies the waters with respect to what the CountingBloomFilter is counting.
Agreed.

Claude

On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:


On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
Alex would you take a look at pull request 131 [1]. It adds a new hasher implementation and makes the HashFunctionValidator available for public use.
[1] https://github.com/apache/commons-collections/pull/131
OK. I’ll take a look.

I’ve been thinking about the counting Bloom filter and the backing storage. In summary:

1. The backing storage should be a fixed array.
2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.

For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter. The wikipedia page gives this formula for the expected proportion of bits still set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:

E(q) = (1 - 1/m)^(ki)

So a rough guess of the proportion of indices (bits) used by a filter is 1 - E(q).
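
As a check against the first row of the table below: m = 19171, k = 13, i = 10 gives E(q) = (1 - 1/19171)^130 ~ 0.9932. A throwaway method to reproduce the values:

    // E(q) = (1 - 1/m)^(k * i): expected proportion of bits still zero.
    static double expectedZeroBits(long m, int k, long i) {
        return Math.pow(1.0 - 1.0 / m, (double) k * i);
    }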

Here is a table of Bloom filters with different collision probabilities and the proportion of bits that will be set when 1%, 10% and 100% of the capacity of the filter has been met:

n       p       m       k       i       E(q)    bits
1000    1E-04   19171   13      10      0.9932  0.0068
1000    1E-04   19171   13      100     0.9344  0.0656
1000    1E-04   19171   13      1000    0.5076  0.4924
1000    1E-05   23963   17      10      0.9929  0.0071
1000    1E-05   23963   17      100     0.9315  0.0685
1000    1E-05   23963   17      1000    0.4919  0.5081
1000    1E-06   28756   20      10      0.9931  0.0069
1000    1E-06   28756   20      100     0.9328  0.0672
1000    1E-06   28756   20      1000    0.4988  0.5012
10000   1E-06   287552  20      100     0.9931  0.0069
10000   1E-06   287552  20      1000    0.9328  0.0672
10000   1E-06   287552  20      10000   0.4988  0.5012

The point is that if you create a Bloom filter and fill it to 10% of the intended capacity the number of indices used will be about 6-7% of the filter bits.

So how to store the counts? Currently the counting bloom filter uses a TreeMap<Integer, Integer>. I tried:

TreeMap<Integer, Integer>
HashMap<Integer, Integer>
TreeSet<MutableCount>
HashSet<MutableCount>
int[]

The MutableCount is a custom object that stores the bit index and uses it for a hash code and then has a mutable integer count field. It allows the count to be incremented/decremented if the object is in the set:

   static final class MutableCount implements Comparable<MutableCount> {
       final int key;
       int count;
       // etc
   }

This is adapted from the Bag<T> collection which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set. So you can find the object, change the count and not have to put it back into the set. This is more efficient than changing the Integer stored in a Map.
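
A sketch of that find-and-mutate pattern with a TreeSet (NavigableSet has no get(), so floor() is used to locate the stored element; the MutableCount constructor is assumed):

    static void increment(TreeSet<MutableCount> set, int index) {
        MutableCount probe = new MutableCount(index); // assumed constructor
        MutableCount stored = set.floor(probe); // greatest element <= probe
        if (stored != null && stored.key == index) {
            stored.count++; // mutate in place; no remove/re-insert
        } else {
            probe.count = 1;
            set.add(probe);
        }
    }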

I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1].

Basically you:

- create an object and throw it away. All classes are then initialised.
- Then you free memory (run garbage collection) and get the current memory usage
- Then create a lot of your objects (held in an array)
- Then measure memory usage again
- memory = (after - before) / count
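
A sketch of that procedure (Runtime.gc() is only a request to the JVM, hence the best-effort loop; the holding array itself adds a small constant overhead):

    static long bytesPerObject(java.util.function.Supplier<Object> factory, int count) {
        factory.get(); // load and initialise the classes
        long before = usedMemory();
        Object[] hold = new Object[count]; // strong references
        for (int i = 0; i < count; i++) {
            hold[i] = factory.get();
        }
        long after = usedMemory();
        return (after - before) / count;
    }

    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < 5; i++) {
            rt.gc(); // best-effort garbage collection
        }
        return rt.totalMemory() - rt.freeMemory();
    }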

Here is some example output for n bits set in size m:

13107 / 262144 (0.050) : TreeMap<Integer, Integer>      =   733947 bytes
26214 / 262144 (0.100) : TreeMap<Integer, Integer>      =  1467866 bytes
13107 / 262144 (0.050) : TreeSet<MutableCount>          =   838928 bytes
26214 / 262144 (0.100) : TreeSet<MutableCount>          =  1677776 bytes
13107 / 262144 (0.050) : HashMap<Integer, Integer>      =  1677712 bytes
26214 / 262144 (0.100) : HashMap<Integer, Integer>      =  2306739 bytes
13107 / 262144 (0.050) : HashSet<MutableCount>          =  1782664 bytes
26214 / 262144 (0.100) : HashSet<MutableCount>          =  2516656 bytes
    0 / 262144 (0.000) : int[]                          =  1048608 bytes
    0 / 262144 (0.000) : short[]                        =   524320 bytes

The estimate is accurate to 0.0001% for the arrays so the method is working. The HashMap was created with the capacity set to the expected capacity of the filter (m).

I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% the TreeSet is also less efficient.

Note that the java.util.Tree/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key, but here the Set had to contain the MutableCount which has an extra int field and is larger than an Integer.

These data lead me to think that a counting Bloom filter should just use a fixed size backing array:

- If created using the same Shape as a standard Bloom filter it uses a fixed size. This has high memory cost when the filter is empty but when it exceeds 10% of the intended capacity it is more efficient than a dynamic backing storage.
- All operations will operate in order(n) time for an operation with another filter with n indices. Each individual index count in the filter will have order(1) time for access/update. Performance will be limited by the memory cache of the entire array.

The issue is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance will such a filter be used in a short-term lifecycle? If it is simply to merge into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index, count> pairs:

Stream<int[]> getCounts();
void add(Stream<int[]> counts);

The issue here is that the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.


Thus the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope that I do get k indices. But the hash may not be perfect and I may get [1, k] indices with some duplications. This is part of the signature of that object with the given hash. So surely a counting Bloom filter should accommodate this. If my Hasher generates the same index 20 times this should result in the count of that index incrementing by 20.
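
A sketch of the two behaviours side by side (not the current implementation; counts is an int[] backing array and the classes are from java.util):

    // Current behaviour: indices are first collected into a Set, so each
    // distinct index is incremented once.
    static void mergeFlattened(int[] counts, PrimitiveIterator.OfInt indices) {
        Set<Integer> distinct = new HashSet<>();
        indices.forEachRemaining((int i) -> distinct.add(i));
        distinct.forEach(i -> counts[i]++);
    }

    // Proposed behaviour: every index emitted by the Hasher increments
    // its count, so a duplicated index contributes more than 1.
    static void mergeWithDuplicates(int[] counts, PrimitiveIterator.OfInt indices) {
        indices.forEachRemaining((int i) -> counts[i]++);
    }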

The result is that if an object is added directly to a counting Bloom filter using a Hasher it will have a different result than if it is added to a standard Bloom filter and then that filter is added to the counting Bloom filter.

Opinions on this?

Alex



[1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:

I have created a PR that contains most of the changes discussed in this thread (see [1]).

Please review the changes and comment here or on GitHub.

I have left the CountingBloomFilter alone. I will reimplement the
add/subtract functionality as discussed into another PR.

Alex

[1] https://github.com/apache/commons-collections/pull/133




--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


