Re: [BloomFilters] changes to BloomFilter

Alex Herbert Fri, 20 Mar 2020 03:52:30 -0700

Sorry for an absence. I have been thinking on ways to move the BloomFilter API 
forward that consolidates the current functionality but makes it simpler to use 
for the common case.

> On 18 Mar 2020, at 17:12, Claude Warren <cla...@xenei.com> wrote:
> 
> bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
> Shape.getNumberOfBits() / 8.0 ) or it may be shorter.

I presume you mean:

bf.getBits().length * Long.BYTES  may be equal to Math.Ceil(
Shape.getNumberOfBits() / 8.0 ) or it may be longer. 

> 
> I am building byte buffers of fixed length that is the maximum size that
> any valid bf.getBits() * Long.BYTES I need to know
> Shape.getNumberOfBytes().
> The  conversion is required for some Bloom filter indexing techniques.

OK. I gave you an example of how to do this. In my example the 
Shape.getNumberOfBytes() was 1 line of code in 8.

This method has a practical use if you are creating a byte[] representation of 
the bits in a filter. So you already have to be writing code to do that. In the 
context of this process it is not going to save you a lot of code. It seems 
like this function is very specific to your need and not something generic 
required within the API.

> 
> And while serialization is outside the scope of the library, it is only
> reasonable that we provide enough information to allow developers to
> serialize/deserialse the data.  For example BloomFilter allows you to get
> either the long[] representation or the list of bit indexes (via OfInt) and
> there are ways to reconstruct a BloomFilter if you were to write that out
> and read it back.

Yes and you are free to do data transfer how you please with that information. 

There is a lot of this API that is changing. I don’t want to keep adding 
methods before the basic functionality is fixed. So leave 
Shape.getNumberOfBytes() for now.

On to the main topic for discussion ...

IMO the basic functionality is not yet fixed. It is hard to use due to the 
separation of the BloomFilter from the hashing of objects you want to put into 
it. Thus at all times to use a BloomFilter you need to also have something that 
can create Hashers. Currently they are easy to create if you have a byte[]. So 
your primary use case of hashing Strings is easy. But what about other things? 
Here are my current issues with the design:

1. The decoupling of the hashing of an item from the BloomFilter, thus the 
filter does not control the number of indexes generated
2. The use of 32-bit integers for bit indexes
3. Hashing currently requires a full length byte[] to pass to a hash function
4. Thread safety

In order:

1. The decoupling of the hashing of an item from the BloomFilter, thus the 
filter does not control the number of indexes generated

I still think the API needs to have some simple way to use a BloomFilter that 
allows adding objects. Currently you cannot use a BloomFilter on its own. You 
can only use them if you have another BloomFilter or a Hasher. So you need to 
have the ability to create one of those. It is not very in the spirit of a 
collection. A collection should be able to interact directly with what it is a 
collection of. Currently the BloomFilter cannot. We have this situation:

BloomFilter bf = BloomFilters.create(shape); // Construct somehow
T t;
// lots of code (maybe with ByteBuffer): T => byte[] bytes
bf.merge(new DynamicHasher.Builder(bf.getShape()).with(bytes).build());

The final merge operation also relies on the fact that the Hasher will create 
the correct number of indexes for the shape. A contract that it can easily 
violate and break the BloomFilter functionality.

I am thinking about a better way to do this. I like the typed BloomFilter<T> 
approach in Guava. To me it makes more sense for a BloomFilter in 
commons-collections to be typed like all other collections. Having an untyped 
raw interface is strange for a generic collection framework. I think we need an 
abstraction layer between the raw processing of indexes done in the BloomFilter 
and the objects T that you put in.

If you wish to maintain flexibility of defining Hashing then perhaps we extend 
the functionality in the Guava approach (which uses a fixed internal index 
generator) and allow the generator of indexes to be specified:

interface IndexGenerator<T> {
    // infinite stream of 64-bit longs. Not yet mapped to a BloomFilter shape.
    LongSupplier generate(T item);
}

IndexGenerator<T> generator = ...;
BloomFilter<T> bf = BloomFilter.create(generator, shape);
T t;
bf.add(t);
bf.contains(t);

This moves the hashing of the items inside the BloomFilter but allows the 
method to hash them to be configured as it is inside the IndexGenerator. This 
encapsulation prevents the filter state from being abused by adding hashers, 
which as discussed before should adhere to a contract to generate the correct 
number of indexes in the correct range for the number of bits but do not have 
to. If the BloomFilter is controlling the mapping of a stream of indexes to the 
appropriate range and number then this prevents state corruption.

You previously mentioned separation of concerns as a reason to have a 
BloomFilter and the process for creation of a Hasher separate. This was to 
allow you to send a specialised Hasher over a network. Under this system you 
can create a BloomFilter<long[]> that stores a representation of a cyclic hash. 
The IndexGenerator<T> would be a IndexGenerator<long[]> that simply cycles the 
pair of longs to create indexes. Sending the pre-hashed long[2] over a network 
is simple. You thus set up a remote hash algorithm to convert your items to 
long[2] and send them for remote query.

I’ll show an example later of how this approach could be incorporated on top of 
out current code base.

2. The use of 32-bit integers for bit indexes

The lack of 64-bit indexes for the bit positions in a filter prevents 
construction of very large filters. This is a limitation as shown here:

P = 1e-6
M = 2^31, N = 74,681,641
M= 2^31 * 64, N =  4,779,624,982

So by using a full length long[] representation that could be addressed in 
memory from a long index divided by 64 allows you to increase capacity of an 
array base filter by 64-fold. This pushes it into a range where the number of 
items that can be stored is much larger than that fully supported by the JDK 
collections (which use 32-bit integers for size). It future proofs the API for 
expansion when RAM is cheap. The actual change to the code underneath is 
minimal. It would eliminate some data structures that we currently have e.g. 
for the CountingBloomFilter, but these could be marked to throw exceptions if 
the size is too big. 

It does make the API suitable for huge filters that can for example be used to 
check database keys exist before a call to a database.

I suspect the use or 32-bit indexes was to allow the use of Set<Integer> and 
BitSet. We no longer use Set<Integer> for the counting bloom filter. A 
TreeSet<Long> would function just as well where we do use it. We can create a 
custom implementation of BitSet to allow long indexes. The BitSet has overhead 
due to its dynamic sizing that we can remove. A simple BitArray implementation 
would be easy to do.

3. Hashing currently requires a full length byte[] to pass to a hash function

Hashing requires a preallocated byte[]. I would prefer to see hashing as a 
dynamic process where bytes are fed into an algorithm. For example MurmurHash3 
32-bit algorithm only requires 4 bytes at a time. The 128-bit algorithm 
requires 2 longs. Currently the Objects32 hasher uses 1 byte at a time so could 
be totally online.

The Guava code has the concept of a Hasher as a sink that can accepts data of 
any primitive type. In their implementation bytes are always processed in the 
input order as the code is also tied to their implementation of various hash 
algorithms so the output must be correct. However there is no reason that the 
bytes have to be processed in order. This allows for an optimisation where a 
DynamicHasher can process bytes as it chooses such as immediately process 
int/float and long/double input and for smaller input put those in a temp 
storage until 4 bytes have been received. The DynamicHasher need only ensure 
all bytes contribute to the final hash but can choose the order in which bytes 
are processed.

I’ll show an example later of how this could be incorporated.

4. Thread safety

Currently the hash functions are not thread safe. If the function is moved 
inside a BloomFilter then the architecture must exist to create the stream of 
indexes for each input object in a thread safe manner. Thus multiple threads 
would be able to query the same BloomFilter instance. If we create the correct 
design this feature will be naturally included. Obviously adding to a 
BloomFilter may not be concurrent (it would be for the standard implementation 
using a bit array) but if a filter is just a gateway filter which never changes 
then I think concurrent query should be expected to work.

Here is a modification to the current code using an abstraction layer to 
convert items T to indexes. I was thinking about something like the following 
skeleton:

// Represent a hash
interface Hash {
    // Size of bits of the hash that is created
    int bits();
    // 64 bits of the hash (little-endian), starting from offset * 64.
    long getLongBits(int offset);
}

// Generate a Hash from a pre-existing byte[].
// (This is modified from the current HashFunction.)
interface StaticHashFunction {
    Hash apply(byte[], int seed);
}

// Generate indexes for an item.
// Left to the BloomFilter to call this appropriately for its own shape.
// The IndexGenerator would be constructed with:
// - A system to convert T to data for hashing (e.g. Function<T, byte[]>)
// - A system to convert data to a Hash (e.g. StaticHashFunction)
// - Other details of how a hash is incremented, e.g. cyclic/iterative process 
type
interface IndexGenerator<T> {
    // The supplier provides an indefinite stream
    LongSupplier generate(T item);
}

// Note:
// Initial implementation of IndexGenerator would be
// StaticIndexGenerator using StaticHashFunction with variants for iterative or 
cyclic.
// In either case the user supplies the Function<T, byte[]> to create a byte[] 
from
// the object. This is then used to create a Hash for each call of the 
LongSupplier
// or using a cyclic process from a single call to create the base Hash.

// Note: A PrehashedCyclicIndexGenerator<long[]> class can be used to just 
generate
// the indexes directly using the first 2 longs in the long[] object using a 
cyclic function.
// This allows constructing a BloomFilter<long[]> and storing pre-hashed items 
in it. The
// hashed representation is simple to send over a network (it is only 2 longs) 
allowing remote
// hashing and query.

// Typed to items
interface BloomFilter<T> {
    boolean add(T item);
}

// To use it you require the function to convert objects to data
Function<T, byte[]> converter = t -> t.getBytes();

IndexGenerator<T> hasher = IndexGeneratores.create(converter, 
staticHashFunction, processType);
BloomFilter<T> bf = new BloomFilter<>(hasher, shape);

// or common use case with default hash function and process type
BloomFilter<T> bf = BloomFilters.create(n, p, converter);

// common API
T t;
bf.add(t);
bf.contains(t);

The following is an extension to allow dynamic hashing similar to the Guava API

// Accepts primitive data
interface DataSink {
 DataSink putByte(byte b);
 DataSink putBytes(byte[] b, int offset, int length);
 DataSink putChars(CharSequence cs, Charset encoding);
 DataSink putUnencodedChars(CharSequence cs);
 DataSink putInt(int i);
 // All other primitives
}

// Specify how item T is decomposed into primitive data
interface Pipe<T> {
 void connect(T item, DataSink sink);
}

// Accepts data and dynamically hash it. 
interface Hasher extends DataSink {
 Hash build();
}

// Provides an implementation to create a Hash dynamically
interface DynamicHashFunction {
 Hasher newHasher(int seed);
 // Size of bits of the hash that is created
 int bits();
}

// To use it you require the pipe implementation to convert objects to data
Pipe<T> pipe = (t, sink) -> sink.putInt(t.getIntProperty());

// explicit control over hash function and conversion of hashes to indexes
IndexGenerator<T> hasher = IndexGeneratores.create(pipe, dyanmicHashFunction, 
processType);
BloomFilter<T> bf = new BloomFilter<>(hasher, shape);

// or common use case with default hash function and process type
BloomFilter<T> bf = BloomFilters.create(n, p, pipe);

With various factory methods to use defaults for a hash function and process 
type and a few implementations of Pipe for common things (e.g. String, Integer).

There are lots of holes to fill in. But it allows you to directly add items T 
to a BloomFilter<T> and control the hash function and the iterative process.

One main issue is how to deal with filter compatibility. We can compare filter 
properties k and m easily. But how to compare the IndexGenerator<T> used by the 
filter? IMO the current HashFunctionIdentity is subject to error. No system is 
perfect. Perhaps we go with the simple option of using Object.equals(Object) to 
test equality of the IndexGenerator. In the common use case you will have the 
same object for your IndexGenerator<T> when merging two filters. The equals 
will be fast. All the implementations we provide can override equals to make 
this efficient using an internal signature. The exact method can be private 
until we have the rest of the framework complete. 

There are issues. There is a lot of object creation. I’d like to minimise this. 
But I don’t see how you can have a flexible API without object creation. For 
example you could make the BloomFilter convert the Hash to indexes and avoid 
the index generator object. But then you lose the ability to use an item of 
type long[] as a prehashed item for conversion with a customised index 
generator.

If this is not the best way then I still think the API needs to have some 
simple way to use a BloomFilter that allows adding objects. Currently you 
cannot use a BloomFilter on its own. You can only use them if you have another 
BloomFilter or a Hasher. So you need to have the ability to create one of 
those. It is not very in the spirit of a collection. A collection should be 
able to interact directly with what it is a collection of. Currently the 
BloomFilter cannot. So this suggestion is to make the conversion of objects to 
data for hashing part of the API.

Note also that here we drop the ability to query using a Hasher build with 
multiple objects as it is covered by query using a BloomFilter<T> as a 
collection of T. 

These are just thoughts. But I’d like you opinion on whether we can make this 
work for all the scenarios you are applying this to. I think it would work for 
your common use cases of storing String and the remote filter query. The goal 
would be to make a BloomFilter<T> like a fuzzy Set<T> (may or may not contain 
T) and just as easy to use. Otherwise I think you risk making the API too 
complex for the common use case. That would be a user wants to store items T 
and knows the fields in T that uniquely define the item. They just build a 
function to create a byte[] representation of the object or a pipe to send data 
from T to a downstream receiver that will collate the data. The reset of the 
process to create indexes from the item data is provided by the framework.

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [BloomFilters] changes to BloomFilter

Reply via email to