[collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Claude Warren Sat, 12 Oct 2024 02:05:56 -0700

> ...
> We can delay MultidimensionalBloomFilter unless introducing it later would
> break binary compatibility. How beneficial would introducing this interface
> for users?
>
> Gary
>
>
> On Sun, Oct 6, 2024, 10:58 AM Claude Warren <[email protected]> wrote:
>
> > This is starting tondelve into the realm of multidimensional bloom
> filters.
> ...
> > We could define an interface MultidimensionalBloomFilter that could be
> > implemented by any Bloom filter that comprises multiple filters and givr
> it
> > the method flatten().  But this is a can of worms I would like to keep
> > sealed for a bit longer.
> >
> > Claude
>

I have been meaning to answer this question for a week now but time always
seems to get away and it is a rather long discussion.

Multidimensional Bloom filters are basically any collection of Bloom
filters. However they have very different capabilities and usages. I wrote
an introduction to Multidimensional Bloom filters (
https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md)
but I can’t seem to find it online. Perhaps it does not get published to
the site, though the site descriptions are out of date.

I identify two categories of Bloom filters: Gatekeeper and Representational.

Gatekeeper is the standard use where the key for an item is hashed and
placed in a Bloom filter and when an expensive lookup is required the Bloom
filter is consulted first to see if the lookup should be performed.

Representational filters encode properties of objects and are then used to
find objects with the same properties. This use case is found in genomics,
encrypted database searching, and some proposals for performing pattern
matching in the presence of wildcards. In the representational case the
Bloom filter is associated with a data record of some sort.

Multidimensional Bloom filters can contain either category.

As a couple of examples:

Example 1: Tracking ageing records on disk

A streaming data system tracks active records by storing them on disk for
an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
layered Bloom filter (a type of Multidimensional Bloom filter) is
constructed and a new layer added whenever one becomes full (the max number
of items have been added) or 5 minutes has elapsed (therefore they must
have timestamps associated with them). Bloom filters are removed whenever
their timestamp expires.

When a record is to be added the layered Bloom filter is checked to see if
the object is on the disk. If not, it is added. If so, the disk is searched
for the record. Note, that if each Bloom filter in the layered Bloom filter
is associated with a separate file then identifying the layers that match
the data identifies which files to search further reducing the search time.
In this case the layers are gatekeeper filters.

Example 2: locating potential pattern matches in the presence of wildcards

A system of patterns is stored where each pattern may have a wildcard
matching one or more characters. Let’s assume the wildcard is ‘*’ and the
patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
digits 0-9. When a pattern is added to the system every pair of characters
in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
the pattern. The Bloom filter is then placed in a multidimensional Bloom
filter.

Queries are then textual strings for which potential matches are located.
So the query is hashed in the same way and each Bloom filter in the
multidimensional Bloom filter is checked to see if it matches. If it does
then the associated pattern is checked to see if it matches.

So the string “how now brown cow” would yield a Bloom filter that was the
combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
“co” thus it would match the Bloom filter for the phrase. This is similar
to how some genomic searching works.

Example 3: Locating objects with specific properties

In this example assume a number of objects with common properties names. We
construct a Bloom filter for each object. The Bloom filter comprises the
hashes of the values of the properties (potentially with different seeds to
differentiate between the property names). We write the objects into a data
storage and associate the storage with their Bloom filters. The Bloom
filter is then placed into a multidimensional Bloom filter.

If we want to find all the objects where X=five and Y=blue and Z=yesterday
we simply generate the hashes and create the Bloom filter and then search
the multidimensional Bloom filter for matches. This is the solution found
in searching encrypted data, where the clear text values are hashed and the
encrypted data written to storage.

So we have multidimensional Bloom filters that store gatekeeper type
filters (example 1) and others that store representational (example 3).
Example 2 seems to fall somewhere on the continuum between the two.

In almost all cases you need to be able to search the filter for matching
filters. In some cases you need to flatten the representation down to a
Gatekeeper style. The multidimensional Bloom filter documentation noted
above discusses several implementations of multidimensional Bloom filters.

In defining a MultidimensionalBloomFilter interface we would have to
consider the various uses. It seems that either we assume the filters being
added to the multidimensional filter are WrappedBloomFilters so that they
contain the additional data, or when they are added the user may specify an
associated Object. Perhaps we define a RepresentationalBloomFilter<T> that
implements BloomFilter and stores a <T> object. Then this can be added to
the multidimensional filter.

In any case the multidimensional filter needs to be able to search for
matching filters that are stored within and return some sort of index.

I have seen implementations of multidimensional Bloom filters that act like
a database into which a Bloom filter is stored. When a filter is stored an
integer “index” is returned. The developer then uses that to store the
object in a database or similar repository. When searching, zero or more
matching indexes are returned. Deletion requires that the index be passed.

So we perhaps could implement a multidimensional filter that looks like
this:

interface MultidimensionalBloomFilter {

int add(BloomFilter filter);

int[] search(BloomFilter filter);

int[] locate(BloomFilter filter);

boolean delete(int idx); // return true if there was a change

BloomFilter flatten();

}

The difference between the search() and locate() methods is that search()
does the Bloom filter matching while locate only returns exact matches. I
don't know if this is necessary but it may make some housekeeping much
easier.

Once the filters enter the multidimensional Bloom filter they are no longer
visible to the outside, that is this interface does not provide a mechanism
to return the Bloom filter for an index – I don’t know if that is necessary.

I am reluctant to add data to the Bloom filters within the
MultidimensionalBloomFilter because many of the implementations will
probably have to store the filter representation on disk and serializing an
associated arbitrary data object seems a bit much and outside the scope.

I hope this clarifies the problem space and will bring clarity to any
follow on discussions.

Claude

[collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Reply via email to