Hi,

The MD refers several time to "10K filters". Where does this limit come
from? If there is a baseline CPU and RAM combination this is based on, the
document should state it IMO. Or, is it based on the width of a Java int or
a Java long?

TY,
Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren <cla...@xenei.com> wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren <cla...@xenei.com> wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic searching works.
>
> Example 3: Locating objects with specific properties
>
> In this example assume a number of objects with common properties names. We
> construct a Bloom filter for each object. The Bloom filter comprises the
> hashes of the values of the properties (potentially with different seeds to
> differentiate between the property names). We write the objects into a data
> storage and associate the storage with their Bloom filters. The Bloom
> filter is then placed into a multidimensional Bloom filter.
>
> If we want to find all the objects where X=five and Y=blue and Z=yesterday
> we simply generate the hashes and create the Bloom filter and then search
> the multidimensional Bloom filter for matches. This is the solution found
> in searching encrypted data, where the clear text values are hashed and the
> encrypted data written to storage.
>
>
> So we have multidimensional Bloom filters that store gatekeeper type
> filters (example 1) and others that store representational (example 3).
> Example 2 seems to fall somewhere on the continuum between the two.
>
> In almost all cases you need to be able to search the filter for matching
> filters. In some cases you need to flatten the representation down to a
> Gatekeeper style. The multidimensional Bloom filter documentation noted
> above discusses several implementations of multidimensional Bloom filters.
>
> In defining a MultidimensionalBloomFilter interface we would have to
> consider the various uses. It seems that either we assume the filters being
> added to the multidimensional filter are WrappedBloomFilters so that they
> contain the additional data, or when they are added the user may specify an
> associated Object. Perhaps we define a RepresentationalBloomFilter<T> that
> implements BloomFilter and stores a <T> object. Then this can be added to
> the multidimensional filter.
>
> In any case the multidimensional filter needs to be able to search for
> matching filters that are stored within and return some sort of index.
>
> I have seen implementations of multidimensional Bloom filters that act like
> a database into which a Bloom filter is stored.  When a filter is stored an
> integer “index” is returned. The developer then uses that to store the
> object in a database or similar repository. When searching, zero or more
> matching indexes are returned. Deletion requires that the index be passed.
>
> So we perhaps could implement a multidimensional filter that looks like
> this:
>
> interface MultidimensionalBloomFilter {
>
> int add(BloomFilter filter);
>
> int[] search(BloomFilter filter);
>
> int[] locate(BloomFilter filter);
>
> boolean delete(int idx); // return true if there was a change
>
> BloomFilter flatten();
>
> }
>
> The difference between the search() and locate() methods is that search()
> does the Bloom filter matching while locate only returns exact matches.  I
> don't know if this is necessary but it may make some housekeeping much
> easier.
>
> Once the filters enter the multidimensional Bloom filter they are no longer
> visible to the outside, that is this interface does not provide a mechanism
> to return the Bloom filter for an index – I don’t know if that is
> necessary.
>
> I am reluctant to add data to the Bloom filters within the
> MultidimensionalBloomFilter because many of the implementations will
> probably have to store the filter representation on disk and serializing an
> associated arbitrary data object seems a bit much and outside the scope.
>
> I hope this clarifies the problem space and will bring clarity to any
> follow on discussions.
>
>
> Claude
>

Reply via email to