> ... > We can delay MultidimensionalBloomFilter unless introducing it later would > break binary compatibility. How beneficial would introducing this interface > for users? > > Gary > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren <cla...@xenei.com> wrote: > > > This is starting tondelve into the realm of multidimensional bloom > filters. > ... > > We could define an interface MultidimensionalBloomFilter that could be > > implemented by any Bloom filter that comprises multiple filters and givr > it > > the method flatten(). But this is a can of worms I would like to keep > > sealed for a bit longer. > > > > Claude >
I have been meaning to answer this question for a week now but time always seems to get away and it is a rather long discussion. Multidimensional Bloom filters are basically any collection of Bloom filters. However they have very different capabilities and usages. I wrote an introduction to Multidimensional Bloom filters ( https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md) but I can’t seem to find it online. Perhaps it does not get published to the site, though the site descriptions are out of date. I identify two categories of Bloom filters: Gatekeeper and Representational. Gatekeeper is the standard use where the key for an item is hashed and placed in a Bloom filter and when an expensive lookup is required the Bloom filter is consulted first to see if the lookup should be performed. Representational filters encode properties of objects and are then used to find objects with the same properties. This use case is found in genomics, encrypted database searching, and some proposals for performing pattern matching in the presence of wildcards. In the representational case the Bloom filter is associated with a data record of some sort. Multidimensional Bloom filters can contain either category. As a couple of examples: Example 1: Tracking ageing records on disk A streaming data system tracks active records by storing them on disk for an hour. Every 5 minutes it flushes any record that is older than 1 hour. A layered Bloom filter (a type of Multidimensional Bloom filter) is constructed and a new layer added whenever one becomes full (the max number of items have been added) or 5 minutes has elapsed (therefore they must have timestamps associated with them). Bloom filters are removed whenever their timestamp expires. When a record is to be added the layered Bloom filter is checked to see if the object is on the disk. If not, it is added. If so, the disk is searched for the record. Note, that if each Bloom filter in the layered Bloom filter is associated with a separate file then identifying the layers that match the data identifies which files to search further reducing the search time. In this case the layers are gatekeeper filters. Example 2: locating potential pattern matches in the presence of wildcards A system of patterns is stored where each pattern may have a wildcard matching one or more characters. Let’s assume the wildcard is ‘*’ and the patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and digits 0-9. When a pattern is added to the system every pair of characters in [A-Z0-9] is hashed and added to a Bloom filter that is associated with the pattern. The Bloom filter is then placed in a multidimensional Bloom filter. Queries are then textual strings for which potential matches are located. So the query is hashed in the same way and each Bloom filter in the multidimensional Bloom filter is checked to see if it matches. If it does then the associated pattern is checked to see if it matches. So the string “how now brown cow” would yield a Bloom filter that was the combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, “co” thus it would match the Bloom filter for the phrase. This is similar to how some genomic searching works. Example 3: Locating objects with specific properties In this example assume a number of objects with common properties names. We construct a Bloom filter for each object. The Bloom filter comprises the hashes of the values of the properties (potentially with different seeds to differentiate between the property names). We write the objects into a data storage and associate the storage with their Bloom filters. The Bloom filter is then placed into a multidimensional Bloom filter. If we want to find all the objects where X=five and Y=blue and Z=yesterday we simply generate the hashes and create the Bloom filter and then search the multidimensional Bloom filter for matches. This is the solution found in searching encrypted data, where the clear text values are hashed and the encrypted data written to storage. So we have multidimensional Bloom filters that store gatekeeper type filters (example 1) and others that store representational (example 3). Example 2 seems to fall somewhere on the continuum between the two. In almost all cases you need to be able to search the filter for matching filters. In some cases you need to flatten the representation down to a Gatekeeper style. The multidimensional Bloom filter documentation noted above discusses several implementations of multidimensional Bloom filters. In defining a MultidimensionalBloomFilter interface we would have to consider the various uses. It seems that either we assume the filters being added to the multidimensional filter are WrappedBloomFilters so that they contain the additional data, or when they are added the user may specify an associated Object. Perhaps we define a RepresentationalBloomFilter<T> that implements BloomFilter and stores a <T> object. Then this can be added to the multidimensional filter. In any case the multidimensional filter needs to be able to search for matching filters that are stored within and return some sort of index. I have seen implementations of multidimensional Bloom filters that act like a database into which a Bloom filter is stored. When a filter is stored an integer “index” is returned. The developer then uses that to store the object in a database or similar repository. When searching, zero or more matching indexes are returned. Deletion requires that the index be passed. So we perhaps could implement a multidimensional filter that looks like this: interface MultidimensionalBloomFilter { int add(BloomFilter filter); int[] search(BloomFilter filter); int[] locate(BloomFilter filter); boolean delete(int idx); // return true if there was a change BloomFilter flatten(); } The difference between the search() and locate() methods is that search() does the Bloom filter matching while locate only returns exact matches. I don't know if this is necessary but it may make some housekeeping much easier. Once the filters enter the multidimensional Bloom filter they are no longer visible to the outside, that is this interface does not provide a mechanism to return the Bloom filter for an index – I don’t know if that is necessary. I am reluctant to add data to the Bloom filters within the MultidimensionalBloomFilter because many of the implementations will probably have to store the filter representation on disk and serializing an associated arbitrary data object seems a bit much and outside the scope. I hope this clarifies the problem space and will bring clarity to any follow on discussions. Claude