In the MD: " This index arises from the observation that no target can match a filter with a lower hamming value."
Shouldn't "hamming" be capitalized? Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren <cla...@xenei.com> wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren <cla...@xenei.com> wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic searching works. > > Example 3: Locating objects with specific properties > > In this example assume a number of objects with common properties names. We > construct a Bloom filter for each object. The Bloom filter comprises the > hashes of the values of the properties (potentially with different seeds to > differentiate between the property names). We write the objects into a data > storage and associate the storage with their Bloom filters. The Bloom > filter is then placed into a multidimensional Bloom filter. > > If we want to find all the objects where X=five and Y=blue and Z=yesterday > we simply generate the hashes and create the Bloom filter and then search > the multidimensional Bloom filter for matches. This is the solution found > in searching encrypted data, where the clear text values are hashed and the > encrypted data written to storage. > > > So we have multidimensional Bloom filters that store gatekeeper type > filters (example 1) and others that store representational (example 3). > Example 2 seems to fall somewhere on the continuum between the two. > > In almost all cases you need to be able to search the filter for matching > filters. In some cases you need to flatten the representation down to a > Gatekeeper style. The multidimensional Bloom filter documentation noted > above discusses several implementations of multidimensional Bloom filters. > > In defining a MultidimensionalBloomFilter interface we would have to > consider the various uses. It seems that either we assume the filters being > added to the multidimensional filter are WrappedBloomFilters so that they > contain the additional data, or when they are added the user may specify an > associated Object. Perhaps we define a RepresentationalBloomFilter<T> that > implements BloomFilter and stores a <T> object. Then this can be added to > the multidimensional filter. > > In any case the multidimensional filter needs to be able to search for > matching filters that are stored within and return some sort of index. > > I have seen implementations of multidimensional Bloom filters that act like > a database into which a Bloom filter is stored. When a filter is stored an > integer “index” is returned. The developer then uses that to store the > object in a database or similar repository. When searching, zero or more > matching indexes are returned. Deletion requires that the index be passed. > > So we perhaps could implement a multidimensional filter that looks like > this: > > interface MultidimensionalBloomFilter { > > int add(BloomFilter filter); > > int[] search(BloomFilter filter); > > int[] locate(BloomFilter filter); > > boolean delete(int idx); // return true if there was a change > > BloomFilter flatten(); > > } > > The difference between the search() and locate() methods is that search() > does the Bloom filter matching while locate only returns exact matches. I > don't know if this is necessary but it may make some housekeeping much > easier. > > Once the filters enter the multidimensional Bloom filter they are no longer > visible to the outside, that is this interface does not provide a mechanism > to return the Bloom filter for an index – I don’t know if that is > necessary. > > I am reluctant to add data to the Bloom filters within the > MultidimensionalBloomFilter because many of the implementations will > probably have to store the filter representation on disk and serializing an > associated arbitrary data object seems a bit much and outside the scope. > > I hope this clarifies the problem space and will bring clarity to any > follow on discussions. > > > Claude >