I agree with the HashFunction changes. The only place I know of where more items are added to a hasher is in our test code, so I think just requiring Builder.build() to do a reset is the correct solution. I think Builder should have with(byte[]), with(byte[], int offset, int len) and with(String), along the lines of the sketch below.
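Just to show the shape I have in mind (an untested sketch, not the current code; copying the range and defaulting the String encoding to UTF-8 are my assumptions about how the convenience methods would fall out):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public interface Hasher {
    // ... existing Hasher methods elided ...

    interface Builder {

        // Adds one complete item, represented by the full byte[].
        Builder with(byte[] item);

        // Convenience: adds a range of a byte[] as one item by copying the range.
        default Builder with(byte[] item, int offset, int len) {
            return with(Arrays.copyOfRange(item, offset, offset + len));
        }

        // Convenience: adds a String item, encoded as UTF-8 bytes.
        default Builder with(String item) {
            return with(item.getBytes(StandardCharsets.UTF_8));
        }

        // Builds the Hasher and resets this builder so it can be reused.
        Hasher build();
    }
}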
I find that I use with(String) more than any other with() method. If you want to add with(ByteBuffer) with the implementation you have above I think that would be a good addition. Anything else can probably be done with a Decorator. I would like to see https://github.com/apache/commons-collections/pull/131 get merged so that we can have more than one example of a hasher that actually hashes. On Tue, Mar 17, 2020 at 1:53 PM Alex Herbert <alex.d.herb...@gmail.com> wrote: > > > > On 17 Mar 2020, at 11:08, Claude Warren <cla...@xenei.com> wrote: > > > > On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <alex.d.herb...@gmail.com > <mailto:alex.d.herb...@gmail.com>> > > wrote: > > > >> > >>> The Shape tells the hasher how many hash functions to apply to each > item. > >> > >> OK. This is may misunderstanding. There is a contract that the Hasher is > >> expected to fulfil but it is just not recorded in the javadoc. I can > update > >> the docs to indicate that: > >> > >> "A Hasher represents items of arbitrary byte size as a byte > representation > >> of fixed size (a hash). The hash for each item is created using a hash > >> function; use of different seeds allows generation of different hashes > for > >> the same item. The hashes can be dynamically converted into the bit > index > >> representation used by a Bloom filter. The shape of the Bloom filter > >> defines the number of indexes per item and the range of the indexes. The > >> hasher functions to generate the correct number of indexes in the range > >> required by the Bloom filter for each item it represents. > >> > >> Note that the process of generating hashes and mapping them to a Bloom > >> filter shape may create duplicate indexes. The hasher may generate fewer > >> than the required number of hash functions per item if duplicates have > been > >> removed." > >> > >> Looks Good > > Added to master. > > > > >>> The Shape number of items is how many items are expected to be in the > >> final > >>> Bloom filter, it is more the expected value not a hard limit. > >> > >> Yes. As discussed before this is not actually required for a Bloom > filter > >> to function, it is required to maintain the intended purpose of the > filter > >> when it was constructed. > >> > >> > >> > >>> > >>> The static hasher for example will not return duplicates so it might > >> appear > >>> that it has not respected the number of functions. In addition there > is > >> no > >>> indication from the hasher how many items it contains.. > >> > >> Yes. So we state that the hasher represents one or more items. > >> > >>> > >>> The inputs to the hash.builder are byte buffers that are fed to the > hash > >>> algorithm. They are inputs to that algorithm. So primitive types > would > >>> simply convert from the primitive type to its byte buffer > representation. > >>> Is that what you meant? > >> > >> I was unclear on the purpose of the Hasher.Builder. It seemed > incomplete. > >> If the builder is to add items then it seems strange to have: > >> > >> with(byte property) > >> with(String property) > >> > >> It also seems strange to throw 'IllegalStateException if the Hasher is > >> locked’ without explaining what this means. Is the builder intended to > be > >> concurrent? What is ‘locked’? Etc. > >> > > > > Good catch. Documentation error. Originally the Hasher.Builder was > locked > > once it generated a Hasher that was removed but the documentation was not > > cleaned up. > > I’ve removed this from the Hasher.Builder interface. 
> > > > > > >> The byte could not possibly represent many meaningful objects. The > string > >> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher). > >> Both these methods could be added to the interface as default methods or > >> preferrably dropped as they are so trivial. > >> > >> I changed the documentation to remove the encoding as UTF-8 requirement > >> from the with(String) method. It seems like an implementation detail > and a > >> Hasher.Builder implementation can decide how to convert the String. It > is > >> faster to use UTF-16 bytes for instance. I understand UTF-8 is for > >> cross-platform standard. But mandating that it has to be done is too > >> restrictive IMO. It would be better as: > >> > >> with(CharSequence, Charset) > >> withUnencoded(CharSequence) > >> > >> > > The Hasher.Builder is patterned after the cryptography MessageDigest > > pattern where a number of update() calls are made before a final digest() > > produces the hash. Originally it was update() and build() but earlier > > suggestion was to change the update() to with(). > > > > Adding with( byte[] buffer, int offset, int len ) makes sense. > > > > with(String ) is a convenience method, it was documented as UTF-8 so that > > if when trying to build equivalent filters in a different language there > > was no need to look deep in the hasher code to figure out what to do. > > > > with( byte ) is also a convenience method. > > > > They are documented in Hasher.Builder to be common across all builders. > > Obviously different Builder implementations can add other functionality. > > > > One could build a decorator that wraps a builder and adds the with(<T>) > > functionality. > > > > The Builder interface is designed to be a balance between minimal > > implementation need to be functional and an implementation that can > handle > > all arbitrary types. > > OK. What I think is different from the MessageDigest pattern is that a > MessageDigest is designed to receive byte data that is composes into a > single digest representing the combined bytes. > > Here the Hasher.Builder is designed to create a Hasher that represents one > or more items. Each item is a set of byte data. So each method to add data > is effectively functioning as a complete addition of a single item. This is > how the implementation of the DynamicHasher works. Each call to add data > will add a byte[] to the List<byte[]>. When you call iterator(Shape) on the > resulting Hasher you will get an iterator that generates k indexes per > byte[] buffer that was added. > > Thus my point that adding a single byte is confusing. Not many entire > objects would have a single byte representation! > > It may be simpler to drop the methods with(String) and with(byte). There > is the option to add: > > with(byte[], int offset, int length) > > However I am reluctant to do this as it would just be a delay of the > inevitable copy to a duplicate array of the correct length given that the > HashFunction.apply(byte[], int) method does not support a range. So you > would have to extract the entire byte[] anyway. > > To aid the process of adding byte[] to the Hasher.Builder we can provided > a ByteBuilder that would function similarly to a StringBuilder with append > methods for all sorts of data. However I am wary that the JDK did not make > ByteBuffer expandable as it impacts performance. A ByteBuilder would > effectively be a duplicate of ByteBuffer but expandable. 
So perhaps we add > instead this method: > > with(ByteBuffer) > > The method will extract the current contents of the ByteBuffer between the > current position and the limit into a new byte[] and add it to the builder: > > default Builder with(ByteBuffer) { > final int remaining = byteBuffer.remaining(); > final byte[] byteArray = new byte[remaining]; > byteBuffer.get(byteArray); > return with(byteArray); > } > > Thus this is a bridge between hashing byte[] and the ability to hash > anything, e.g. > > Hasher.Builder builder = ...; > ByteBuffer bb = ByteBuffer.allocate(100); > bb.putInt(42); > bb.putDouble(123.45); > bb.flip(); > builder.with(bb); > bb.putInt(3); > bb.putLong(78493577979L); > bb.flip(); > Hasher hasher = builder.with(bb).build(); > > Given the design is fixed on having byte[] to pass entirely to a > HashFunction then we cannot remove the need to create intermediate byte[] > storage. > > An alternative is to formalise the convertor idea with a typed function: > > <T> Builder with(T item, Function<T, byte[]> convertor); > > Or provide a Builder decorator (which you discuss later): > > class BuilderDecorator<T> implements Hasher.Builder { > Function<T, byte[]> convertor; > Hasher.Builder builder; > > BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder > builder) { > this.convertor = convertor; > this.builder = builder; > } > > @Override > public Hasher build() { > return builder.build(); > } > > @Override > public BuilderDecorator<T> with(byte[] property) { > builder.with(property); > return this; > } > > public BuilderDecorator<T> with(T item) { > return with(convertor.apply(item)); > } > } > > This is for the future perhaps. > > > > > > >> I was interpreting the Hasher.Builder as a builder of a single byte[] > for > >> hashing where you would pass different primitive values or byte[] for > the > >> same Object you want to convert. This is essentially a ByteBuffer. But > if > >> it is to receive an entire object for each call then (a) it should be > >> documented as such; (b) it should be simplified to just the byte[] > method > >> with perhaps another one/two: > >> > >> with(byte[]) > >> with(byte[], int length) > >> with(T) > >> with(ByteBuffer) > >> > > > >> Adding the T method would make the interface typed as Hasher.Builder<T>. > >> It would require a function to convert items T to a byte[]: > >> > >> > > I think the T and ByteBuffer implementations are better left to a > decorator > > class. But that is just my take. Since the Bloom filter only supports a > > logical equivalent of equals putting numerical values into a Bloom filter > > is not that common. My experience is that most values are Strings. > > So we drop all method from the Builder except with(byte[]) and change the > ‘property’ to ‘item’ to clarify that the byte[] is representing an entire > item, not a property of a larger item. > > > > > Collection<T> items = ... > >> BloomFilter bf = … > >> Function<T, byte[]> converter = … > >> HashFunction hf = ... > >> > >> for (T item : items) { > >> bf.merge(new DynamicHasher.Builder<>(hf, > >> converter).with(item).build()); > >> } > >> > >> Or: > >> > >> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf, > >> converter); > >> for (T item : Collection<T>) { > >> builder.with(item); > >> } > >> bf.merge(builder.build()); > >> > >> I think the Hasher.Builder interface can be removed. 
It does not really > >> add anything to the API without a factory somewhere to be able to create > >> Hasher.Builder instances since each instance has no methods for reset: > >> > >> At one time Builder.build() performed a reset but that was removed by > > request, I would have no issues putting it back, or adding a reset > method. > > Add: > > Builder reset(); > Hasher buildAndReset(); > > Or force build() to do a reset as well? It is simplest to make build() > mandate a reset(). Any situations where you would want to do incremental > builds by adding more items to a builder and getting a bigger Hasher as a > super-set of the previous Hasher? > > > > > The Hasher.Builder interface defines a common interface across all > > builders. So I can write my code to build Bloom filters and swap out the > > hasher implementation easily as long as I stick to the Builder.Hasher > > methods. > > > > Hasher h = factory.create().with(x).with(y).with(z).build(); > >> > >> If you do not have either a factory to create a Hasher.Builder or the > >> ability to reset a Builder then why have a Hasher.Builder interface? > >> Passing around just a single instance of the builder has limited use. I > >> would drop the interface and leave it to Hasher implementations to > define > >> how they want to be constructed. > >> > >>> > >>> The hasher contract is that it will generate integers in the proper > range > >>> and use the proper number of hash functions for each item that was > added > >> to > >>> the builder and that repeated calls to getBits(Shape) will return the > >> same > >>> values. > >>> > >>> Did I misunderstand something? > >> > >> No, I did. Hence the need to clarify all the javadoc. > >> > >> What I think we are missing with the Hasher is the simplicity of the > Guava > >> implementation. What you ideally would like to do is: > >> > >> Collection<T> items = ... > >> BloomFilter bf = … > >> > >> for (T item : items) { > >> bf.merge(item); > >> } > >> > >> This change significantly reduces the applicability of this > implementation > > of Bloom filter across applications. I am currently working on several > > applications where Hashers are sent across a network to query > repositories. > > > > By separating the Hasher from the Filter (separation of concerns) then I > > don't have to send large objects across the network to do the query. Nor > > do I have to expose the properties I am querying for. In addition the > > endpoints I am querying are free to resize the bloom filters they store > as > > they see fit based on size of collection and other Shape based concerns. > > With the Guava implementation this is not possible. > > > > Note: "properties" in the above paragraph are items placed into a single > > builder. > > So as above, change ‘property’ to ‘item’. > > > > > Currently you have to do something like: > >> > >> Collection<T> items = ... > >> BloomFilter bf = … > >> Function<T, Hasher> itemToHasher = … > >> > >> for (T item : items) { > >> bf.merge(itemToHasher.apply(item)); > >> } > >> > >> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as > >> above. > >> > > > > Yes, and simple to construct from the components in the bloomfilter > > packages. In addition, I think you have made the assumption that T > > contains a single value to be added, in which case a Function<T,byte[]> > > would suffice and, assuming bf is not a CountingBloom filter, the above > can > > be written as: > > > > Hasher.Builder builder = new .... > > Function<T,byte[]> f = new ... 
> > items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) > > bf.merge( builder.build() ) > > > > Additionally if T is an object with multiple elements that are to be > added > > to the filter then > > > > Hasher.Builder nativeBuilder = new .... > > DecoratorT builder = new DecoratorT( nativeBuilder ) // adds the with(T) > > method > > items.iterator().forEachRemaining( t -> builder.with( t ) ) > > bf.merge( builder.build() ) > > > > there is no difference in result between > > - > > loop { > > bf.merge( new Builder().with(x).build() ) > > } > > > > -- and -- > > > > new Builder() > > loop { > > builder.with( x ) > > } > > bf.merge( builder.build() ); > > > > > >> It would also be possible for the itemToHasher to recycle byte[] space > by > >> returning the same Hasher object that has been reset and filled with the > >> next item. This would not be thread-safe but would save on intermediate > >> storage. > >> > >> All of this still fixes on having a byte[] representation to feed to a > >> HashFunction. Moving away from the current design would change > HashFunction > >> to specify methods to accept blocks of data and have a final method to > get > >> the hash. So making the HashFunction an online hash. This would then > match > >> Guava by having the some object accept items T and require a function to > >> map the item T to blocks of data. > >> > >> However I note that the current implementation that accepts a byte[] for > >> each call to get a hash value with a different seed can either use the > >> byte[] or not (if cyclic). If the HashFunction was online then the > choice > >> of cyclic or iterative would not easily be possible. The Guava > >> implementation creates a single hash and then the BloomFilter always > uses > >> this with a cyclic method. So the move away from the current design > would > >> be less flexible to allow different implementations of hashing. > >> > >> So we keep the byte[] interface to HashFunction for now. A performance > >> test can be used to determine if there is an advantage to an advanced > >> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be > >> added to the HashFunction to indicate that the same bytes passed with > the > >> same seed should create the same output. The same bytes with a different > >> seed should create different output with very high probability. A seed > of > >> zero is used as a reset signal for implementations that have cached > >> computation results that the byte[] input is different from the previous > >> call. > >> > >> > >> The last thing is that the Hasher.isEmpty() is not used anywhere except > >> the units tests. It seems strange to have it. Can we just assume a > Hasher > >> is not empty. An empty hasher would return an iterator that does > nothing. > >> > >> Yes, I guess we don't need isEmpty() I think it was originally used in > > BloomFilter.merge() in a guard statement to test if the merge actually > > needed to attempt to do anything. > > I have removed it. > > > > > >> In summary: > >> > >> 1. change Hasher getBits to iterator > >> > > agree > > Done. > > > > >> 2. improve documentation of Hasher and the contract that it should > fulfil > >> with respect to items and a Shape > >> > > absolutly > > I’ve updated the Javadoc as above. > > > > >> 3. potentially drop Hasher.Builder unless there is a way to reset the > >> Builder or create more > >> > > add the reset or have build() implicitly do a reset. > > I think we mandate Builder be reusable and that build() forces a reset. 
I > also think dropping all the method except with(byte[]) makes it very clear > that the builder is to accept complete items on each call. > > > > > 4. Or make Hasher.Builder typed to an object <T> so it is clear the > with(…) > >> methods are to accept a full representation of an item and add it to the > >> in-progress Hasher currently being built > >> > > disagree. > > Fine. We shall have a raw API for now and possibly decorators for objects > hashing later. > > > > > 5. Improve HashFunction javadoc on the use of the seed as a reset signal > > > > agree > > Here there are more options. If we document that apply(byte[], long) is > called with a seed of zero to initialise and then non-zero seed to create > new hashes for the same byte[] why not make the interface split into two > methods: > > long hash(byte[]); > long increment(int seed); > > We already have the ProcessType property for the HashFunction that allows > the function to specify if the entire byte[] is used each time or if the > initial hash is recycled when creating new values from a non-zero seed. > From what I can see in the implementations the cyclic functions would just > ignore the seed in the increment(int) method. The iterative functions use > the seed. > > I tweaked the ObjectsHashIterative HashFunction. But it seems to me that > it could be further improved if we can assume that apply(byte[], int) is > called with the same byte[] when the seed is non-zero. If this is true then > the Arrays.hashCode(byte[]) result can be cached for reuse. The function > should also be made UNSIGNED as the 32-bit result can be converted to an > unsigned long. > > Splitting the HashFunction interface into two methods makes it clear that > an implementation will be called multiple times with the same byte[]. This > is part of the design of the Hasher and so should be formalised into the > design of the HashFunction. > > WDYT? > > > > >> 6. Drop Hasher.isEmpty() > >> > > ambivalent. > > Dropped. > > > > >> > >> That should clarify the currently functionality. > >> > >> Thought on this? > >> > >> Alex > >> > >> > >>> > >>> Claude > >>> > >>> > >>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <alex.d.herb...@gmail.com > > > >>> wrote: > >>> > >>>> > >>>> On 16/03/2020 07:57, Claude Warren wrote: > >>>>> I made a quick pass at changing getHasher() to iterator(). > >>>> > >>>> A look at the feasibility or have you started work on this? If so then > >>>> I'll not start work on it as well. > >>>> > >>>> I changed master to return a boolean for the merge operations in > >>>> BloomFilter. So the outstanding changes are to drop getHasher() from > the > >>>> BloomFilter interface in favour of an iterator, spliterator and a > >>>> forEachBit method. > >>>> > >>>>> I think we can get rid of HasherBloomFilter as its purpose was really > >> to > >>>>> create a Bloom filter for temporary usage and it doesn't seem to be > >>>>> required if we have a hasher that can be created from a Shape and a > >>>>> function that creates an Iterator. > >>>> > >>>> I agree. > >>>> > >>>> One change that could be made is to clarify the contract between a > >>>> Hasher and a BloomFilter. At present the Hasher can operate without a > >>>> defined contract in this method: > >>>> > >>>> PrimitiveIterator.OfInt getBits(Shape shape) > >>>> > >>>> It should validate that it can generate indexes for the shape. But it > >>>> doesn't have to. It could return unlimited indexes and they could be > >>>> outside the number of bits of the BloomFilter. 
> >>>> > >>>> There does not appear to be any control anywhere on the number of hash > >>>> functions generated by the Hasher. I would expect this test in the > >>>> AbstractBloomFilterTest to pass: > >>>> > >>>> @Test > >>>> public void hasherMergeTest() { > >>>> int n = 1; > >>>> int m = 10; > >>>> HashFunctionIdentity h = new > >>>> HashFunctionIdentityImpl("provider", "name", > >>>> Signedness.SIGNED, ProcessType.CYCLIC, 0L); > >>>> Hasher hasher = new Hasher() { > >>>> @Override > >>>> public boolean isEmpty() { > >>>> return false; > >>>> } > >>>> @Override > >>>> public HashFunctionIdentity getHashFunctionIdentity() { > >>>> return h; > >>>> } > >>>> @Override > >>>> public OfInt getBits(Shape shape) { > >>>> // Do not respect the shape number of hash functions > >>>> but do respect > >>>> // the number of bits > >>>> return IntStream.range(0, m).iterator(); > >>>> } > >>>> }; > >>>> for (int k = 1; k < 5; k++) { > >>>> Shape shape = new Shape(h, n, m, k); > >>>> BloomFilter bf = createEmptyFilter(shape); > >>>> bf.merge(hasher); > >>>> assertEquals("incorrect cardinality", k, bf.cardinality()); > >>>> } > >>>> } > >>>> > >>>> It currently does not as all the BloomFilters will not respect the > Shape > >>>> with which they were created, i.e. they disregard the number of hash > >>>> functions in the Shape. So does the Hasher. > >>>> > >>>> I think some of the control should be returned to the BloomFilter. The > >>>> Hasher would be reduced to a simple generator of data for the > >>>> BloomFilter, for example: > >>>> > >>>> PrimitiveIterator.OfInt getBits(int m); > >>>> PrimitiveIterator.OfInt getBits(int k, int m); > >>>> PrimitiveIterator.OfLong getBits(); > >>>> > >>>> The BloomFilter then accept responsibility for converting the > primitives > >>>> to a suitable index and creating the correct number of hash functions > >>>> (i.e. indexes). > >>>> > >>>> A merge operation with a BloomFilter then becomes: > >>>> > >>>> - check the Hasher is using the correct hash function identity > >>>> - ask the Hasher for an iterator > >>>> - set k bits in the filter using the iterator, mapping each to the > range > >>>> [0, m) > >>>> > >>>> The BloomFilter has then encapsulated its state and respects the > Shape. > >>>> > >>>> The HashFuntion will convert byte[] to a long. > >>>> > >>>> The Hasher exists to convert anything to a byte[] format. > >>>> > >>>> This perhaps needs the Hasher.Builder to be revised to include more > >>>> methods that accept all the primitive data types. These are all > >>>> converted to a single byte[] representation. Thus the Hasher.Builder > is > >>>> effectively a specification for a ByteBuffer. Once an object is > >>>> decomposed into the byte[] it can be fed through the HashFunction with > >>>> different seeds or using the cyclic method to create the iterator. The > >>>> BloomFilter consumes the raw long output from the stream produced by > the > >>>> Hasher and sets k bits within the range m. 
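For illustration only, the merge described above might look roughly like the following (the no-argument getBits() and the Shape/Hasher accessor names are hypothetical stand-ins for the proposed API, not the current code):

import java.util.BitSet;
import java.util.PrimitiveIterator;

// Rough sketch of the three merge steps above, with a BitSet standing in for
// the filter's bit storage. Hypothetical API, for discussion only.
void merge(BitSet bits, Shape shape, Hasher hasher) {
    // 1. check the Hasher is using the correct hash function identity
    if (shape.getHashFunctionIdentity().getSignature()
            != hasher.getHashFunctionIdentity().getSignature()) {
        throw new IllegalArgumentException("Hasher does not match the filter's hash function");
    }
    // 2. ask the Hasher for an iterator of raw hash values
    final PrimitiveIterator.OfLong hashes = hasher.getBits();
    // 3. set k bits, mapping each raw value into the range [0, m)
    final long m = shape.getNumberOfBits();
    for (int i = 0; i < shape.getNumberOfHashFunctions(); i++) {
        bits.set((int) Math.floorMod(hashes.nextLong(), m));
    }
}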
> >>>> Alex
--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren