I agree with the HashFunction changes. The only place I know of where more items are added to a hasher is in our test code, so I think just requiring Builder.build() to do a reset is the correct solution. I think Builder should have with(byte[]), with(byte[], int offset, int len) and with(String), along the lines of the sketch below.
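Just to show the shape I have in mind (an untested sketch, not the current code; copying the range and defaulting the String encoding to UTF-8 are my assumptions about how the convenience methods would fall out):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public interface Hasher {
    // ... existing Hasher methods elided ...

    interface Builder {

        // Adds one complete item, represented by the full byte[].
        Builder with(byte[] item);

        // Convenience: adds a range of a byte[] as one item by copying the range.
        default Builder with(byte[] item, int offset, int len) {
            return with(Arrays.copyOfRange(item, offset, offset + len));
        }

        // Convenience: adds a String item, encoded as UTF-8 bytes.
        default Builder with(String item) {
            return with(item.getBytes(StandardCharsets.UTF_8));
        }

        // Builds the Hasher and resets this builder so it can be reused.
        Hasher build();
    }
}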
I find that I use with(String) more than any other with() method. If you want to add with(ByteBuffer) with the implementation you have above I think that would be a good addition. Anything else can probably be done with a Decorator. I would like to see https://github.com/apache/commons-collections/pull/131 get merged so that we can have more than one example of a hasher that actually hashes. On Tue, Mar 17, 2020 at 1:53 PM Alex Herbert <alex.d.herb...@gmail.com> wrote: > > > > On 17 Mar 2020, at 11:08, Claude Warren <cla...@xenei.com> wrote: > > > > On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <alex.d.herb...@gmail.com > <mailto:alex.d.herb...@gmail.com>> > > wrote: > > > >> > >>> The Shape tells the hasher how many hash functions to apply to each > item. > >> > >> OK. This is may misunderstanding. There is a contract that the Hasher is > >> expected to fulfil but it is just not recorded in the javadoc. I can > update > >> the docs to indicate that: > >> > >> "A Hasher represents items of arbitrary byte size as a byte > representation > >> of fixed size (a hash). The hash for each item is created using a hash > >> function; use of different seeds allows generation of different hashes > for > >> the same item. The hashes can be dynamically converted into the bit > index > >> representation used by a Bloom filter. The shape of the Bloom filter > >> defines the number of indexes per item and the range of the indexes. The > >> hasher functions to generate the correct number of indexes in the range > >> required by the Bloom filter for each item it represents. > >> > >> Note that the process of generating hashes and mapping them to a Bloom > >> filter shape may create duplicate indexes. The hasher may generate fewer > >> than the required number of hash functions per item if duplicates have > been > >> removed." > >> > >> Looks Good > > Added to master. > > > > >>> The Shape number of items is how many items are expected to be in the > >> final > >>> Bloom filter, it is more the expected value not a hard limit. > >> > >> Yes. As discussed before this is not actually required for a Bloom > filter > >> to function, it is required to maintain the intended purpose of the > filter > >> when it was constructed. > >> > >> > >> > >>> > >>> The static hasher for example will not return duplicates so it might > >> appear > >>> that it has not respected the number of functions. In addition there > is > >> no > >>> indication from the hasher how many items it contains.. > >> > >> Yes. So we state that the hasher represents one or more items. > >> > >>> > >>> The inputs to the hash.builder are byte buffers that are fed to the > hash > >>> algorithm. They are inputs to that algorithm. So primitive types > would > >>> simply convert from the primitive type to its byte buffer > representation. > >>> Is that what you meant? > >> > >> I was unclear on the purpose of the Hasher.Builder. It seemed > incomplete. > >> If the builder is to add items then it seems strange to have: > >> > >> with(byte property) > >> with(String property) > >> > >> It also seems strange to throw 'IllegalStateException if the Hasher is > >> locked’ without explaining what this means. Is the builder intended to > be > >> concurrent? What is ‘locked’? Etc. > >> > > > > Good catch. Documentation error. Originally the Hasher.Builder was > locked > > once it generated a Hasher that was removed but the documentation was not > > cleaned up. > > I’ve removed this from the Hasher.Builder interface. 
> > > > > > >> The byte could not possibly represent many meaningful objects. The > string > >> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher). > >> Both these methods could be added to the interface as default methods or > >> preferrably dropped as they are so trivial. > >> > >> I changed the documentation to remove the encoding as UTF-8 requirement > >> from the with(String) method. It seems like an implementation detail > and a > >> Hasher.Builder implementation can decide how to convert the String. It > is > >> faster to use UTF-16 bytes for instance. I understand UTF-8 is for > >> cross-platform standard. But mandating that it has to be done is too > >> restrictive IMO. It would be better as: > >> > >> with(CharSequence, Charset) > >> withUnencoded(CharSequence) > >> > >> > > The Hasher.Builder is patterned after the cryptography MessageDigest > > pattern where a number of update() calls are made before a final digest() > > produces the hash. Originally it was update() and build() but earlier > > suggestion was to change the update() to with(). > > > > Adding with( byte[] buffer, int offset, int len ) makes sense. > > > > with(String ) is a convenience method, it was documented as UTF-8 so that > > if when trying to build equivalent filters in a different language there > > was no need to look deep in the hasher code to figure out what to do. > > > > with( byte ) is also a convenience method. > > > > They are documented in Hasher.Builder to be common across all builders. > > Obviously different Builder implementations can add other functionality. > > > > One could build a decorator that wraps a builder and adds the with(<T>) > > functionality. > > > > The Builder interface is designed to be a balance between minimal > > implementation need to be functional and an implementation that can > handle > > all arbitrary types. > > OK. What I think is different from the MessageDigest pattern is that a > MessageDigest is designed to receive byte data that is composes into a > single digest representing the combined bytes. > > Here the Hasher.Builder is designed to create a Hasher that represents one > or more items. Each item is a set of byte data. So each method to add data > is effectively functioning as a complete addition of a single item. This is > how the implementation of the DynamicHasher works. Each call to add data > will add a byte[] to the List<byte[]>. When you call iterator(Shape) on the > resulting Hasher you will get an iterator that generates k indexes per > byte[] buffer that was added. > > Thus my point that adding a single byte is confusing. Not many entire > objects would have a single byte representation! > > It may be simpler to drop the methods with(String) and with(byte). There > is the option to add: > > with(byte[], int offset, int length) > > However I am reluctant to do this as it would just be a delay of the > inevitable copy to a duplicate array of the correct length given that the > HashFunction.apply(byte[], int) method does not support a range. So you > would have to extract the entire byte[] anyway. > > To aid the process of adding byte[] to the Hasher.Builder we can provided > a ByteBuilder that would function similarly to a StringBuilder with append > methods for all sorts of data. However I am wary that the JDK did not make > ByteBuffer expandable as it impacts performance. A ByteBuilder would > effectively be a duplicate of ByteBuffer but expandable. 
So perhaps we add > instead this method: > > with(ByteBuffer) > > The method will extract the current contents of the ByteBuffer between the > current position and the limit into a new byte[] and add it to the builder: > > default Builder with(ByteBuffer) { > final int remaining = byteBuffer.remaining(); > final byte[] byteArray = new byte[remaining]; > byteBuffer.get(byteArray); > return with(byteArray); > } > > Thus this is a bridge between hashing byte[] and the ability to hash > anything, e.g. > > Hasher.Builder builder = ...; > ByteBuffer bb = ByteBuffer.allocate(100); > bb.putInt(42); > bb.putDouble(123.45); > bb.flip(); > builder.with(bb); > bb.putInt(3); > bb.putLong(78493577979L); > bb.flip(); > Hasher hasher = builder.with(bb).build(); > > Given the design is fixed on having byte[] to pass entirely to a > HashFunction then we cannot remove the need to create intermediate byte[] > storage. > > An alternative is to formalise the convertor idea with a typed function: > > <T> Builder with(T item, Function<T, byte[]> convertor); > > Or provide a Builder decorator (which you discuss later): > > class BuilderDecorator<T> implements Hasher.Builder { > Function<T, byte[]> convertor; > Hasher.Builder builder; > > BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder > builder) { > this.convertor = convertor; > this.builder = builder; > } > > @Override > public Hasher build() { > return builder.build(); > } > > @Override > public BuilderDecorator<T> with(byte[] property) { > builder.with(property); > return this; > } > > public BuilderDecorator<T> with(T item) { > return with(convertor.apply(item)); > } > } > > This is for the future perhaps. > > > > > > >> I was interpreting the Hasher.Builder as a builder of a single byte[] > for > >> hashing where you would pass different primitive values or byte[] for > the > >> same Object you want to convert. This is essentially a ByteBuffer. But > if > >> it is to receive an entire object for each call then (a) it should be > >> documented as such; (b) it should be simplified to just the byte[] > method > >> with perhaps another one/two: > >> > >> with(byte[]) > >> with(byte[], int length) > >> with(T) > >> with(ByteBuffer) > >> > > > >> Adding the T method would make the interface typed as Hasher.Builder<T>. > >> It would require a function to convert items T to a byte[]: > >> > >> > > I think the T and ByteBuffer implementations are better left to a > decorator > > class. But that is just my take. Since the Bloom filter only supports a > > logical equivalent of equals putting numerical values into a Bloom filter > > is not that common. My experience is that most values are Strings. > > So we drop all method from the Builder except with(byte[]) and change the > ‘property’ to ‘item’ to clarify that the byte[] is representing an entire > item, not a property of a larger item. > > > > > Collection<T> items = ... > >> BloomFilter bf = … > >> Function<T, byte[]> converter = … > >> HashFunction hf = ... > >> > >> for (T item : items) { > >> bf.merge(new DynamicHasher.Builder<>(hf, > >> converter).with(item).build()); > >> } > >> > >> Or: > >> > >> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf, > >> converter); > >> for (T item : Collection<T>) { > >> builder.with(item); > >> } > >> bf.merge(builder.build()); > >> > >> I think the Hasher.Builder interface can be removed. 
It does not really > >> add anything to the API without a factory somewhere to be able to create > >> Hasher.Builder instances since each instance has no methods for reset: > >> > >> At one time Builder.build() performed a reset but that was removed by > > request, I would have no issues putting it back, or adding a reset > method. > > Add: > > Builder reset(); > Hasher buildAndReset(); > > Or force build() to do a reset as well? It is simplest to make build() > mandate a reset(). Any situations where you would want to do incremental > builds by adding more items to a builder and getting a bigger Hasher as a > super-set of the previous Hasher? > > > > > The Hasher.Builder interface defines a common interface across all > > builders. So I can write my code to build Bloom filters and swap out the > > hasher implementation easily as long as I stick to the Builder.Hasher > > methods. > > > > Hasher h = factory.create().with(x).with(y).with(z).build(); > >> > >> If you do not have either a factory to create a Hasher.Builder or the > >> ability to reset a Builder then why have a Hasher.Builder interface? > >> Passing around just a single instance of the builder has limited use. I > >> would drop the interface and leave it to Hasher implementations to > define > >> how they want to be constructed. > >> > >>> > >>> The hasher contract is that it will generate integers in the proper > range > >>> and use the proper number of hash functions for each item that was > added > >> to > >>> the builder and that repeated calls to getBits(Shape) will return the > >> same > >>> values. > >>> > >>> Did I misunderstand something? > >> > >> No, I did. Hence the need to clarify all the javadoc. > >> > >> What I think we are missing with the Hasher is the simplicity of the > Guava > >> implementation. What you ideally would like to do is: > >> > >> Collection<T> items = ... > >> BloomFilter bf = … > >> > >> for (T item : items) { > >> bf.merge(item); > >> } > >> > >> This change significantly reduces the applicability of this > implementation > > of Bloom filter across applications. I am currently working on several > > applications where Hashers are sent across a network to query > repositories. > > > > By separating the Hasher from the Filter (separation of concerns) then I > > don't have to send large objects across the network to do the query. Nor > > do I have to expose the properties I am querying for. In addition the > > endpoints I am querying are free to resize the bloom filters they store > as > > they see fit based on size of collection and other Shape based concerns. > > With the Guava implementation this is not possible. > > > > Note: "properties" in the above paragraph are items placed into a single > > builder. > > So as above, change ‘property’ to ‘item’. > > > > > Currently you have to do something like: > >> > >> Collection<T> items = ... > >> BloomFilter bf = … > >> Function<T, Hasher> itemToHasher = … > >> > >> for (T item : items) { > >> bf.merge(itemToHasher.apply(item)); > >> } > >> > >> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as > >> above. > >> > > > > Yes, and simple to construct from the components in the bloomfilter > > packages. In addition, I think you have made the assumption that T > > contains a single value to be added, in which case a Function<T,byte[]> > > would suffice and, assuming bf is not a CountingBloom filter, the above > can > > be written as: > > > > Hasher.Builder builder = new .... > > Function<T,byte[]> f = new ... 
> > items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) > > bf.merge( builder.build() ) > > > > Additionally if T is an object with multiple elements that are to be > added > > to the filter then > > > > Hasher.Builder nativeBuilder = new .... > > DecoratorT builder = new DecoratorT( nativeBuilder ) // adds the with(T) > > method > > items.iterator().forEachRemaining( t -> builder.with( t ) ) > > bf.merge( builder.build() ) > > > > there is no difference in result between > > - > > loop { > > bf.merge( new Builder().with(x).build() ) > > } > > > > -- and -- > > > > new Builder() > > loop { > > builder.with( x ) > > } > > bf.merge( builder.build() ); > > > > > >> It would also be possible for the itemToHasher to recycle byte[] space > by > >> returning the same Hasher object that has been reset and filled with the > >> next item. This would not be thread-safe but would save on intermediate > >> storage. > >> > >> All of this still fixes on having a byte[] representation to feed to a > >> HashFunction. Moving away from the current design would change > HashFunction > >> to specify methods to accept blocks of data and have a final method to > get > >> the hash. So making the HashFunction an online hash. This would then > match > >> Guava by having the some object accept items T and require a function to > >> map the item T to blocks of data. > >> > >> However I note that the current implementation that accepts a byte[] for > >> each call to get a hash value with a different seed can either use the > >> byte[] or not (if cyclic). If the HashFunction was online then the > choice > >> of cyclic or iterative would not easily be possible. The Guava > >> implementation creates a single hash and then the BloomFilter always > uses > >> this with a cyclic method. So the move away from the current design > would > >> be less flexible to allow different implementations of hashing. > >> > >> So we keep the byte[] interface to HashFunction for now. A performance > >> test can be used to determine if there is an advantage to an advanced > >> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be > >> added to the HashFunction to indicate that the same bytes passed with > the > >> same seed should create the same output. The same bytes with a different > >> seed should create different output with very high probability. A seed > of > >> zero is used as a reset signal for implementations that have cached > >> computation results that the byte[] input is different from the previous > >> call. > >> > >> > >> The last thing is that the Hasher.isEmpty() is not used anywhere except > >> the units tests. It seems strange to have it. Can we just assume a > Hasher > >> is not empty. An empty hasher would return an iterator that does > nothing. > >> > >> Yes, I guess we don't need isEmpty() I think it was originally used in > > BloomFilter.merge() in a guard statement to test if the merge actually > > needed to attempt to do anything. > > I have removed it. > > > > > >> In summary: > >> > >> 1. change Hasher getBits to iterator > >> > > agree > > Done. > > > > >> 2. improve documentation of Hasher and the contract that it should > fulfil > >> with respect to items and a Shape > >> > > absolutly > > I’ve updated the Javadoc as above. > > > > >> 3. potentially drop Hasher.Builder unless there is a way to reset the > >> Builder or create more > >> > > add the reset or have build() implicitly do a reset. > > I think we mandate Builder be reusable and that build() forces a reset. 
I > also think dropping all the method except with(byte[]) makes it very clear > that the builder is to accept complete items on each call. > > > > > 4. Or make Hasher.Builder typed to an object <T> so it is clear the > with(…) > >> methods are to accept a full representation of an item and add it to the > >> in-progress Hasher currently being built > >> > > disagree. > > Fine. We shall have a raw API for now and possibly decorators for objects > hashing later. > > > > > 5. Improve HashFunction javadoc on the use of the seed as a reset signal > > > > agree > > Here there are more options. If we document that apply(byte[], long) is > called with a seed of zero to initialise and then non-zero seed to create > new hashes for the same byte[] why not make the interface split into two > methods: > > long hash(byte[]); > long increment(int seed); > > We already have the ProcessType property for the HashFunction that allows > the function to specify if the entire byte[] is used each time or if the > initial hash is recycled when creating new values from a non-zero seed. > From what I can see in the implementations the cyclic functions would just > ignore the seed in the increment(int) method. The iterative functions use > the seed. > > I tweaked the ObjectsHashIterative HashFunction. But it seems to me that > it could be further improved if we can assume that apply(byte[], int) is > called with the same byte[] when the seed is non-zero. If this is true then > the Arrays.hashCode(byte[]) result can be cached for reuse. The function > should also be made UNSIGNED as the 32-bit result can be converted to an > unsigned long. > > Splitting the HashFunction interface into two methods makes it clear that > an implementation will be called multiple times with the same byte[]. This > is part of the design of the Hasher and so should be formalised into the > design of the HashFunction. > > WDYT? > > > > >> 6. Drop Hasher.isEmpty() > >> > > ambivalent. > > Dropped. > > > > >> > >> That should clarify the currently functionality. > >> > >> Thought on this? > >> > >> Alex > >> > >> > >>> > >>> Claude > >>> > >>> > >>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <alex.d.herb...@gmail.com > > > >>> wrote: > >>> > >>>> > >>>> On 16/03/2020 07:57, Claude Warren wrote: > >>>>> I made a quick pass at changing getHasher() to iterator(). > >>>> > >>>> A look at the feasibility or have you started work on this? If so then > >>>> I'll not start work on it as well. > >>>> > >>>> I changed master to return a boolean for the merge operations in > >>>> BloomFilter. So the outstanding changes are to drop getHasher() from > the > >>>> BloomFilter interface in favour of an iterator, spliterator and a > >>>> forEachBit method. > >>>> > >>>>> I think we can get rid of HasherBloomFilter as its purpose was really > >> to > >>>>> create a Bloom filter for temporary usage and it doesn't seem to be > >>>>> required if we have a hasher that can be created from a Shape and a > >>>>> function that creates an Iterator. > >>>> > >>>> I agree. > >>>> > >>>> One change that could be made is to clarify the contract between a > >>>> Hasher and a BloomFilter. At present the Hasher can operate without a > >>>> defined contract in this method: > >>>> > >>>> PrimitiveIterator.OfInt getBits(Shape shape) > >>>> > >>>> It should validate that it can generate indexes for the shape. But it > >>>> doesn't have to. It could return unlimited indexes and they could be > >>>> outside the number of bits of the BloomFilter. 
> >>>> > >>>> There does not appear to be any control anywhere on the number of hash > >>>> functions generated by the Hasher. I would expect this test in the > >>>> AbstractBloomFilterTest to pass: > >>>> > >>>> @Test > >>>> public void hasherMergeTest() { > >>>> int n = 1; > >>>> int m = 10; > >>>> HashFunctionIdentity h = new > >>>> HashFunctionIdentityImpl("provider", "name", > >>>> Signedness.SIGNED, ProcessType.CYCLIC, 0L); > >>>> Hasher hasher = new Hasher() { > >>>> @Override > >>>> public boolean isEmpty() { > >>>> return false; > >>>> } > >>>> @Override > >>>> public HashFunctionIdentity getHashFunctionIdentity() { > >>>> return h; > >>>> } > >>>> @Override > >>>> public OfInt getBits(Shape shape) { > >>>> // Do not respect the shape number of hash functions > >>>> but do respect > >>>> // the number of bits > >>>> return IntStream.range(0, m).iterator(); > >>>> } > >>>> }; > >>>> for (int k = 1; k < 5; k++) { > >>>> Shape shape = new Shape(h, n, m, k); > >>>> BloomFilter bf = createEmptyFilter(shape); > >>>> bf.merge(hasher); > >>>> assertEquals("incorrect cardinality", k, bf.cardinality()); > >>>> } > >>>> } > >>>> > >>>> It currently does not as all the BloomFilters will not respect the > Shape > >>>> with which they were created, i.e. they disregard the number of hash > >>>> functions in the Shape. So does the Hasher. > >>>> > >>>> I think some of the control should be returned to the BloomFilter. The > >>>> Hasher would be reduced to a simple generator of data for the > >>>> BloomFilter, for example: > >>>> > >>>> PrimitiveIterator.OfInt getBits(int m); > >>>> PrimitiveIterator.OfInt getBits(int k, int m); > >>>> PrimitiveIterator.OfLong getBits(); > >>>> > >>>> The BloomFilter then accept responsibility for converting the > primitives > >>>> to a suitable index and creating the correct number of hash functions > >>>> (i.e. indexes). > >>>> > >>>> A merge operation with a BloomFilter then becomes: > >>>> > >>>> - check the Hasher is using the correct hash function identity > >>>> - ask the Hasher for an iterator > >>>> - set k bits in the filter using the iterator, mapping each to the > range > >>>> [0, m) > >>>> > >>>> The BloomFilter has then encapsulated its state and respects the > Shape. > >>>> > >>>> The HashFuntion will convert byte[] to a long. > >>>> > >>>> The Hasher exists to convert anything to a byte[] format. > >>>> > >>>> This perhaps needs the Hasher.Builder to be revised to include more > >>>> methods that accept all the primitive data types. These are all > >>>> converted to a single byte[] representation. Thus the Hasher.Builder > is > >>>> effectively a specification for a ByteBuffer. Once an object is > >>>> decomposed into the byte[] it can be fed through the HashFunction with > >>>> different seeds or using the cyclic method to create the iterator. The > >>>> BloomFilter consumes the raw long output from the stream produced by > the > >>>> Hasher and sets k bits within the range m. 
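For illustration only, the merge described above might look roughly like the following (the no-argument getBits() and the Shape/Hasher accessor names are hypothetical stand-ins for the proposed API, not the current code):

import java.util.BitSet;
import java.util.PrimitiveIterator;

// Rough sketch of the three merge steps above, with a BitSet standing in for
// the filter's bit storage. Hypothetical API, for discussion only.
void merge(BitSet bits, Shape shape, Hasher hasher) {
    // 1. check the Hasher is using the correct hash function identity
    if (shape.getHashFunctionIdentity().getSignature()
            != hasher.getHashFunctionIdentity().getSignature()) {
        throw new IllegalArgumentException("Hasher does not match the filter's hash function");
    }
    // 2. ask the Hasher for an iterator of raw hash values
    final PrimitiveIterator.OfLong hashes = hasher.getBits();
    // 3. set k bits, mapping each raw value into the range [0, m)
    final long m = shape.getNumberOfBits();
    for (int i = 0; i < shape.getNumberOfHashFunctions(); i++) {
        bits.set((int) Math.floorMod(hashes.nextLong(), m));
    }
}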
> >>>> Alex
--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren