Re: [Pharo-users] Hash collision

Andres Valloud Tue, 13 Dec 2016 08:40:42 -0800

On 12/10/16 10:36 , Martin McClure wrote:

On 12/09/2016 09:45 AM, Ben Coman wrote:


The suggested fix depended on being willing to let go of byte ordering
(which we might not want different results on different platforms)
    Float>>hash
^ByteArray
   hashBytes: self
   startingWith: self species hash

This fix contains a good idea, but has at least two problems:
1) The check for integers is important, that's what makes sure that 1.0
has the same hash as 1. Which it must, since they compare equal.
2) #hashBytes:startingWith: is designed to hash *bytes*. A BoxedFloat64
is two *words*, which ensures that when you hashMultiply the high-order
bits of each word get discarded.

The high order bits of each word, yes. However, I would have expectedthe byte oriented code to be, basically,


h := some constant.
bytes do: [:x | h * 1644525 + x]

or

h := some constant.
bytes do: [:x | h + x * 1644525]

In either case, the sign bit in one of the x values should have made adifference. Apparently it didn't. What am I missing here?

I tried implementing a hash that actually uses the bytes of the float
(code at bottom of this message). I didn't do any real analysis of it,
but at least it gives different answers for 0.5 and -0.5. I think it
matches the *intent* of the code above (though would still need the
integer check, which I didn't bother with).

Side comment: equality between integers, floats, fractions and othernumbers induces a desire to have equal hashes... in my opinion, theresulting complexity hard to justify.

I see Squeak5's float hash has changed...
     Float>>hash "Squeak5"
(self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self
truncated hash].
^ ((self basicAt: 1) bitShift: -4) +
  ((self basicAt: 2) bitShift: -4)

This is slightly better. It's only discarding eight bits out of 64,
instead of discarding 32. But it would be better to mix all the bits
into the hash.


Yes.

Note that all of these hash implementations end up manipulating at least
one large integer most of the time, since each word can be a large
integer. It's possible that a primitive to copy a word object into the
equivalent bytes might speed things up somewhat.


... or use the byte oriented hash.

"Strictly experimental method on BoxedFloat64, could probably be made
faster even without a primitive."

bytesHash
    | bytes word1 word2 |
    bytes := ByteArray new: 8.
    word1 := self basicAt: 1.
    word2 := self basicAt: 2.
    bytes at: 1 put: (word1 bitShift: -24).
    bytes at: 2 put: ((word1 bitShift: -16) bitAnd: 16rFF).
    bytes at: 3 put: ((word1 bitShift: -8) bitAnd: 16rFF).
    bytes at: 4 put: (word1  bitAnd: 16rFF).
    bytes at: 5 put: (word2 bitShift: -24).
    bytes at: 6 put: ((word2 bitShift: -16) bitAnd: 16rFF).
    bytes at: 7 put: ((word2 bitShift: -8) bitAnd: 16rFF).
    bytes at: 8 put: (word2  bitAnd: 16rFF).
    ^ ByteArray hashBytes: bytes startingWith: self class hash

I would have expected the floats to be a byte object of size 8. Why isthis conversion needed? Is somehow the primitive thinking the float hassize 2? Or is the primitive hashing 32 bits at a time? The prim usedto be a C version of the byte oriented 1644525 multiplicative hash, didthis change?


Andres.

Re: [Pharo-users] Hash collision

Reply via email to