On Fri, Dec 9, 2016 at 10:51 PM, Hilaire <hila...@drgeo.eu> wrote:

> Hi,
>
> In Dr. Geo I rely a lot on hash value for the geometric object. It
> largely improves speed up to compare geometric items (to avoid
> duplicated items). For example to detect when the user is building an
> already created object.
>
> Also when describing geometric sketch with Smalltalk code, hundred or
> thousand of geometric items are created. The hash value speeds up
> equality check (otherwise true comparing is iterative in the geometric
> sketch tree and it slow down)
>
> But sometime hash collisions occur. I guess in that case, DrGeo should
> do the real equality check, or may be I am overusing the hash? I am
> interested on advices. As for now I am doing nothing special.
>
> Here is an annoying hash collision with Pharo 3/ Pharo 5 I think should
> not occur:
>
> (0.5@0.25) hash
>          67006464
> (-0.5@0.25) hash
>         67006464
>
>
> Squeak5.0 does not have this collision.
>
> It looks related to the Float hash implementation a bit different.
>
> What do you think?
>

More the point, this seems bad...
-0.5 hash "==> 57344"
0.5 hash "==> 57344"

I read page 307 of "Hashing in Smalltalk Theory and Practice" **
regarding Squeak's (Pharo's) Point>>hash method...
   "at this time we will not test this hash function for points with our
floating point data sets because we already know that [Pharo's] hash
function for floats is of bad quality."

** Errm... actually first time I've opened the book since I bought it a
couple of years ago.

And from page 273 reviewing Float's hash...
    "because of how the bitAnd: is used, the hash function will discard 32
of the original 64 bits.  In other words what seems a hash function using
all the availabel bytes is anything but. [Also] because of the way the two
16 bit chunks are added togather, by the time bitshift:-8 is sent we will
only have, at most, 17 bits worth of hash value.  Effectively all the
bitshift: accomplishes is to get rid of the 8 lowest bits, which are zero
anyway because of how the bitAnd: message were set up. Can these objections
add up to particularly poor performance? The answer is definitely yes -
since we will have only 17 bit hash values this hash function will be
particularly unsuitable for any data set containing more than 2^17 =
131,072 objects"

    Float>>hash "Pharo"
  (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated
hash].
  ^ (((self basicAt: 1) bitAnd: 16r00FFFF00) +
    ((self basicAt: 2) bitAnd: 16r00FFFF00)) bitShift: -8

The suggested fix depended on being willing to let go of byte ordering
(which we might not want different results on different platforms)
    Float>>hash
^ByteArray
   hashBytes: self
   startingWith: self species hash

but anyway that doesn't actually help this case...
    -0.5 hash "==>120484352"
    0.5 hash "==>120484352"

I see Squeak5's float hash has changed...
     Float>>hash "Squeak5"
(self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated
hash].
^ ((self basicAt: 1) bitShift: -4) +
  ((self basicAt: 2) bitShift: -4)

which gives...
    0.5 hash "==>66977792"
   -0.5 hash "==>201195520"

which solves the immediate problem, but I'm not capable of analysis its
quality further.

cheers -ben

Reply via email to