Mark Dickinson <dicki...@gmail.com> added the comment:

> Why so?

Python's hash needs to obey the invariant that `hash(x) == hash(y)` for any two 
hashable objects with `x == y`. That makes life particularly hard for numeric 
types, because equality can hold across many *different* numeric types. This 
includes not just built-in types, but third-party types as well (think of 
NumPy, gmpy2, SymPy, and other libraries providing numbers that need to compare 
equal to Python numbers with the same value).

So for example, `hash(1.5)`, `hash(Decimal("1.5"))`, `hash(Fraction(3, 2))`, 
`hash(1.5 + 0j)`, `hash(numpy.float32(1.5))`, `hash(bigfloat.BigFloat(1.5, 
precision=200))` must *all* be equal to one another within a single running 
Python process.
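
For instance, this quick sketch (standard-library types only, so it runs 
anywhere; the numpy and bigfloat cases behave the same way) shows all four 
hashes collapsing to a single value:

from decimal import Decimal
from fractions import Fraction

values = [1.5, Decimal("1.5"), Fraction(3, 2), 1.5 + 0j]
# All four values are equal, so all four hashes must agree.
print(len({hash(v) for v in values}))   # 1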

Moreover, hash computation needs to be efficient for common types like floats 
and integers, while also not being impossibly slow for other types. (Early 
versions of Decimal's hash did a check to see whether the Decimal instance was 
an exact integer, and if so, converted that Decimal instance to an integer 
before taking its hash. But doing that with `Decimal("1e999999")` doesn't go 
well.)
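
Nowadays the conversion is avoided via modular arithmetic: the documented 
numeric hash reduces every rational value modulo the prime 2**61 - 1 
(`sys.hash_info.modulus` on 64-bit builds), so a huge power of ten never has 
to be materialised. A rough sketch of the idea (not the actual C 
implementation, and ignoring NaNs, infinities and the -1 -> -2 adjustment that 
the real code takes care of):

import sys
from decimal import Decimal

M = sys.hash_info.modulus   # 2**61 - 1 on 64-bit builds

def sketch_decimal_hash(d):
    # Hash the value c * 10**e without ever building the (possibly
    # enormous) integer; three-argument pow is fast even for huge
    # exponents, and accepts negative exponents from Python 3.8 on.
    sign, digits, exponent = d.as_tuple()
    c = int("".join(map(str, digits)))
    h = (c * pow(10, exponent, M)) % M
    return -h if sign else h

d = Decimal("1e999999")
print(sketch_decimal_hash(d) == hash(d))   # True (and fast) on a 64-bit build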

It would definitely be *possible* to:

- compute a hash in a cross-type-compatible way
- do some sort of uniform post-processing of that hash, incorporating 
information from a per-process random salt

The operations described by Melissa O'Neill in her PCG paper give ideas for 
some ways to do such post-processing: regard the hash and the salt as together 
forming a 128-bit integer, and then collapse that 128-bit integer down to a 64-bit 
integer using one of the PCG post-processing methods. Note that as far as I 
know there's been no validation of those methods from a cryptographic (rather 
than a statistical) perspective.
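
For illustration only (the salt constant and function names below are made up, 
and as said, nothing here has been vetted cryptographically), here is a sketch 
of running a (salt, hash) pair through PCG's XSL-RR output function:

SALT = 0x9E3779B97F4A7C15   # stand-in for a per-process random 64-bit salt

def _rotr64(x, r):
    # Rotate the 64-bit integer x right by r bits.
    return ((x >> r) | (x << (64 - r))) & 0xFFFFFFFFFFFFFFFF

def salted_numeric_hash(h):
    # View (salt, hash) as one 128-bit value, then collapse it to 64 bits
    # with XSL-RR: xor the high half into the low half, and rotate by an
    # amount taken from the top 6 bits.
    state = (SALT << 64) | (h & 0xFFFFFFFFFFFFFFFF)
    xored = ((state >> 64) ^ state) & 0xFFFFFFFFFFFFFFFF
    return _rotr64(xored, state >> 122)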

However, it would be a significant amount of work; it would be disruptive not 
just to CPython but also to third-party packages and to other Python 
implementations; it would slow down common hashing operations; and it would 
increase the amount and the complexity of the code that has to be maintained 
into the future.

So there's no shortage of reasons *not* to change the numeric hash. What I 
think we're lacking is a single reason *to* change it. Can you give a plausible 
example of a situation where the predictability of the numeric hash can lead to 
possible security issues?

See also the recent issue #37807.

> but *not* if my keys, say, are tuples of strings

Bad example. :-) The hash of a tuple is based on the hashes of its contents. So 
if those contents are strings, the tuple benefits from the string hash 
randomization.

mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-824966383135019564
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-5971473524922642515
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
5384650403450490974
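
The same thing can be seen programmatically; the helper below is purely 
illustrative (it pins PYTHONHASHSEED in child interpreters to stand in for 
separate process runs):

import os, subprocess, sys

def hash_in_fresh_process(seed, expr):
    # Evaluate hash(<expr>) in a new interpreter with a fixed PYTHONHASHSEED.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run([sys.executable, "-c", f"print(hash({expr}))"],
                         env=env, capture_output=True, text=True)
    return int(out.stdout)

# Tuples of strings pick up the string-hash randomization
# (normally three distinct values here) ...
print({hash_in_fresh_process(s, "('abc', 'def')") for s in (1, 2, 3)})
# ... while a tuple of numbers hashes identically in every process
# (a single value).
print({hash_in_fresh_process(s, "(1.5, 2)") for s in (1, 2, 3)})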
