On Thu, 10 Aug 2017 07:00 pm, Peter Otten wrote:

> Steven D'Aprano wrote:
>
>> On Wed, 09 Aug 2017 20:07:48 +0300, Marko Rauhamaa wrote:
>>
>>> Good point! A very good __hash__() implementation is:
>>>
>>>     def __hash__(self):
>>>         return id(self)
>>>
>>> In fact, I didn't know Python (kinda) did this by default already. I
>>> can't find that information in the definition of object.__hash__():
>>
>> Hmmm... using id() as the hash would be a terrible hash function. Objects
>
> It's actually id(self) >> 4 (almost, see C code below), to account for
> memory alignment.
Thanks for tracking that down. As you show, the default hash isn't id()
itself. The C code says:

    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */

which I think agrees with my comment: using id() itself would put too many
objects in the same bucket (i.e. too many collisions).

>>>> obj = object()
>>>> hex(id(obj))
> '0x7f1f058070b0'
>>>> hex(hash(obj))
> '0x7f1f058070b'
>
>>>> sample = (object() for _ in range(10))
>>>> all(id(obj) >> 4 == hash(obj) for obj in sample)
> True
>
>> would fall into similar buckets if they were created at similar times,
>> regardless of their value, rather than being well distributed.
>
> If that were the problem it wouldn't be solved by the current approach:
>
>>>> sample = [object() for _ in range(10)]
>>>> [hash(b) - hash(a) for a, b in zip(sample, sample[1:])]
> [1, 1, 1, 1, 1, 1, 1, 1, 1]

Arguably that's a flaw with the current approach: it (maybe?) makes
object() instances hash too closely together. But:

- perhaps it doesn't matter in practice, since the hash is taken modulo
  the size of the hash table;

- or maybe Python's dicts and sets are good enough that a difference of 1
  is sufficient to give a good distribution of objects in the hash table;

- or maybe it does matter, but since people hardly ever use object()
  itself as the keys in dicts, it doesn't come up.

Here's your example with a class that inherits from object, rather than
object itself:

py> class X(object):
...     pass
...
py> sample = [X() for x in range(10)]
py> [hash(b) - hash(a) for a, b in zip(sample, sample[1:])]
[-5338, -10910, -2976, -2284, -21326, 4, -8, 2, -4]

So my money is on object() being anomalous: because it is so small, the
hashes end up so similar. For "typical" classes, the hash function does a
much better job of mixing the hash values up.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
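As an aside, the rotation that C comment describes can be sketched in pure
Python. This is only a sketch of what I understand CPython's pointer hash
to do on a 64-bit build (the name pointer_hash is mine, not CPython's):
rotate the address right by 4 bits so the low alignment bits, which are
usually zero, move to the top of the word instead of forcing collisions.

```python
import ctypes

def pointer_hash(address, bits=64):
    # Rotate the address right by 4: the bottom 4 bits (usually zero
    # due to memory alignment) wrap around to the top of the word.
    mask = (1 << bits) - 1
    y = ((address >> 4) | (address << (bits - 4))) & mask
    # Reinterpret the unsigned word as a signed Py_hash_t, and map the
    # reserved error value -1 to -2, as CPython does.
    y = ctypes.c_ssize_t(y).value
    return -2 if y == -1 else y

# For a typical 16-byte-aligned address the low nibble is zero, so the
# rotation degenerates to a plain shift, matching the session above:
print(hex(pointer_hash(0x7f1f058070b0)))
```

On a 64-bit CPython this agrees with hash(object_instance) ==
pointer_hash(id(object_instance)), which is exactly the id(obj) >> 4
identity your generator example demonstrated.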