On Sun, 26 Dec 2004 12:17:42 +1000, Nick Coghlan <[EMAIL PROTECTED]> wrote:

>Bengt Richter wrote:
>> I take it you are referring to regular python dictionaries,
>
>Correct. So I'll skip over the sections talking about alternate lookup 
>strategies in my reply.
Cool.
>
>>>Anyway, what are the consequences of a type which has a 'variable hash'?
>>>
>>>The only real bad consequence is that the internal state of a dictionary can 
>>>break if someone alters the hash of one of its keys. The restriction to 
>>>'constant hashes' is aimed squarely at eliminating this class of programming 
>>>errors. It doesn't actually provide any performance gain.
>> 
>> Why not? It means you only have to compute the hash for any given object 
>> once,
>> and you can cache the hash value. That could be a biggie if you had megabyte 
>> trees
>> of nested tuples as keys and reused them frequently for lookup.
>
>Allowing mutation doesn't mean abandoning your caching - it just means you 
>need 
>a way to tell the dict to update it's key cache (i.e. rehash())
There's more than one point at which a key gets hashed. The first is when
the key:value association is entered into the dict's state. (It pays to save
the hash value in the dict along with the references to the key and value
objects, for several reasons, which I'll get to [1]).

A second time a key may be hashed is when it is used as a lookup key. The
lookup key can be a reference to the identical key object first used, or it
can be a new object. A new object has to be hashed in order to have a hash
value to use in finding candidate keys to compare to. If _this_ object is
used as a key again, it must be hashed again -- unless it is an object that
caches its own hash and invalidates the cache when mutated, so it can rehash
itself if its __hash__ method is subsequently called.
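
As an aside, here's what such a self-caching key might look like (a Python 3
sketch of my own invention, not anything from the stdlib):

```python
class HashCachingKey:
    """Illustrative key object that caches its own hash and invalidates
    the cache when mutated, so repeated lookups skip recomputation."""
    def __init__(self, items):
        self._items = tuple(items)
        self._hash = None              # lazily computed, cached hash
    def mutate(self, items):
        self._items = tuple(items)
        self._hash = None              # mutation invalidates the cache
    def __hash__(self):
        if self._hash is None:         # recompute only when stale
            self._hash = hash(self._items)
        return self._hash
    def __eq__(self, other):
        return (isinstance(other, HashCachingKey)
                and self._items == other._items)

k = HashCachingKey([1, 2, 3])
d = {k: 'value'}
assert d[k] == 'value'                 # second hash() call hits the cache
```

Of course, mutating k while it is a dict key still corrupts the dict's
state; the cache only saves recomputation, it doesn't make mutation safe.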

Thus re-use of a lookup key other than the original provides an opportunity
for a performance gain in the lookup, if the key object caches its hash
value for return by __hash__. That's what I meant by "why not?" in response
to your assertion that the constant hash restriction doesn't (ok, I took
that as a "can't" ;-) provide any performance gain.

[1] IME a hash is only necessary to optimize the operation of a dict. Its
primary purpose is to limit the number of comparisons necessary to determine
whether a key exists in the dict, and to find the matching one. First the
hash value is converted to a bucket selection. This will involve some
function of the full hash value, reducing it to a value that can index the
buckets. This function can potentially change as the dict grows. Once the
bucket is selected, a number of keys may be found there. Since the job is to
find equality, the hash value can serve again, using all its bits, to select
the subset of the keys in the bucket that _may_ be equal. This is an
optimization to avoid comparing key objects with differing hashes -- which
is fine for immutable keys, which can't be equal if their hashes differ.
IOW, immutable key candidates may be rejected immediately if their hash
values aren't equal to the lookup key's hash value.

The other side of this coin is that keys whose hash values _are_ equal to
the lookup key's hash value must all be compared in some order (probably
whatever order is convenient in working through a bucket) until an equal key
is found, or not.
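
To make the footnote concrete, here's a toy version of that lookup path (my
own Python 3 sketch; real dicts use open addressing rather than per-bucket
lists, but the hash-then-equality filtering is the same idea):

```python
def bucket_insert(buckets, key, value):
    """Store the full hash alongside the key and value references."""
    h = hash(key)
    buckets[h % len(buckets)].append((h, key, value))

def bucket_lookup(buckets, key):
    h = hash(key)                       # hash once per lookup
    for stored_hash, stored_key, value in buckets[h % len(buckets)]:
        if stored_hash != h:
            continue                    # cheap rejection on hash mismatch
        if stored_key is key or stored_key == key:
            return value                # full equality confirms the match
    raise KeyError(key)

buckets = [[] for _ in range(8)]
bucket_insert(buckets, 'a', 1)
bucket_insert(buckets, (1, 2), 2)
assert bucket_lookup(buckets, 'a') == 1
```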

This means that if we give all mutable keys the same hash value, they will
all be in the same bucket, and that bucket will be searched in some order
for an equal key, and the first one found to be equal will be what I called
the effective unique key. With a constant hash value -- e.g., zero -- there
is no reason to re-hash: the hashes haven't changed ;-) There may be a
reason to eliminate shadowed keys with duplicate equal values, but it's not
necessary for consistent operation.

Based on this, I made a wrapper object for mutable keys, which returns a
constant zero as the hash value, and delegates __cmp__ to the wrapped object
for the dict's key equality testing. This means that if W is the wrapper
class, you can use W(mutable) as a key in an ordinary dict, and they all go
to the same bucket as 0 and 0.0 and whatever else hashes to 0.

I decided to modify my dict subclass to wrap unhashable key objects
automatically, with very similar effect to the original, except eliminating
the parallel lists for the mutable keys and their values. The only thing is
that I knew the order of list.index searching, but I don't know the order of
bucket-0 searching, so shadowing and elimination of duplicate keys by
re-instantiation will probably eliminate different duplicates sometimes.

>
>> I think, for one thing, it would be more efficient to notice that some keys 
>> were guaranteed not
>> to need rehashing... um, maybe by paritioning the key set into immutables 
>> and mutables ;-)
>
>True, since otherwise rehash() would have to check every key, even the 
>immutable 
>ones. However, that only affects the time required for a rehash(), rather than 
>general dictionary operation.
Given that you assume you know when "rehashing" is going to be necessary --
meaning you know when keys mutate.

>
>> For another, it's not just a matter of "moving" keys that hash 
>> "incorrectly". If keys that
>> are a part of the state of the dict can mutate outside of dict methods, then 
>> keys can mutate
>> to equal each other, and "rehashing" will give equal hashes for previously 
>> unequal and distinct keys,
>> and you must choose which of the previous values is to be the value for the 
>> two keys that have now
>> become one by equality (however __eq__, __cmp__, __coerce__ and inheritance 
>> may define equality).
>
>Or resist the temptation to guess, and have rehash() raise an exception :)
Easy out ;-)

>
>> OTOH, if KeyError is frequent because of mutation, hashing may be next to 
>> useless overhead. IOW, DYFR ;-)
>
>Indeed. I defined my FR based on Antoon's analogy of the sorted list - have 
>the 
>dict behave like sorted list, so that if you screw the invariant, it's the 
>programmer's job to tell the dictionary to fix it.
>
And how ;-)

>>>>The problem is also that you really can't make new immutable objects
>> 
>> I don't understand that one. What do you mean by "new immutable object"?
>
>Bad terminology on Antoon's part, I think - I believe he meant "new immutable 
>type".
>
>I gave the Python 2.4's Decimal as an example of something which is 
>theoretically immutable, but doesn't actually enforce that immutability 
>properly.
>
>Overriding __setattr__ can give true immutability. It just makes the class a 
>serious pain to write :)
>
>> That is so different that I don't know how it could be used to debug an app 
>> that
>> uses a mutable key dict. I think Antoon was just asking for a key-mutation 
>> detector.
>> I think that's reasonable goal. A deep copy of all keys defined would permit 
>> that,
>> at least when dict methods had control. But you couldn't find the code that 
>> mutated
>> the originals without something outside the dict to catch it in the act.
>
>The identity keyed dict was based on Antoon's posted use case: he wanted to 
>classify mutable objects, and then check for membership in those groups.
I thought he just wanted to use mutables as if they were immutable, so that
equal mutable keys would look up the same value in a dict.

>
>You can use lists for this, but the lookup is slow. Conventional dictionaries 
>and sets give a fast lookup, but require that the items in use be hashable.
>
A constant hash works. But since it clogs a single bucket, and a constant
hash gives no clue of a key mismatch to reject candidates cheaply, we'd have
to time it to see how it compares with list.index.
Did the Timbot do both?
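
For what it's worth, a rough timing harness might look like this (a Python 3
sketch of my own, using a constant-hash wrapper in the spirit of the W class
shown at the end of this message):

```python
import timeit

class W0:
    """Constant-hash wrapper: every wrapped key hashes to 0, so all land
    in one bucket and lookups degrade to equality scans, like list.index."""
    __slots__ = ('obj',)
    def __init__(self, obj):
        self.obj = obj
    def __hash__(self):
        return 0                       # constant hash: one shared bucket
    def __eq__(self, other):
        other = other.obj if isinstance(other, W0) else other
        return self.obj == other       # equality still sees the real objects

keys = [[i] for i in range(100)]       # mutable, normally unhashable keys
d = {W0(k): i for i, k in enumerate(keys)}
target = [99]                          # equal to the last key inserted

t_dict = timeit.timeit(lambda: d[W0(target)], number=200)
t_list = timeit.timeit(lambda: keys.index(target), number=200)
# Both are linear scans; which wins depends on constant factors.
```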

>An identity dictionary (or set) solves the 'object classification' problem 
>perfectly.
I missed the definition of those terms I'm afraid ;-)

>
>> You just need to DYFR ;-)
>
>I really do like this acronym :)
>
Me too. Should come in handy ;-)

BTW, here's a mutable key wrapper and the modified dict subclass that uses it:

 >>> class W(object):
 ...     """wrap an arbitrary object to make it usable as dict key"""
 ...     def __init__(self, obj): self.obj = obj
 ...     def __hash__(self):
 ...         try: return hash(self.obj)
 ...         except TypeError: return 0 # => same bucket for non-hashables
 ...     def __cmp__(self, other): return cmp(self.obj, other)
 ...     def __repr__(self): return 'W(%s)' % repr(self.obj)
 ...
 >>> class MKDW(dict):
 ...     """mutable-key dict with automatic wrapping of mutable keys"""
 ...     def __init__(self, arg=None, **kw):
 ...         if not arg and not kw : return
 ...         if arg is None: arg = []
 ...         if isinstance(arg, dict):
 ...             arg = arg.items()
 ...         arg = list(arg)
 ...         arg = arg + kw.items()
 ...         for k,v in arg:
 ...             try: self[k] = v
 ...             except TypeError: self[W(k)] = v
 ...     def __setitem__(self, key, value):
 ...         try: dict.__setitem__(self, key, value)
 ...         except TypeError: dict.__setitem__(self, W(key), value)
 ...     def __getitem__(self, key):
 ...         try: return dict.__getitem__(self, key)
 ...         except TypeError: return dict.__getitem__(self, W(key))
 ...     def __delitem__(self, key):
 ...         try: dict.__delitem__(self, key)
 ...         except (KeyError, TypeError): dict.__delitem__(self, W(key))
 ...     def __repr__(self):
 ...         return '<MKDW %s>' % dict.__repr__(self)
 ...
 >>> d = MKDW()
 >>> d
 <MKDW {}>
 >>> k1, k2, k3 = [1], [2], [3]
 >>> d[k1]=1
 >>> d
 <MKDW {W([1]): 1}>
 >>> d2 = MKDW((k,i+1) for i,k in enumerate([k1,k2,k3]))
 >>> d2
 <MKDW {W([1]): 1, W([2]): 2, W([3]): 3}>
 >>> d2 = MKDW(((k,i+1) for i,k in enumerate([k1,k2,k3])), a=1, b=2)
 >>> d2
 <MKDW {W([1]): 1, W([2]): 2, 'b': 2, 'a': 1, W([3]): 3}>
 >>> d2[[1]]
 1
 >>> d2[[2]]
 2
 >>> d2[[3]]
 3

 >>> k2[0]=3
Note how the modified k2 key shows up
 >>> d2
 <MKDW {W([1]): 1, W([3]): 2, 'b': 2, 'a': 1, W([3]): 3}>

It now shadows the k3 value
 >>> d2[k3]
 2

We can do this to "rehash" and clear out duplicates:
 >>> MKDW(d2)
 <MKDW {W([1]): 1, W([3]): 3, 'b': 2, 'a': 1}>
but the effect depends on how dict implements buckets.

Note that we can wrap a hashable too:
 >>> d3 = MKDW()
 >>> d3[W('a')] = 'eigh?'
 >>> d3
 <MKDW {W('a'): 'eigh?'}>

It winds up in the same hash bucket as the unwrapped object, so
key compares in that bucket find it:
 >>> d3['a']
 'eigh?'
 >>> d3['a'] = 'new a value'

Note that assignment via an equal-to-current key doesn't change
the dict's reference to the original key:
 >>> d3
 <MKDW {W('a'): 'new a value'}>

You could add extra parameters to W.__init__ and change __hash__
and __cmp__ to do what you wanted to do with graph nodes, I think,
even though I don't know quite what that was ;-)
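For instance (a hypothetical Python 3 sketch -- KeyedW and its keyfunc
parameter are my own invention, not what you had in mind necessarily):
deriving hash and equality from a caller-supplied key function, with id as
the default, gives the identity-dict behavior mentioned above:

```python
class KeyedW:
    """Hypothetical variant of W: hash and compare via a key function.
    The default keyfunc=id makes this an identity-dict key; pass e.g.
    tuple to compare mutable sequences by value instead."""
    __slots__ = ('obj', 'keyfunc')
    def __init__(self, obj, keyfunc=id):
        self.obj = obj
        self.keyfunc = keyfunc
    def __hash__(self):
        return hash(self.keyfunc(self.obj))
    def __eq__(self, other):
        if isinstance(other, KeyedW):
            other = other.obj
        return self.keyfunc(self.obj) == self.keyfunc(other)

a, b = [1], [1]                        # equal values, distinct objects
ident = {KeyedW(a): 'a', KeyedW(b): 'b'}      # distinct by identity
byval = {KeyedW(a, tuple): 'shared'}          # matched by value
```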
 
Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list
