Re: Mutable objects which define hash (was Re: Why are tuples immutable?)

Nick Coghlan Wed, 29 Dec 2004 04:48:21 -0800

<Sections cut where I don't feel I have anything to add to what Bengt already 
said>

Bengt Richter wrote:

A second time a key may be hashed is when it is used as a lookup key. This can 
be a reference to
the identical key object first used, or it can be a new object. A new object 
has to be hashed in order
to have a hash value to use in finding candidate keys to compare to. If _this_ 
object
is used as a key again, it must be hashed again -- unless it is an object that 
caches its own
hash and invalidates it when mutated so it can rehash itself if its __hash__ 
method is subsequently
called.

Thus re-use of a different lookup key than the original provides an opportunity 
for a performance
gain in the lookup if the key object caches it hash value for return by 
__hash__.
That's what I meant by "why not?" in response to your assertion that the 
constant hash restriction
doesn't (ok, I took that as a "can't" ;-) provide any performance gain.

I'm not sure I'm following here. . . for the case of an object caching its own hash, that seems to be a performance gain that is independent of the dictionary implementation. From the dictionary's point of view, it just calls hash() and works with the result, exactly as it does now.

I agree hash caching gives good optimisation - my point was that hash caching can be just as effective in the presence of mutable keys, so long as there is some way to invalidate the cached hash values when a mutation occurs.

Given that you assume you know when "rehashing" is going to be necessary -- 
meaning you
know when keys mutate.

Well, yes. Antoon's original complaint was that mutable objects could not be used as keys in dictionary, even if the application guaranteed key stability for the lifetime of the relevant dictionary entry.

I gave dict.rehash as a method equivalent to list.sort when working with mutables keys or list entries, respectively.

Or resist the temptation to guess, and have rehash() raise an exception :)
Easy out ;-)


I certainly thought so }:>

The identity keyed dict was based on Antoon's posted use case: he wanted to classify mutable objects, and then check for membership in those groups.
I thought he just wanted to use mutables as if they were immutable, so that
equal mutable keys would look up the same value in a dict.

That was in the original post, yes. Later, he gave an example that ran something like:

  for node in graph:
    if node in some_dict:
      # Do NOT mutate this node
      # Possibly do other things
      continue
    # Do something that mutates the (non-key) node

You can use lists for this, but the lookup is slow. Conventional dictionaries and sets give a fast lookup, but require that the items in use be hashable.
A constant hash works. But since it clogs a single bucket and gives no clue
of value mismatch, we'd have to time it to see how it compares with list.index.
Did the Timbot do both?


Write them both? I don't know.

However, list.index is a simple linear search of the array of objects (I was reading that code recently for unrelated reasons).

The dict hash collision resolution was almost certainly written by Tim - he definitely wrote the comments at the top of the file trying to explain how the heck it works. I have no real clue as to the performance of the lookup algorithm in the degenerate case of a large number of distinct keys having the same hash value, but reading the code suggests it would perform worse than the simple integer increment that list.index uses (as the dict lookup requires multiple if-tests and so forth on each iteration).

An identity dictionary (or set) solves the 'object classification' problem perfectly.
I missed the definition of those terms I'm afraid ;-)

It was in some other part of this rambling thread. It's easier to repeat the idea than it is to go find where that was :)

An 'identity dictionary' is just a dict that looks up keys based on identity rather than value. It ignores the keys' actual hash and comparison operations, and forces use of the default hash for hashing, and identity comparison for equality testing. It maps from key _objects_ to values, rather than from key values to other values.

An 'identity set' is simply the same idea applied to a set instead of a dictionary. Conceptually, it stores sets of arbitrary objects, rather than sets of values.

The 'object classification' problem is simply the situation where you have an existing collection of objects (e.g. graphs of a node), and you want to classify them according to some criteria. Dictionaries or sets are a nice way to do that, but the standard types are a pain if your original objects are mutable (this inconvenience being the basis of Antoon's original complaint).

For a classification problem, though, we're likely only interested in what categories each of the original objects falls into - so identity-based storage will usually be entirely adequate (and, in some cases, actually better then value-based storage, since we can avoid irrelevant comparisons of potentially complex objects). And the performance should romp it in when compared to using lists for the categorisation.

Hence why I suggested Antoon should consider pursuing collections.identity_dict and collections.identity_set if identity-based lookup would actually address his requirements. Providing these two data types seemed like a nice way to do an end run around the bulk of the 'potentially variable hash' key problem.

Cheers,
Nick.

--
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---------------------------------------------------------------
            http://boredomandlaziness.skystorm.net
--
http://mail.python.org/mailman/listinfo/python-list

Re: Mutable objects which define __hash__ (was Re: Why are tuples immutable?)

Reply via email to

Re: Mutable objects which define hash (was Re: Why are tuples immutable?)