<Sections cut where I don't feel I have anything to add to what Bengt already said>
Bengt Richter wrote:
A second time a key may be hashed is when it is used as a lookup key. This can be a reference to the identical key object first used, or it can be a new object. A new object has to be hashed in order to have a hash value to use in finding candidate keys to compare to. If _this_ object is used as a key again, it must be hashed again -- unless it is an object that caches its own hash and invalidates it when mutated so it can rehash itself if its __hash__ method is subsequently called.
Thus re-use of a different lookup key than the original provides an opportunity for a performance gain in the lookup if the key object caches it hash value for return by __hash__. That's what I meant by "why not?" in response to your assertion that the constant hash restriction doesn't (ok, I took that as a "can't" ;-) provide any performance gain.
I'm not sure I'm following here. . . for the case of an object caching its own hash, that seems to be a performance gain that is independent of the dictionary implementation. From the dictionary's point of view, it just calls hash() and works with the result, exactly as it does now.
I agree hash caching gives good optimisation - my point was that hash caching can be just as effective in the presence of mutable keys, so long as there is some way to invalidate the cached hash values when a mutation occurs.
Given that you assume you know when "rehashing" is going to be necessary -- meaning you know when keys mutate.
Well, yes. Antoon's original complaint was that mutable objects could not be used as keys in dictionary, even if the application guaranteed key stability for the lifetime of the relevant dictionary entry.
I gave dict.rehash as a method equivalent to list.sort when working with mutables keys or list entries, respectively.
Or resist the temptation to guess, and have rehash() raise an exception :)
Easy out ;-)
I certainly thought so }:>
The identity keyed dict was based on Antoon's posted use case: he wanted to classify mutable objects, and then check for membership in those groups.
I thought he just wanted to use mutables as if they were immutable, so that equal mutable keys would look up the same value in a dict.
That was in the original post, yes. Later, he gave an example that ran something like:
for node in graph: if node in some_dict: # Do NOT mutate this node # Possibly do other things continue # Do something that mutates the (non-key) node
You can use lists for this, but the lookup is slow. Conventional dictionaries and sets give a fast lookup, but require that the items in use be hashable.
A constant hash works. But since it clogs a single bucket and gives no clue of value mismatch, we'd have to time it to see how it compares with list.index. Did the Timbot do both?
Write them both? I don't know.
However, list.index is a simple linear search of the array of objects (I was reading that code recently for unrelated reasons).
The dict hash collision resolution was almost certainly written by Tim - he definitely wrote the comments at the top of the file trying to explain how the heck it works. I have no real clue as to the performance of the lookup algorithm in the degenerate case of a large number of distinct keys having the same hash value, but reading the code suggests it would perform worse than the simple integer increment that list.index uses (as the dict lookup requires multiple if-tests and so forth on each iteration).
An identity dictionary (or set) solves the 'object classification' problem perfectly.
I missed the definition of those terms I'm afraid ;-)
It was in some other part of this rambling thread. It's easier to repeat the idea than it is to go find where that was :)
An 'identity dictionary' is just a dict that looks up keys based on identity rather than value. It ignores the keys' actual hash and comparison operations, and forces use of the default hash for hashing, and identity comparison for equality testing. It maps from key _objects_ to values, rather than from key values to other values.
An 'identity set' is simply the same idea applied to a set instead of a dictionary. Conceptually, it stores sets of arbitrary objects, rather than sets of values.
The 'object classification' problem is simply the situation where you have an existing collection of objects (e.g. graphs of a node), and you want to classify them according to some criteria. Dictionaries or sets are a nice way to do that, but the standard types are a pain if your original objects are mutable (this inconvenience being the basis of Antoon's original complaint).
For a classification problem, though, we're likely only interested in what categories each of the original objects falls into - so identity-based storage will usually be entirely adequate (and, in some cases, actually better then value-based storage, since we can avoid irrelevant comparisons of potentially complex objects). And the performance should romp it in when compared to using lists for the categorisation.
Hence why I suggested Antoon should consider pursuing collections.identity_dict and collections.identity_set if identity-based lookup would actually address his requirements. Providing these two data types seemed like a nice way to do an end run around the bulk of the 'potentially variable hash' key problem.
Cheers, Nick.
-- Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia --------------------------------------------------------------- http://boredomandlaziness.skystorm.net -- http://mail.python.org/mailman/listinfo/python-list