Dave wrote: > I'm trying to clarify how Python avoids byte by byte > string comparisons most of the time. As I understand, > dictionaries keep strings, their keys (hash values), > and caches of their keys.
If you are talking about dictionaries with string keys, you also have the dictionary values, of course. > Caching keys helps to avoid > recalculation of a string's hash value. Correct (s/keys/string hash values/). > So, when two > strings need to be compared, only their cached keys > are compared, which improves performance as there is > no need for byte by byte comparison. No. When comparing to strings s1 and s2, the following operations occur: 1. is s1 and s2 the very same string? If yes, they are equal. 2. else, do they have the same size, the same first byte (which might be a null byte), and do they compare equal, byte-by-byte? If yes, they are equal, if not, they are not equal. 3. Is it perhaps some other compare operation (<,> <=, >, >=) that we want to perform? Do the slow algorithm. As you can see, the string hash is never consulted when comparing string objects. It is only consulted to find the potential dictionary slot in the first place. > Also, there is a global interning dictionary that > keeps interned strings. What I don't understand is why > strings are interned. How does it help with string > comparisons? Why you look up a dictionary entry, this happens: 1. compute the key hash. 2. find the corresponding dictionary slot If the slot is empty, KeyError. 3. compare the slot key with the search key. If they are equal: value found. If they are different: collision, go to the next key. Interned strings speed up step 1 and step 3. If you only have interned strings throughout, you always also have the hash value. Of course, you had to compute the hash value when adding the string to the interning dictionary. The real speedup is in step 3, and based on the assumption that collisions won't happen: if you lookup the key (e.g. to find the value of a global variable), you find the slot using the computed hash. Then: 1. if the slot is empty, it's a KeyError. 2. if the slot is not empty, you first compare for pointer equality. As collisions are supposedly unlikely, this will be an equal string most of the time. Then, if you have interning, it even will be the *same* string, so you only need to compare pointers to find out they are the same string. So assuming all strings are interned, this is how you do the dictionary lookup. 1. fetch the hash value from the key (no need to compute it - it's already cached) 2. go to the slot in the dict. 3. if the slot is empty (==NULL): KeyError 4. otherwise: success. As you can see, in that case, there is no need to compare the string values. If there are collisions, and if not all strings are interned, the algorithm gets more complicated, but four items above are assumed to be the common case. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list