2009/2/27 Steve Holden <st...@holdenweb.com>:
> Assuming no duplicates, how does this help? You still have to verify
> collisions.
>
>> Pretty brutish and slow, but it's the first algorithm which comes to
>> mind. Of course, I'm assuming that the list items are long enough to
>> warrant using a hash and not the values themselves.
>>
> The problem is you can't guarantee any hash value will be unique, so
> ultimately you have to check whether list entries hashing to the same
> value are or are not equal. Which would mean either having the whole
> list in memory or having access to it via some other random access
> method.
You don't need random access to the whole list, just to the records that
collide on a hash. Although you do end up using the external storage as
random access, which can be very slow, you've probably drastically reduced
the number of such random accesses, which should be a big saving. The best
case is if the number of collisions is low enough for the random access to
be dealt with by buffering, avoiding wasteful seeks of the external
storage, in which case it's a complete win.

Remember that if the records with identical hashes are not collisions but
genuine duplicates you can throw them away as soon as they're checked, so
with some clever buffering you can stop them from clogging up the buffer.
The worst case is if there are a lot of genuine collisions, in which case
it's probably not a very good hash.
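Something along these lines is roughly what I have in mind (untested, and
assuming newline-separated records; find_duplicates and the choice of
SHA-1 are just for illustration):

    import hashlib
    from collections import defaultdict

    def find_duplicates(path):
        # Assumes newline-separated records, each small enough to hold
        # in memory once you've seeked to it.
        # Pass 1: sequential scan; keep only (hash -> offsets) in memory,
        # which is much smaller than the records themselves.
        offsets_by_hash = defaultdict(list)
        with open(path, 'rb') as f:
            while True:
                pos = f.tell()
                record = f.readline()
                if not record:
                    break
                offsets_by_hash[hashlib.sha1(record).digest()].append(pos)

        # Pass 2: random access only to records whose hashes collide,
        # comparing them byte-for-byte to separate genuine duplicates
        # from mere hash collisions.
        duplicates = []
        with open(path, 'rb') as f:
            for offsets in offsets_by_hash.values():
                if len(offsets) < 2:
                    continue
                seen = {}  # record contents -> first offset seen
                for pos in offsets:
                    f.seek(pos)
                    record = f.readline()
                    if record in seen:
                        duplicates.append((seen[record], pos))  # duplicate
                    else:
                        seen[record] = pos  # collision: same hash, different record
        return duplicates

With a full SHA-1 of each record, genuine collisions should be vanishingly
rare; the buffering concerns above matter more if you use a cheaper or
shorter hash to keep the in-memory table small.

--
Tim Rowe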