I'm writing a spider. I have millions of URLs in a MySQL table, and I need to check quickly whether a URL has already been fetched. To speed this up, I am considering adding a "hash" column to the table, making it a unique key, and adding new URLs with the following SQL statement: insert ignore into urls (url, hash) values (newurl, hash_of_newurl).
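Something like this is what I have in mind (the table layout is just a sketch; column names and types are placeholders):

```sql
-- hypothetical schema: "hash" holds a fixed-length digest of the URL
CREATE TABLE urls (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url  TEXT NOT NULL,
    hash CHAR(32) NOT NULL,
    UNIQUE KEY (hash)
);

-- duplicates are silently skipped thanks to the unique key
INSERT IGNORE INTO urls (url, hash)
    VALUES ('http://example.com/', MD5('http://example.com/'));
```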
I believe this will be faster than making the "url" column itself the unique key and doing string comparison. Right?

However, when I tried Python's builtin hash() function, I found it produces different values on my two computers! On a Pentium 4, hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344. Does the hash function depend on the machine's word length? If it does, I must consider another hash algorithm, because the spider will run concurrently on several computers, some 32-bit and some 64-bit.

Is md5 a good choice? Or will it be so slow that I gain no performance over using the "url" column directly as the unique key? I will do some benchmarking to find out, but before getting my hands dirty, I would like to hear some advice from the experts here. :)
--
http://mail.python.org/mailman/listinfo/python-list
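For reference, here is the kind of digest function I'm considering instead of the builtin hash() (a minimal sketch; the function name is my own):

```python
import hashlib

def url_hash(url):
    # md5 produces the same 32-character hex digest on every platform,
    # regardless of word size -- unlike the builtin hash()
    return hashlib.md5(url.encode('utf-8')).hexdigest()

print(url_hash('a'))  # -> 0cc175b9c0f1b6a831c399e269772661
```

The resulting hex string fits a CHAR(32) column, so it can be stored and made unique in MySQL directly.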