Grant Edwards wrote: > On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: > > > I'm writing a spider. I have millions of urls in a table (mysql) to > > check if a url has already been fetched. To check fast, I am > > considering to add a "hash" column in the table, make it a unique key, > > and use the following sql statement: > > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) > > to add new url. > > > > I believe this will be faster than making the "url" column unique key > > and doing string comparation. Right? > > I doubt it will be significantly faster. Comparing two strings > and hashing a string are both O(N).
Playing Devil's Advocate: The hash would be a one-time operation during database insertion, whereas string comparison would happen every search. Conceivably, comparing hash strings (which is O(1)) could result in a big savings compared to comparing regular strings; but I expect most decent sql implementations already hash data internally, so rolling your own hash would be useless at best. If the OP's database is lacking, md5 is probably fine. Perhaps using a subset of the md5 (the low 32 bits, say) could speed up comparisons at risk of more collisions. Probably a good trade off unless the DB is humungous. Carl Banks -- http://mail.python.org/mailman/listinfo/python-list