I'm using Python 3.3 and the sqlite3 module in the standard library. I'm processing a lot of strings from input files (among other things, values of headers in e-mail & news messages) and suppressing duplicates using a table of seen strings in the database.
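For concreteness, here is a minimal sketch of the kind of seen-strings table described above. The table name `seen`, the column name `value`, and the helper names `open_seen_db` and `is_duplicate` are made up for illustration; this is not the actual code in question.

```python
import sqlite3

def open_seen_db(path=":memory:"):
    # Hypothetical schema: one TEXT PRIMARY KEY column holding each seen string.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)")
    return conn

def is_duplicate(conn, value):
    # SELECT test: has this exact string been recorded before?
    row = conn.execute(
        "SELECT 1 FROM seen WHERE value = ?", (value,)
    ).fetchone()
    if row is not None:
        return True
    # First sighting: remember it for later lookups.
    conn.execute("INSERT INTO seen (value) VALUES (?)", (value,))
    return False
```

Each header value would be passed through `is_duplicate()` once, and only values for which it returns `False` would be processed further.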
It seems to me that the SELECT tests should be faster if I look up an INTEGER PRIMARY KEY value rather than a TEXT PRIMARY KEY. I base this on past experience with other things, where testing integers for equality is faster than testing strings, and on reading the SQLite3 documentation about INTEGER PRIMARY KEY. Is that right?

If so, what sort of hashing function should I use? The largest integer SQLite3 can store (a signed 64-bit value, 2^63 - 1) is far smaller than even a 128-bit MD5 hash. The only thing I've thought of so far is to use MD5 or SHA-something modulo that maximum value. (Security isn't an issue, i.e., I'm not worried about someone trying to create a hash collision.)

Thanks,
Adam

-- 
"It is the role of librarians to keep government running in difficult times," replied Dramoren. "Librarians are the last line of defence against chaos." (McMullen 2001)
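A rough sketch of the hash-to-integer idea from the question, under some assumptions: it takes the first 8 bytes of the MD5 digest and reads them as a signed 64-bit integer, rather than reducing the full digest modulo the maximum value. The helper name `hash_to_int64` and the UTF-8 encoding are choices made for illustration. Whether this actually makes lookups faster than a TEXT PRIMARY KEY would still need measuring.

```python
import hashlib
import struct

def hash_to_int64(text):
    # MD5 yields a 16-byte (128-bit) digest; UTF-8 is an assumed encoding.
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # Keep the first 8 bytes and interpret them as a big-endian signed
    # 64-bit integer, which fits SQLite's INTEGER storage class.
    (key,) = struct.unpack(">q", digest[:8])
    return key
```

One caveat with this design: two different strings can map to the same 64-bit key, and if the key alone is used as the primary key, the second string would be wrongly treated as a duplicate. That is unlikely at modest volumes but not impossible.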