On Thu, 22 May 2014 12:47:31 +0100, Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library. I'm
> processing a lot of strings from input files (among other things, values
> of headers in e-mail & news messages) and suppressing duplicates using a
> table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where testing
> integers for equality is faster than testing strings, as well as from
> reading the SQLite3 documentation about INTEGER PRIMARY KEY --- that the
> SELECT tests should be faster if I am looking up an INTEGER PRIMARY KEY
> value rather than TEXT PRIMARY KEY. Is that right?
>
> If so, what sort of hashing function should I use? The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes. The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value. (Security isn't an issue --- i.e., I'm not worried about
> someone trying to create a hash collision.)
>
> Thanks,
> Adam

Why not just set the field in the DB to be unique and then catch the error
when you try to write a duplicate? Let the DB engine handle the task.
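
A minimal sketch of that suggestion, using only the standard-library sqlite3 module. The table name "seen", the column name "value", and the helper add_if_new() are made up for illustration; they are not from the original post. It assumes a duplicate insert raises sqlite3.IntegrityError, which is what the module raises when a UNIQUE or PRIMARY KEY constraint is violated.

```python
import sqlite3


def make_seen_table(conn):
    # TEXT PRIMARY KEY implies a uniqueness constraint on the column,
    # so SQLite itself rejects duplicate strings.
    conn.execute("CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)")


def add_if_new(conn, value):
    """Insert value; return True if it was new, False if already seen."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if the INSERT raises.
        with conn:
            conn.execute("INSERT INTO seen (value) VALUES (?)", (value,))
    except sqlite3.IntegrityError:
        # The uniqueness constraint fired: the string was already stored.
        return False
    return True


conn = sqlite3.connect(":memory:")  # in-memory DB, just for the demo
make_seen_table(conn)

headers = ["a@example.com", "b@example.com", "a@example.com"]
results = [add_if_new(conn, h) for h in headers]
stored = conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
```

This way there is no separate SELECT before each INSERT and no hand-rolled hash: the lookup is done by the index that backs the uniqueness constraint. Whether that is fast enough for the data in question would need measuring.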