On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:
>
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
>> right?
>
> My gut feeling tells me that this would matter more for join operations than
> lookup of a value. If you plan to do joins you could use an autoinc integer
> as the primary key and an additional string key for lookup.

I'm not doing any join operations.  I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use".  In this particular case, I'm getting header
values out of messages & doing this:

    for this_string in these_strings:
        if not already_seen(this_string):
            process(this_string)
        # ignore if already seen

...
> and only if you can demonstrate a significant speedup keep the complication
> in your code.
>
> If you find such a speedup I'd like to see the numbers because this cries
> PREMATURE OPTIMIZATION...

On further reflection, I think I asked for that.  In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).

--
But the government always tries to coax well-known writers into the
Establishment; it makes them feel educated.        [Robert Graves]
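
For illustration, the hash-only table described above might look roughly like the sketch below. Everything beyond the name already_seen is an assumption made for the example: the table and column names, the helpers string_key and open_seen_db, and the choice of a truncated SHA-1 digest as the key. A digest is used instead of the built-in hash() because string hashing is randomized per process by default from Python 3.3 on, so hash() values would not survive between runs. Unlike the loop above, this already_seen also records the string when it is new.

```python
import hashlib
import sqlite3


def string_key(text):
    """Map a string to a signed 64-bit integer that is stable between runs.

    Assumption: the first 8 bytes of a SHA-1 digest are used, so the value
    fits SQLite's 64-bit INTEGER type. Collisions are unlikely but possible.
    """
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def open_seen_db(path=":memory:"):
    """Open (or create) a database with a one-column table of seen hashes."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (hash INTEGER PRIMARY KEY)")
    return conn


def already_seen(conn, text):
    """Return True if text was recorded earlier; otherwise record it and return False."""
    key = string_key(text)
    row = conn.execute("SELECT 1 FROM seen WHERE hash = ?", (key,)).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen (hash) VALUES (?)", (key,))
    return False
```

With that in place the loop becomes "if not already_seen(conn, this_string): process(this_string)", followed by conn.commit() at the end of the run. Since only the truncated hash is stored, two different strings that happen to collide would be treated as duplicates; that is the price of not storing the strings themselves.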