On 2014-05-22, Chris Angelico wrote: > On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24...@ducksburg.com> wrote: >> I'm using Python 3.3 and the sqlite3 module in the standard library. >> I'm processing a lot of strings from input files (among other things, >> values of headers in e-mail & news messages) and suppressing >> duplicates using a table of seen strings in the database. >> >> It seems to me --- from past experience with other things, where >> testing integers for equality is faster than testing strings, as well >> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY >> --- that the SELECT tests should be faster if I am looking up an >> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that >> right? > > It might be faster to use an integer primary key, but the possibility > of even a single collision means you can't guarantee uniqueness > without a separate check. I don't know sqlite3 well enough to say, but > based on what I know of PostgreSQL, it's usually best to make your > schema mimic your logical structure, rather than warping it for the > sake of performance. With a good indexing function, the performance of > a textual PK won't be all that much worse than an integral one, and > everything you do will read correctly in the code - no fiddling around > with hashes and collision checks. > > Stick with the TEXT PRIMARY KEY and let the database do the database's > job. If you're processing a really large number of strings, you might > want to consider moving from sqlite3 to PostgreSQL anyway (I've used > psycopg2 quite happily), as you'll get better concurrency; and that > might solve your performance problem as well, as Pg plays very nicely > with caches.
Well, actually I'm thinking about doing away with checking for duplicates at this stage, since the substrings that I pick out of the deduplicated header values go into another table as the TEXT PRIMARY KEY anyway, with deduplication there. So I think this stage reeks of premature optimization. -- The history of the world is the history of a privileged few. --- Henry Miller -- https://mail.python.org/mailman/listinfo/python-list