On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24...@ducksburg.com> wrote: > I'm using Python 3.3 and the sqlite3 module in the standard library. > I'm processing a lot of strings from input files (among other things, > values of headers in e-mail & news messages) and suppressing > duplicates using a table of seen strings in the database. > > It seems to me --- from past experience with other things, where > testing integers for equality is faster than testing strings, as well > as from reading the SQLite3 documentation about INTEGER PRIMARY KEY > --- that the SELECT tests should be faster if I am looking up an > INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that > right?
It might be faster to use an integer primary key, but the possibility of even a single collision means you can't guarantee uniqueness without a separate check. I don't know sqlite3 well enough to say, but based on what I know of PostgreSQL, it's usually best to make your schema mimic your logical structure, rather than warping it for the sake of performance. With a good indexing function, the performance of a textual PK won't be all that much worse than an integral one, and everything you do will read correctly in the code - no fiddling around with hashes and collision checks. Stick with the TEXT PRIMARY KEY and let the database do the database's job. If you're processing a really large number of strings, you might want to consider moving from sqlite3 to PostgreSQL anyway (I've used psycopg2 quite happily), as you'll get better concurrency; and that might solve your performance problem as well, as Pg plays very nicely with caches. ChrisA -- https://mail.python.org/mailman/listinfo/python-list