On 2009-02-06 09:10, Curt Hash wrote:
> I'm writing a small application for detecting source code plagiarism that
> currently relies on a database to store lines of code.
>
> The application has two primary functions: adding a new file to the database
> and comparing a file to those that are already stored in the database.
>
> I started out using sqlite3, but was not satisfied with the performance
> results. I then tried using psycopg2 with a local postgresql server, and the
> performance got even worse. My simple benchmarks show that sqlite3 is an
> average of 3.5 times faster at inserting a file, and on average less than a
> tenth of a second slower than psycopg2 at matching a file.
>
> I expected postgresql to be a lot faster ... is there some peculiarity in
> psycopg2 that could be causing slowdown? Are these performance results
> typical? Any suggestions on what to try from here? I don't think my
> code/queries are inherently slow, but I'm not a DBA or a very accomplished
> Python developer, so I could be wrong.
>
> Any advice is appreciated.

In general, if you do bulk inserts into a large table, you should
consider turning off indexing on the table and recreating or updating
the indexes in one go afterwards.

But regardless of this detail, I think you should consider a
filesystem-based approach. This is going to be a lot faster than
using a database to store the source code line by line. You can
still use a database for the administration and indexing of the
data, e.g. by storing a hash of each line in the database.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source
http://www.egenix.com/

--
http://mail.python.org/mailman/listinfo/python-list
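
To make the two suggestions in the reply above concrete, here is a
minimal sketch using sqlite3. It is only an illustration, not code from
the thread: the table name line_hashes, the index name idx_digest, the
store directory layout, and all function names are assumptions. Files
are copied to a directory on disk, only a SHA-1 hash of each non-blank,
whitespace-stripped line goes into the database, and the index is
dropped before a batch of inserts and rebuilt once afterwards.

```python
import hashlib
import os
import shutil
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) schema: one row per stored line hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line_hashes ("
        "file_id TEXT NOT NULL, line_no INTEGER NOT NULL, digest TEXT NOT NULL)"
    )
    return conn


def hash_lines(path):
    """Yield (line_no, digest) for each non-blank line of the file."""
    with open(path, "rb") as fh:
        for line_no, raw in enumerate(fh, start=1):
            stripped = raw.strip()  # ignore indentation and line endings
            if stripped:
                yield line_no, hashlib.sha1(stripped).hexdigest()


def add_files(conn, paths, store_dir):
    """Copy files into store_dir and bulk-insert their line hashes."""
    os.makedirs(store_dir, exist_ok=True)
    # Drop the index before the bulk load; it is rebuilt once at the end.
    conn.execute("DROP INDEX IF EXISTS idx_digest")
    with conn:  # a single transaction for the whole batch
        for path in paths:
            # Simplification: the base name serves as the file identifier,
            # so two files with the same name would collide.
            file_id = os.path.basename(path)
            shutil.copyfile(path, os.path.join(store_dir, file_id))
            conn.executemany(
                "INSERT INTO line_hashes VALUES (?, ?, ?)",
                ((file_id, n, d) for n, d in hash_lines(path)),
            )
    conn.execute("CREATE INDEX idx_digest ON line_hashes (digest)")


def match_file(conn, path):
    """Map each stored file_id to the number of lines of `path` it shares."""
    counts = {}
    for _, digest in hash_lines(path):
        rows = conn.execute(
            "SELECT DISTINCT file_id FROM line_hashes WHERE digest = ?",
            (digest,),
        )
        for (file_id,) in rows:
            counts[file_id] = counts.get(file_id, 0) + 1
    return counts
```

Calling add_files(conn, ["a.py", "b.py"], "store") followed by
match_file(conn, "candidate.py") would return a dictionary of stored
file names and shared-line counts. Whether dropping the index pays off
depends on the batch size; for a single small file it may well cost
more than it saves, so it is worth benchmarking both ways.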