On Fri, Feb 6, 2009 at 5:19 AM, M.-A. Lemburg <m...@egenix.com> wrote:
> On 2009-02-06 09:10, Curt Hash wrote:
>> I'm writing a small application for detecting source code plagiarism that
>> currently relies on a database to store lines of code.
>>
>> The application has two primary functions: adding a new file to the database
>> and comparing a file to those that are already stored in the database.
>>
>> I started out using sqlite3, but was not satisfied with the performance
>> results. I then tried using psycopg2 with a local postgresql server, and the
>> performance got even worse. My simple benchmarks show that sqlite3 is an
>> average of 3.5 times faster at inserting a file, and on average less than a
>> tenth of a second slower than psycopg2 at matching a file.
>>
>> I expected postgresql to be a lot faster ... is there some peculiarity in
>> psycopg2 that could be causing slowdown? Are these performance results
>> typical? Any suggestions on what to try from here? I don't think my
>> code/queries are inherently slow, but I'm not a DBA or a very accomplished
>> Python developer, so I could be wrong.
>>
>> Any advice is appreciated.
>
> In general, if you do bulk insert into a large table, you should consider
> turning off indexing on the table and recreate/update the indexes in one
> go afterwards.
>
> But regardless of this detail, I think you should consider a filesystem
> based approach. This is going to be a lot faster than using a
> database to store the source code line by line. You can still use
> a database for the administration and indexing of the data, e.g.
> by storing a hash of each line in the database.
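On the indexing suggestion: if I understand it correctly, something like
this is what you mean (just a sketch -- the 'lines' table, the
'lines_hash_idx' index, and the connection settings are placeholders for
my real schema):

    import hashlib
    import psycopg2

    # Placeholder connection and schema: lines(file_id, lineno, hash).
    conn = psycopg2.connect("dbname=plagiarism")
    cur = conn.cursor()

    def add_file(path, file_id):
        """Bulk-insert one file's line hashes, rebuilding the index after."""
        with open(path, 'rb') as f:
            rows = [(file_id, lineno, hashlib.sha1(line).hexdigest())
                    for lineno, line in enumerate(f, 1)]
        # Drop the index so the insert doesn't pay for per-row index updates ...
        cur.execute("DROP INDEX IF EXISTS lines_hash_idx")
        cur.executemany(
            "INSERT INTO lines (file_id, lineno, hash) VALUES (%s, %s, %s)",
            rows)
        # ... then rebuild it in a single pass.
        cur.execute("CREATE INDEX lines_hash_idx ON lines (hash)")
        conn.commit()

I suspect dropping and recreating the index per file only pays off for
large files; for many small files I would probably batch several inserts
per rebuild.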
I can see how reconstructing source code from individual lines in the database would be much slower than a filesystem-based approach. What matters most to me, though, is that the matching itself be fast. While the original lines of code are stored in the database, the matching is done on hashes alone. Would storing the original code in the same table as the hashes cause a significant slowdown if I am querying by hash only? I may try this approach anyway, just to make retrieving the original source code faster once a match is found, but I am still primarily concerned with the speed of the hash lookups.
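For what it's worth, the matching query I have in mind is just a hash
lookup, something like this (again only a sketch, reusing the made-up
'lines' table from above, now with the original 'line' text stored
alongside each hash):

    def match_file(path):
        """Return stored rows whose line hashes also appear in this file."""
        with open(path, 'rb') as f:
            hashes = [hashlib.sha1(line).hexdigest() for line in f]
        # Only the indexed hash column appears in the WHERE clause; the
        # stored line text is merely fetched for whichever rows match.
        cur.execute(
            "SELECT file_id, lineno, line FROM lines WHERE hash = ANY(%s)",
            (hashes,))
        return cur.fetchall()

(psycopg2 adapts the Python list to a PostgreSQL array for the ANY(%s)
comparison.) My hope is that the extra text column only costs disk space
and row width, not lookup time, but I'd be glad to be corrected.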