Gabriel, I think that would work well. Also, thank you for suggesting the filecmp module. I had never used it, and it looks like a much better solution than what I had been doing previously: calling os.stat and performing a DB lookup to verify that the filename and timestamp existed in a 'file update' table. And if the only limitation of difflib is that both files must reside in memory, I should be fine: the largest of these files is just over 200 KB, and if memory serves, I can't use more than 4 MB anyway. If I spawn separate processes to generate the delta files, I should be able to speed things up even more.
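For what it's worth, the "has this file changed since yesterday's snapshot?" check might look something like this rough sketch (the file names and contents are invented for illustration):

```python
import filecmp
import os
import tempfile

def has_changed(current_path, snapshot_path):
    """Return True if the file differs from yesterday's snapshot.

    shallow=False forces a byte-by-byte content comparison instead of
    comparing only the os.stat() signatures (type, size, mtime).
    """
    return not filecmp.cmp(current_path, snapshot_path, shallow=False)

# Demonstration with hypothetical temp files standing in for the real
# tab-delimited data file and its "yesterday" snapshot copy.
tmp = tempfile.mkdtemp()
cur = os.path.join(tmp, "orders.txt")
snap = os.path.join(tmp, "orders.snap")
with open(cur, "w") as f:
    f.write("1\tWidget\t10\n")
with open(snap, "w") as f:
    f.write("1\tWidget\t10\n")

print(has_changed(cur, snap))   # identical contents -> False
with open(cur, "a") as f:
    f.write("2\tGadget\t5\n")   # simulate a daytime edit
print(has_changed(cur, snap))   # modified -> True
```

Note that the default shallow=True would be cheaper (stat only), but a byte comparison is safer if the files can be rewritten with unchanged timestamps.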
Thanks again for your help!

Best Regards,
Christopher Nethery

"Gabriel Genellina" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> En Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <[EMAIL PROTECTED]>
> escribió:
>
>> Yes, they are tab-delimited text files that will change very little
>> throughout the day.
>> But, this is messy, antiquated 80s junk, nonetheless.
>
> Ugh...
>
>> Rows are designated either by a row type or they contain a comment. Each
>> row type has an identity value, but the 'comment' rows do not. The
>> comment rows, however, are logically associated with the last occurring
>> row type. When I generate my bulk insert file, I add the identity of the
>> last occurring row type to the comment rows, and generate and populate an
>> additional identity column in order to retain the order of the comments.
>> Generally rows will either be added or changed, but sometimes rows will
>> be removed. Typically, only 1-5 new rows will be added to a file in a
>> given day, but users sometimes make manual corrections/deletions to older
>> rows and sometimes certain column values are recalculated.
>
> http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html
>
> You could keep a copy of all files - let's say, as they were yesterday.
> When you want to process the changes, iterate over all files and see if
> they are newer than your copy. You could use the filecmp module:
> http://docs.python.org/lib/module-filecmp.html
> For each modified file: load it, and process the comments adding the
> associated row type and the identity. Just do the same with the
> "yesterday" file. (I assume they're not so big that you can keep both in
> memory). You have then two lists of lines; then, use the functions in
> module difflib to detect the changed lines; based on those results,
> generate your database inserts/deletes/updates.
>
> This way you will not process the unchanged files, and inside each file,
> you will ignore unchanged lines. At least in principle it should be faster
> than redoing all from scratch each time...
>
>> Did I mention that the header contains another implied hierarchy?
>> Fortunately, I can just ignore it and strip it off.
>
> good - I imagine it's enough work as it is now...
>
> --
> Gabriel Genellina
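For reference, the line-diff step Gabriel describes could be sketched with difflib.SequenceMatcher, mapping its opcodes onto database operations (a rough sketch; the sample rows are invented, not from the real files):

```python
import difflib

# Hypothetical tab-delimited rows: yesterday's processed copy vs. today's.
yesterday = ["1\tWidget\t10\n", "2\tGadget\t5\n", "3\tSprocket\t7\n"]
today     = ["1\tWidget\t10\n", "2\tGadget\t6\n", "3\tSprocket\t7\n",
             "4\tCog\t2\n"]

ops = []
matcher = difflib.SequenceMatcher(None, yesterday, today)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # 'equal' blocks are skipped entirely -- unchanged lines cost nothing.
    if tag == "replace":
        ops.extend(("UPDATE", line) for line in today[j1:j2])
    elif tag == "insert":
        ops.extend(("INSERT", line) for line in today[j1:j2])
    elif tag == "delete":
        ops.extend(("DELETE", line) for line in yesterday[i1:i2])

for op, line in ops:
    print(op, line.rstrip())
# prints:
# UPDATE 2	Gadget	6
# INSERT 4	Cog	2
```

Each (op, line) pair would then be turned into the corresponding SQL statement; only the changed rows ever touch the database.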
-- http://mail.python.org/mailman/listinfo/python-list