Using python to delta-load files into a central DB

2007-04-12 Thread Chris Nethery
Hello everyone,

I have a challenging issue I need to overcome and was hoping I might gain 
some insights from this group.

I am trying to speed up the process I am using, which is as follows:

1) I have roughly 700 files that are modified throughout the day by users, 
within a separate application

2) As modifications are made to the files, I use a polling service and mimic 
the lock-file strategy used by the separate software application

3) I generate a single 'load' file and bulk insert into a load table

4) I update/insert/delete from the load table

This is just too time consuming, in my opinion.
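
For context, the polling step (2) above currently boils down to something 
like the sketch below.  The directory, lock-file suffix, and interval are 
made up here for illustration; they are not the real ones the other 
application uses.

import os
import time

WATCH_DIR = r"C:\app\data"   # made-up location of the ~700 files
LOCK_SUFFIX = ".lck"         # made-up lock-file naming convention
POLL_SECONDS = 30            # made-up polling interval

def changed_files(last_seen):
    """Yield files modified since we last saw them, skipping any file the
    other application currently holds a lock file for."""
    for name in os.listdir(WATCH_DIR):
        if name.endswith(LOCK_SUFFIX):
            continue
        path = os.path.join(WATCH_DIR, name)
        if os.path.exists(path + LOCK_SUFFIX):
            continue                      # respect the application's lock
        mtime = os.path.getmtime(path)
        if mtime > last_seen.get(name, 0):
            last_seen[name] = mtime
            yield path

def poll_forever(process_file):
    last_seen = {}
    while True:
        for path in changed_files(last_seen):
            process_file(path)            # build load file, bulk insert, merge
        time.sleep(POLL_SECONDS)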

At present, users of the separate application can run recalculation 
functions that modify all 700 files at once, causing my code to take the 
whole ball of wax, rather than just the data that has changed.

What I would like to do is spawn separate processes and load only the delta 
data.  The data must be 100% reliable, so I'm leery of using something like 
difflib.  I also want to make sure that my code scales since the number of 
files is ever-increasing.
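
What I had in mind is something along these lines -- just a sketch, where 
load_delta is a placeholder for whatever ends up computing and loading the 
delta for one file:

from multiprocessing import Pool   # the 'processing' package offers the same Pool API

def load_delta(path):
    # Placeholder: compute the delta for one file and apply just those
    # rows to the database.
    pass

if __name__ == "__main__":
    # Hypothetical list of files flagged by the polling service.
    changed = ["file_0001.txt", "file_0002.txt"]
    pool = Pool(processes=4)        # tune to CPUs / DB connection limits
    pool.map(load_delta, changed)   # each file handled in its own worker
    pool.close()
    pool.join()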

I would be grateful for any feedback you could provide.


Thank you,

Chris Nethery 




Re: Using python to delta-load files into a central DB

2007-04-12 Thread Chris Nethery
Gabriel,

Thank you for your reply.

Yes, they are tab-delimited text files that will change very little 
throughout the day.

But, this is messy, antiquated 80s junk, nonetheless.

Each row either carries a row-type designation or is a comment.  Each row 
type has an identity value, but the 'comment' rows do not; a comment row is, 
however, logically associated with the last row type that precedes it. 
When I generate my bulk-insert file, I add the identity of that last 
occurring row type to each comment row, and I also generate and populate an 
additional identity column so that the order of the comments is retained.
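
In rough terms, that tagging step looks something like the following 
(simplified sketch; the way a comment row is recognized and the column that 
holds the identity are made up for illustration):

import csv

def tag_rows(lines):
    """Attach to each comment row the identity of the most recent row-type
    row, plus a running sequence number so comment order is preserved."""
    tagged = []
    current_identity = None
    comment_seq = 0
    for fields in csv.reader(lines, delimiter="\t"):
        if fields and fields[0]:            # a row-type row with its own identity
            current_identity = fields[0]
            comment_seq = 0
            tagged.append(fields + [current_identity, "0"])
        else:                               # a comment row: inherit identity, number it
            comment_seq += 1
            tagged.append(fields + [current_identity, str(comment_seq)])
    return tagged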

Generally rows will either be added or changed, but sometimes rows will be 
removed.  Typically, only 1-5 new rows will be added to a file in a given 
day, but users sometimes make manual corrections/deletions to older rows and 
sometimes certain column values are recalculated.

Did I mention that the header contains another implied hierarchy? 
Fortunately, I can just ignore it and strip it off.


Thank you,

Chris Nethery



"Gabriel Genellina" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> En Thu, 12 Apr 2007 14:05:15 -0300, Chris Nethery <[EMAIL PROTECTED]> 
> escribió:
>
>> At present, users of the separate application can run recalculation
>> functions that modify all 700 files at once, causing my code to take the
>> whole ball of wax, rather than just the data that has changed.
>
> Are they text files, or what?
> What kind of modifications? some lines changed/deleted/added? a column 
> recalculated along the whole file?
>
>> What I would like to do is spawn separate processes and load only the 
>> delta
>> data.  The data must be 100% reliable, so I'm leary of using something 
>> like
>> difflib.  I also want to make sure that my code scales since the number 
>> of
>> files is ever-increasing.
>
> Why don't you like difflib? AFAIK it has no known bugs.
>
> -- 
> Gabriel Genellina
> 



Re: Using python to delta-load files into a central DB

2007-04-13 Thread Chris Nethery
Gabriel,

I think that would work well.  Also, thank you for suggesting the use of 
filecmp.  I have never used this module, but it looks like a much better 
solution than what I had been doing previously--using os.stat and performing 
a DB lookup in order to verify that the filename and timestamp existed in a 
'file update' table.  Also, if the only limitation of difflib is that both 
files must reside in memory, that won't be a problem: the largest of these 
files is just over 200k, and if memory serves me right, I can't use more 
than 4MB anyway.  And, if I spawn separate processes for generating the 
delta files, I should be able to speed things up even more.
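
For what it's worth, here is roughly what I have in mind for the per-file 
delta, along the lines you suggested.  This is only a sketch; the snapshot 
bookkeeping and the mapping of the changed lines onto actual SQL are omitted.

import difflib
import filecmp

def delta_for_file(current_path, snapshot_path):
    """Return (added, removed) lines between the current file and the last
    snapshot we loaded, doing no work at all when nothing changed."""
    if filecmp.cmp(current_path, snapshot_path, shallow=False):
        return [], []                       # identical content, skip it
    current = open(current_path).readlines()
    snapshot = open(snapshot_path).readlines()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(None, snapshot, current)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(snapshot[i1:i2])  # rows to delete/update in the DB
        if tag in ("replace", "insert"):
            added.extend(current[j1:j2])     # rows to insert/update in the DB
    return added, removed

The added/removed lists would then drive the load-table inserts and the 
delete/update statements.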

Thanks again for your help!


Best Regards,

Christopher Nethery



"Gabriel Genellina" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> En Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <[EMAIL PROTECTED]> 
> escribió:
>
>> Yes, they are tab-delimited text files that will change very little
>> throughout the day.
>> But, this is messy, antiquated 80s junk, nonetheless.
>
> Ugh...
>
>> Rows are designated either by a row type or they contain a comment.  Each
>> row type has an identity value, but the 'comment' rows do not.  The 
>> comment
>> rows, however, are logically associated with the last occurring row type.
>> When I generate my bulk insert file, I add the identity of the last
>> occurring row type to the comment rows, and generate and populate an
>> additional identity column in order to retain the order of the comments.
>> Generally rows will either be added or changed, but sometimes rows will 
>> be
>> removed.  Typically, only 1-5 new rows will be added to a file in a given
>> day, but users sometimes make manual corrections/deletions to older rows 
>> and
>> sometimes certain column values are recalculated.
>
> http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html
>
> You could keep a copy of all files - let's say, as they were yesterday.
> When you want to process the changes, iterate over all files and see if 
> they are newer than your copy. You could use the filecmp module: 
> http://docs.python.org/lib/module-filecmp.html
> For each modified file: load it, and process the comments adding the 
> associated row type and the identity. Just do the same with the 
> "yesterday" file. (I assume they're not so big that you can keep both in 
> memory). You have then two lists of lines; then, use the functions in 
> module difflib to detect the changed lines; based on those results, 
> generate your database inserts/deletes/updates.
>
> This way you will not process the unchanged files, and inside each file, 
> you will ignore unchanged lines. At least in principle it should be faster 
> than redoing all from scratch each time...
>
>> Did I mention that the header contains another implied hierarchy?
>> Fortunately, I can just ignore it and strip it off.
>
> good - I imagine it's enough work as it is now...
>
> -- 
> Gabriel Genellina
> 

