* marc magrans de abril:
Dear colleagues,
I was writing a small program to classify log files for a cluster of
PCs; I just wanted to simplify a quite repetitive task of finding
errors and the like.
My first naive implementation was something like:
patterns = []
while logs:
    pattern = logs[0]
    new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
    entry = (len(logs) - len(new_logs), pattern)
    patterns.append(entry)
    logs = new_logs
Here dist(...) is the Levenshtein distance (i.e. edit distance) and
logs holds something like 1.5M log lines (a 700 MB file). I thought
Python would be an easy choice, although not really fast.
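For reference, dist() isn't shown in the post; a plain
dynamic-programming Levenshtein along the following lines is a
reasonable stand-in (this is an assumed implementation, not the
original one). Note that it costs O(len(a) * len(b)) per call:

def dist(a, b):
    # Dynamic-programming Levenshtein (edit) distance, O(len(a) * len(b)).
    # Keep the longer string in 'a' so the rolling row stays short.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(current[j - 1] + 1,            # insertion
                               previous[j] + 1,               # deletion
                               previous[j - 1] + (ca != cb))) # substitution
        previous = current
    return previous[-1]

For example, dist("kitten", "sitting") returns 3.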
I was not surprised when the first iteration of the while loop took
~10 min. I thought "not bad, let's see how long it takes". However,
it seemed that the second iteration never finished.
My surprise was big when I replaced the list comprehension with an
explicit loop and a print:
new_logs = []
for count, l in enumerate(logs):
    print count
    if dist(pattern, l) > THRESHOLD:
        new_logs.append(l)
The surprise was that the displayed counter was running ~10 times
slower on the second iteration of the while loop.
I am a little lost. Does anyone know the reason for this behavior?
It's on line 42 of your program. :-) That is, it's in the dist function.
Evidently it doesn't like a more complex 'pattern'.
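A rough way to check that hypothesis (not from the original post; it
assumes a dynamic-programming dist() like the sketch above, and the
log lines are faked with random text): time dist() against patterns
of increasing length and watch the per-call cost grow with the
pattern's length.

import random
import string
import timeit

def dist(a, b):
    # Same dynamic-programming Levenshtein stand-in as above.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(current[j - 1] + 1,
                               previous[j] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def random_line(n):
    # Hypothetical stand-in for a log line of length n.
    return ''.join(random.choice(string.ascii_letters) for _ in range(n))

line = random_line(200)
for pattern_length in (50, 200, 800):
    pattern = random_line(pattern_length)
    seconds = timeit.timeit(lambda: dist(pattern, line), number=10)
    print("pattern length %4d: %.3f s for 10 calls" % (pattern_length, seconds))

If the first log line happens to be short and the first line that
survives the first pass is several times longer, every dist() call in
the second pass gets correspondingly more expensive, which would
explain the ~10x slowdown.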
How should I write a program that deals with large data sets in Python?
As in any other language. Try to avoid repeating the same computations. Try to
make the data fit the computational task.
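One concrete illustration of that advice (a sketch only, not from the
original post; the names classify and threshold are made up, and dist
is any Levenshtein implementation, e.g. the one above): the Levenshtein
distance between two strings can never be smaller than the difference
of their lengths, so a cheap length comparison can skip most of the
expensive dist() calls while the greedy loop itself stays the same.

def classify(logs, threshold, dist):
    # Greedy grouping as in the original post, with a cheap prefilter:
    # Levenshtein(a, b) >= abs(len(a) - len(b)), so when the length gap
    # already exceeds the threshold the line certainly does not match
    # the pattern and the expensive dist() call can be skipped.
    patterns = []
    while logs:
        pattern = logs[0]
        plen = len(pattern)
        new_logs = [l for l in logs
                    if abs(plen - len(l)) > threshold
                    or dist(pattern, l) > threshold]
        patterns.append((len(logs) - len(new_logs), pattern))
        logs = new_logs
    return patterns

Depending on how varied the line lengths are, this alone can skip a
large fraction of the dist() calls; grouping lines by length or by
their first few tokens first would go further in the "make the data
fit the task" direction.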
Cheers & hth.,
- Alf