Fredrik Lundh wrote:

>a for loop inside square brackets is a "list comprehension", and the
>result is a list.  if you use a list comprehension inside a function 
>call, the full list is built *before* the function is called.  in this 
>case, this would mean that the entire file would be read into memory 
>before the set was constructed.
>
>if you change the square brackets to ordinary parentheses, you get a 
>generator expression instead:
>
>     http://pyref.infogami.com/generator-expressions
>
>the generator expression results in an iterator object that calculates 
>the values one by one.  if you pass it to a function that expects an 
>iterator, that function will end up "running the for loop itself", and 
>no extra storage is needed.  (in this case, you still need memory to 
>hold the set, of course, so the difference between a list comprehension 
>and a generator expression will only matter if you have lots of duplicates).
>  
>
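For reference, the difference described in the quoted text can be sketched as follows. This is only an illustration: the in-memory `data` object is a made-up stand-in for a real file, and the code assumes a current Python 3 interpreter.

```python
import io

# Hypothetical stand-in for a log file with many duplicate lines.
data = io.StringIO("error A\nerror B\nerror A\nerror A\n")

# List comprehension: the complete list of lines is built first,
# and only then handed to set().
data.seek(0)
unique_from_list = set([line.strip() for line in data])

# Generator expression: set() pulls the lines one at a time,
# so no intermediate list is ever stored.
data.seek(0)
unique_from_gen = set(line.strip() for line in data)
```

Both forms give the same set. Only the peak memory use differs, and only by the size of the intermediate list.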
This is interesting. I wonder how this compares to uniq in
performance?

I actually had this problem a couple of weeks ago when I discovered
that my son's .Xsession file was 26 GB and had filled the disk
partition (!).  Apparently some games he was playing were spewing
out a lot of errors, and I wanted to find out which ones were at fault.

Basically, uniq died on this task (well, it was probably still working,
but it had not finished after more than 10 hours).  I was using it
something like this:

cat Xsession.errors | uniq > Xsession.uniq

It never occurred to me to use the Python dict/set approach.  Now I
wonder if it would've worked better somehow.  Of course my file was
26,000 times larger than the one in this problem, and definitely would
not fit in memory.  I suspect that there were as many as a million
duplicates for some messages in that file.  Would the generator
version above have helped me out, I wonder?
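
A minimal sketch of what that might look like, assuming the number of distinct messages is small enough to fit in memory even though the file is not. The function name `count_messages` is made up for illustration, and `collections.Counter` is used here in place of a plain set so that the worst offenders can be ranked.

```python
from collections import Counter

def count_messages(path):
    """Count how often each distinct line occurs, reading one line at a time."""
    counts = Counter()
    # errors="replace" is a precaution in case the log contains bytes
    # that are not valid in the default text encoding.
    with open(path, errors="replace") as f:
        for line in f:  # the file object yields lines lazily
            counts[line.rstrip("\n")] += 1
    return counts
```

Something like `count_messages("Xsession.errors").most_common(10)` would then list the ten most frequent messages. Memory use grows with the number of distinct lines rather than with the size of the file. Note also that uniq only collapses adjacent duplicate lines, so without a prior sort its output can still contain many repeats.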

Unfortunately, I deleted the file, so I can't really try it out. I suppose
I could create synthetic data with the logging module to try it out.

Cheers,
Terry

-- 
Terry Hancock ([EMAIL PROTECTED])
Anansi Spaceworks http://www.AnansiSpaceworks.com


-- 
http://mail.python.org/mailman/listinfo/python-list
