Fredrik Lundh wrote:
>a for loop inside square brackets is a "list comprehension", and the
>result is a list.  if you use a list comprehension inside a function
>call, the full list is built *before* the function is called.  in this
>case, this would mean that the entire file would be read into memory
>before the set was constructed.
>
>if you change the square brackets to ordinary parentheses, you get a
>generator expression instead:
>
>    http://pyref.infogami.com/generator-expressions
>
>the generator expression results in an iterator object that calculates
>the values one by one.  if you pass it to a function that expects an
>iterator, that function will end up "running the for loop itself", and
>no extra storage is needed.  (in this case, you still need memory to
>hold the set, of course, so the difference between a list comprehension
>and a generator expression will only matter if you have lots of duplicates).
>

This is interesting.  I wonder how this compares to uniq in performance?
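
For reference, the difference described in the quoted text can be sketched roughly as follows. This is only an illustration, not code from the thread: the `io.StringIO` object is a stand-in for a real file, and the names `lines`, `from_list` and `from_gen` are made up for the example.

```python
import io

# Stand-in for a real file object (an assumption for this sketch);
# an open file yields its lines lazily in the same way.
lines = io.StringIO("a\nb\na\nc\nb\n")

# List comprehension: the complete list of lines is built first,
# and only then is set() called on it.
from_list = set([line.strip() for line in lines])

lines.seek(0)  # rewind so the same data can be read a second time

# Generator expression: set() pulls one line at a time,
# so no intermediate list is ever stored.
from_gen = set(line.strip() for line in lines)

# Both forms give the same set; only the peak memory use differs.
assert from_list == from_gen
```

Either way the set itself has to fit in memory, so the saving is the intermediate list, which matters most when the input has many duplicates.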

I actually had this problem a couple of weeks ago when I discovered
that my son's .Xsession file was 26 GB and had filled the disk
partition (!).  Apparently some games he was playing were spewing out
a lot of errors, and I wanted to find out which ones were at fault.

Basically, uniq died on this task (well, it probably was working, but
it had not completed after more than 10 hours).  I was using it
something like this:

    cat Xsession.errors | uniq > Xsession.uniq

It never occurred to me to use the Python dict/set approach.  Now I
wonder if it would've worked better somehow.  Of course my file was
26,000 times larger than the one in this problem, and definitely would
not fit in memory.  I suspect that there were as many as a million
duplicates for some messages in that file.  Would the generator
version above have helped me out, I wonder?

Unfortunately, I deleted the file, so I can't really try it out.  I
suppose I could create synthetic data with the logging module to
try it out.

Cheers,
Terry

--
Terry Hancock ([EMAIL PROTECTED])
Anansi Spaceworks http://www.AnansiSpaceworks.com
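
A minimal sketch of the dict/set approach applied to a log like this, assuming the goal is to see which messages repeat most often. The helper name `count_messages` and the sample data are hypothetical, and nothing here was run against the original file. Note that uniq only collapses adjacent duplicate lines, so repeats separated by other messages survive unless the input is sorted first; a dictionary keyed on the line has no such restriction.

```python
from collections import Counter
import io


def count_messages(lines):
    """Count how often each distinct line occurs, reading one line at a time."""
    # Counter consumes the generator expression lazily, so memory use grows
    # with the number of distinct lines, not with the total input size.
    return Counter(line.rstrip("\n") for line in lines)


# Synthetic stand-in for the log (an assumption for this sketch);
# a real run would pass an open file object instead.
sample = io.StringIO("err A\nerr B\nerr A\nerr A\nerr C\nerr B\n")
counts = count_messages(sample)

# Print the messages from most to least frequent.
for message, n in counts.most_common():
    print(n, message)
```

Whether this would cope with a 26 GB file depends on how many distinct messages it contains: a few messages repeated a million times each would need very little memory, while mostly unique lines would not fit.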