> I actually had this problem a couple of weeks ago when I
> discovered that my son's .Xsession file was 26 GB and had
> filled the disk partition (!). Apparently some games he was
> playing were spewing out a lot of errors, and I wanted to find
> out which ones were at fault.
>
> Basically, uniq died on this task (well, it probably was
> working, but not completed after over 10 hours). I was using
> it something like this:
>
>     cat Xsession.errors | uniq > Xsession.uniq
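Before the itemized comments: since the real question was *which*
errors were at fault, counting how often each distinct line occurs
answers it directly, and needs no sorting at all. A minimal sketch,
assuming a modern Python 3 (the function name and path here are
just for illustration):

```python
from collections import Counter

def top_errors(lines, n=10):
    # Stream the input one line at a time, so nothing close to
    # 26 GB is ever held in memory; only the distinct lines
    # (plus their counts) are kept.
    counts = Counter()
    for line in lines:
        counts[line.rstrip("\n")] += 1
    # Most frequent first: the noisiest messages are the likely
    # culprits.
    return counts.most_common(n)

# An open file iterates line by line, so you can pass it straight in:
# with open("Xsession.errors", errors="replace") as f:
#     for message, count in top_errors(f):
#         print(count, message)
```

This only stays cheap if the file is huge because it's *repetitive*
(a scant handful of unique errors, repeated millions of times),
which is what a misbehaving game tends to produce.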
A couple of things I noticed that may play into matters:

1) uniq is a dedicated tool for uniquely identifying *neighboring*
   lines in a file. It doesn't get much faster than that, *if*
   that's your input. This leads to #4 below.

2) (uneventfully?) you have a superfluous use of cat. I don't know
   whether it's bogging matters down, but you can just use

       uniq < Xsession.errors > Xsession.uniq

   which would save you from having each line touched twice...once
   by cat, and once by uniq.

3) as "uniq" doesn't report on its progress, if it's processing a
   humongous 26 gig file, it may just sit there churning for a
   long time before finishing. It looks like it may have taken
   >10hr :)

4) "uniq" requires sorted input, since it only collapses *adjacent*
   duplicate lines. Unless you sorted your Xsession.errors
   beforehand, your output isn't likely to be as helpful. The
   Python set/generator scheme may work well to keep you from
   having to sort first, especially if you only have a fairly
   scant handful of unique errors.

5) I presume wherever you were writing Xsession.uniq had enough
   space...you mentioned your son filling your HDD. uniq may gasp,
   wheeze and die if there wasn't enough space...or it might just
   hang. I'd hope it would be smart enough to gracefully report
   "out of disk space" errors in the process.

6) unless I'm experiencing trouble, I just tend to keep my
   .xsession-errors file as a soft link to /dev/null, especially
   as KDE (when I use it rather than Fluxbox) likes to spit out
   mountains of KIO file errors. It's easy enough to unlink it and
   let it regenerate the file if needed.
7) with a file this large, you most certainly want to use a
   generator scheme rather than trying to load all of the lines
   into memory at once :) (Note to Bruno...yes, *this* would be
   one of those places you mentioned to me earlier about *not*
   using readlines() ;)

   If you have 2.4's nice generator-expression syntax, counting
   the unique lines is a one-liner:

       len(set(line.strip() for line in file("xsession.errors")))

   If you're using 2.3.x and don't have that, you should still be
   able to bypass reading the whole file into memory (and make use
   of sets) with

       from sets import Set as set
       s = set()
       for line in file("xsession.errors"):
           s.add(line.strip())
       print len(s)

   In your case, you likely don't have to call strip() and can
   just get away with adding each line to the set.

Just a few ideas for the next time you have a multi-gig
Xsession.errors situation :)

-tkc

--
http://mail.python.org/mailman/listinfo/python-list