Re: number of different lines in a file

2006-05-19 Thread Paddy
Paul McGuire wrote: > "Kaz Kylheku" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > > Paddy wrote: > > ...if you are lucky enough to have a "zero copy" > > pipe implementation which allows data to go from the writer's buffer > > directly to the reader's one without intermediate ker

Re: number of different lines in a file

2006-05-19 Thread Paddy
Hi Kaz, The 'Unix way' is to have lots of small utilities that do one thing well, then connect them via pipes. It could be that the optimised sort algorithm is hampered if it has to remove duplicates too, or that the maintainers want to draw a line on added functionality. Personally, 95%* of the t

Re: number of different lines in a file

2006-05-19 Thread Paul McGuire
"Grant Edwards" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > Why would the second running of uniq remove any additional lines that > > weren't removed in the first pass? > > Because uniq only removes _adjacent_ identical lines. > Thanks, guess my *nix ignorance is showing (this

Re: number of different lines in a file

2006-05-19 Thread Paul McGuire
"Kaz Kylheku" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Paddy wrote: > ...if you are lucky enough to have a "zero copy" > pipe implementation whcih allows data to go from the writer's buffer > directly to the reader's one without intermediate kernel buffering. > I love it when

Re: number of different lines in a file

2006-05-19 Thread Kaz Kylheku
Paddy wrote: > If the log has a lot of repeated lines in its original state then > running uniq twice, once up front to reduce what needs to be sorted, > might be quicker? Having the uniq and sort steps integrated in a single piece of software allows for the most optimization opportunities. The s

Re: number of different lines in a file

2006-05-19 Thread Grant Edwards
On 2006-05-19, Paul McGuire <[EMAIL PROTECTED]> wrote: >> If the log has a lot of repeated lines in its original state then >> running uniq twice, once up front to reduce what needs to be sorted, >> might be quicker? >> >> uniq log_file | sort| uniq | wc -l >> >> - Pad. > > Why would the second r

Re: number of different lines in a file

2006-05-19 Thread Grant Edwards
On 2006-05-19, Kaz Kylheku <[EMAIL PROTECTED]> wrote: > There should be one huge utility which can do it all in a single > address space. Sure, as long as it can do all of everything you'll ever need to do, you're set! It would be the One True Program. Isn't that what Emacs is supposed to be?

Re: number of different lines in a file

2006-05-19 Thread Paul McGuire
"Paddy" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > If the log has a lot of repeated lines in its original state then > running uniq twice, once up front to reduce what needs to be sorted, > might be quicker? > > uniq log_file | sort| uniq | wc -l > > - Pad. > Why would the seco

Re: number of different lines in a file

2006-05-19 Thread Paddy
If the log has a lot of repeated lines in its original state then running uniq twice, once up front to reduce what needs to be sorted, might be quicker? uniq log_file | sort| uniq | wc -l - Pad.
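
As noted elsewhere in the thread, uniq only removes adjacent identical lines, which is why a uniq pass before the sort cannot replace the one after it. A minimal Python sketch of that behaviour follows. It is not code from the thread: `uniq_adjacent` is a hypothetical helper that imitates uniq on a list of lines, and the sample data is made up.

```python
from itertools import groupby


def uniq_adjacent(lines):
    """Collapse runs of identical adjacent items, roughly like uniq(1)."""
    return [key for key, _group in groupby(lines)]


log = ["a", "a", "b", "a", "b", "b"]

# On unsorted input, repeats that are not adjacent survive the pass.
once = uniq_adjacent(log)

# Sorting first makes all equal lines adjacent, so each line is kept once.
after_sort = uniq_adjacent(sorted(log))

print(once)        # ['a', 'b', 'a', 'b']
print(after_sort)  # ['a', 'b']
```

The first pass can still shrink the input to the sort when the log contains long runs of repeats, which seems to be the saving the suggestion is after.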

Re: number of different lines in a file

2006-05-19 Thread Kaz Kylheku
Bill Pursell wrote: > Have you tried > cat file | sort | uniq | wc -l ? The standard input file descriptor of sort can be attached directly to a file. You don't need a file catenating process in order to feed it: sort < file | uniq | wc -l And sort also takes a filename argument: sort file

Re: number of different lines in a file

2006-05-19 Thread Kaz Kylheku
Bill Pursell wrote: > Have you tried > cat file | sort | uniq | wc -l ? The standard input file descriptor of sort can be attached directly to a file. You don't need a file catenating process in order to feed it: sort < file | uniq | wc -l Sort has the uniq functionality built in: sort -u <

Re: number of different lines in a file

2006-05-19 Thread Tim Chase
> I actually had this problem a couple of weeks ago when I > discovered that my son's .Xsession file was 26 GB and had > filled the disk partition (!). Apparently some games he was > playing were spewing out a lot of errors, and I wanted to find > out which ones were at fault. > > Basically

Re: number of different lines in a file

2006-05-19 Thread Ben Stroud
> >It never occured to me to use the Python dict/set approach. Now I >wonder if it would've worked better somehow. Of course my file was >26,000 X larger than the one in this problem, and definitely would >not fit in memory. I suspect that there were as many as a million >duplicates for some me

Re: number of different lines in a file

2006-05-19 Thread Terry Hancock
Fredrik Lundh wrote: >a for loop inside square brackets is a "list comprehension", and the >result is a list. if you use a list comprehension inside a function >call, the full list is built *before* the function is called. in this >case, this would mean that the entire file would be read into

Re: number of different lines in a file

2006-05-18 Thread Fredrik Lundh
r.e.s. wrote: > BTW, the first thing I tried was Fredrik Lundh's program: > > def number_distinct(fn): > return len(set(s.strip() for s in open(fn))) > > which worked without the square brackets. Interesting that > omitting them doesn't seem to matter. a for loop inside square brackets is

Re: number of different lines in a file

2006-05-18 Thread pac
A generator expression can "share" the parentheses of a function call. The syntax is explained in PEP 289, which is also in "What's new" in the Python 2.4 docs. Nice line of code!
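
The difference between the two spellings can be sketched as below. This is an illustration rather than the posted code: it is written for current Python 3, takes any iterable of lines instead of a filename, and uses `io.StringIO` as a stand-in for a real file. `number_distinct_listcomp` is a made-up name for the square-bracket variant.

```python
import io


def number_distinct(lines):
    # Generator expression: it shares the parentheses of the set() call,
    # and set() pulls one stripped line at a time.
    return len(set(s.strip() for s in lines))


def number_distinct_listcomp(lines):
    # List comprehension: the full list of stripped lines is built
    # before set() is called.
    return len(set([s.strip() for s in lines]))


sample = "spam\neggs\nspam \nham\n"
print(number_distinct(io.StringIO(sample)))           # 3
print(number_distinct_listcomp(io.StringIO(sample)))  # 3
```

Both return the same count; the practical difference is only the temporary list, which matters for large files.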

Re: number of different lines in a file

2006-05-18 Thread r.e.s.
"Tim Chase" <[EMAIL PROTECTED]> wrote ... > 2) use a python set: > > s = set() > for line in open("file.in"): > s.add(line.strip()) > return len(s) > > 3) compact #2: > > return len(set([line.strip() for line in file("file.in")])) > > or, if stripping the lines isn't a concern, it can just be

Re: number of different lines in a file

2006-05-18 Thread Andrew Robert
r.e.s. wrote: > I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. > > On my PC, this little program just goes to never-never land: > > def number_distinct(fn): > f = file(fn) > x = f.readline().strip() > L

Re: number of different lines in a file

2006-05-18 Thread Tim Chase
> I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. A few ideas: 1) the shell way: bash$ sort file.in | uniq | wc -l This doesn't strip whitespace...a little sed magic would strip off whitespace for you: bash$ sed

Re: number of different lines in a file

2006-05-18 Thread Bill Pursell
r.e.s. wrote: > I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. > > On my PC, this little program just goes to never-never land: > > def number_distinct(fn): > f = file(fn) > x = f.readline().strip() > L =

Re: number of different lines in a file

2006-05-18 Thread Fredrik Lundh
r.e.s. wrote: > I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. > > On my PC, this little program just goes to never-never land: > > def number_distinct(fn): > f = file(fn) > x = f.readline().strip() > L

Re: number of different lines in a file

2006-05-18 Thread Ben Finney
"r.e.s." <[EMAIL PROTECTED]> writes: > I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. I'd generalise it by allowing the caller to pass any iterable set of items. A file handle can be iterated this way, but so can an
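
One way the generalisation might look is sketched below. The original code is cut off in this archive, so the function name `count_distinct` and its body are assumptions, not the poster's version. It accepts any iterable of hashable items.

```python
def count_distinct(items):
    """Return the number of distinct items in any iterable of hashable items."""
    return len(set(items))


print(count_distinct(["a", "b", "a"]))           # 2
print(count_distinct(x % 3 for x in range(10)))  # 3
```

For a file, the caller would pass the open file object itself, or a generator expression that strips each line first.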

Re: number of different lines in a file

2006-05-18 Thread Larry Bates
r.e.s. wrote: > I have a million-line text file with 100 characters per line, > and simply need to determine how many of the lines are distinct. > > On my PC, this little program just goes to never-never land: > > def number_distinct(fn): > f = file(fn) > x = f.readline().strip() > L

number of different lines in a file

2006-05-18 Thread r.e.s.
I have a million-line text file with 100 characters per line, and simply need to determine how many of the lines are distinct. On my PC, this little program just goes to never-never land: def number_distinct(fn): f = file(fn) x = f.readline().strip() L = [] while x<>'': if
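
The posted program is cut off in this archive, so the following is only a guess at why it is slow. It assumes the loop tests each line for membership in the list `L` before appending it, which means scanning a growing list once per line. The sketch compares that assumed shape with a set-based version; both function names are hypothetical.

```python
def number_distinct_list(lines):
    # Assumed shape of the slow version: each "not in" test scans the list.
    seen = []
    for line in lines:
        x = line.strip()
        if x not in seen:
            seen.append(x)
    return len(seen)


def number_distinct_set(lines):
    # Same result, but adding to a set takes roughly constant time per line.
    seen = set()
    for line in lines:
        seen.add(line.strip())
    return len(seen)


lines = ["%d\n" % (i % 500) for i in range(2000)]
print(number_distinct_list(lines), number_distinct_set(lines))  # 500 500
```

With a million mostly distinct lines, the list version performs on the order of the square of the line count in comparisons, while the set version stays close to linear.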