Paul McGuire wrote:
> "Kaz Kylheku" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Paddy wrote:
> > ...if you are lucky enough to have a "zero copy"
> > pipe implementation which allows data to go from the writer's buffer
> > directly to the reader's one without intermediate kernel buffering.
Hi Kaz,
The 'Unix way' is to have lots of small utilities that do one thing
well, then connect them via pipes. It could be that the optimised sort
algorithm is hampered if it has to remove duplicates too, or that the
maintainers want to draw a line on added functionality.
Personally, 95%* of the t
"Grant Edwards" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> > Why would the second running of uniq remove any additional lines that
> > weren't removed in the first pass?
>
> Because uniq only removes _adjacent_ identical lines.
>
Thanks, guess my *nix ignorance is showing (this
"Kaz Kylheku" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Paddy wrote:
> ...if you are lucky enough to have a "zero copy"
> pipe implementation which allows data to go from the writer's buffer
> directly to the reader's one without intermediate kernel buffering.
>
I love it when
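For anyone curious what zero-copy piping looks like in practice: on Linux,
Python 3.10+ exposes the splice() system call as os.splice, which moves
bytes between a pipe and another file descriptor entirely inside the
kernel. A minimal sketch, assuming a Linux box; the helper name is
illustrative, not from this thread:

    import os

    # Move a file's bytes to out_fd without copying them through user
    # space. splice requires a pipe on one side of each transfer, so we
    # route file -> pipe -> destination, all inside the kernel.
    def splice_file_to_fd(in_path, out_fd, chunk=1 << 16):
        in_fd = os.open(in_path, os.O_RDONLY)
        r, w = os.pipe()
        try:
            while True:
                n = os.splice(in_fd, w, chunk)   # file -> pipe
                if n == 0:                       # EOF on the input file
                    break
                while n:
                    n -= os.splice(r, out_fd, n) # pipe -> destination fd
        finally:
            os.close(r)
            os.close(w)
            os.close(in_fd)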
Paddy wrote:
> If the log has a lot of repeated lines in its original state then
> running uniq twice, once up front to reduce what needs to be sorted,
> might be quicker?
Having the uniq and sort steps integrated in a single piece of software
allows for the most optimization opportunities.
The s
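In Python terms, the "integrated" approach amounts to deduplicating and
sorting in one address space, with no pipe traffic at all. A rough
single-process analogue of sort -u (the file name is illustrative):

    # The set deduplicates as the file streams past; sorted() then
    # orders only the survivors, so less data gets sorted.
    with open("log_file") as f:
        unique = set(line.rstrip("\n") for line in f)
    print(len(unique))          # number of distinct lines
    print(sorted(unique)[:5])   # first few, in sorted order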
On 2006-05-19, Paul McGuire <[EMAIL PROTECTED]> wrote:
>> If the log has a lot of repeated lines in its original state then
>> running uniq twice, once up front to reduce what needs to be sorted,
>> might be quicker?
>>
>> uniq log_file | sort | uniq | wc -l
>>
>> - Pad.
>
> Why would the second running of uniq remove any additional lines that
> weren't removed in the first pass?
On 2006-05-19, Kaz Kylheku <[EMAIL PROTECTED]> wrote:
> There should be one huge utility which can do it all in a single
> address space.
Sure, as long as it can do all of everything you'll ever need
to do, you're set! It would be the One True Program.
Isn't that what Emacs is supposed to be?
"Paddy" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> If the log has a lot of repeated lines in its original state then
> running uniq twice, once up front to reduce what needs to be sorted,
> might be quicker?
>
> uniq log_file | sort | uniq | wc -l
>
> - Pad.
>
Why would the second running of uniq remove any additional lines that
weren't removed in the first pass?
If the log has a lot of repeated lines in its original state then
running uniq twice, once up front to reduce what needs to be sorted,
might be quicker?
uniq log_file | sort | uniq | wc -l
- Pad.
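Paddy's pre-pass has a direct Python analogue: itertools.groupby merges
runs of equal adjacent items, which is exactly what uniq does. A sketch,
with an illustrative function name:

    from itertools import groupby

    # Like uniq log_file | sort | uniq | wc -l: groupby collapses only
    # *adjacent* duplicates, then the set removes the non-adjacent ones.
    def count_distinct_with_prepass(path):
        with open(path) as f:
            collapsed = (key for key, _ in groupby(line.strip() for line in f))
            return len(set(collapsed))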
Bill Pursell wrote:
> Have you tried
> cat file | sort | uniq | wc -l ?
The standard input file descriptor of sort can be attached directly to
a file. You don't need a file catenating process in order to feed it:
sort < file | uniq | wc -l
And sort also takes a filename argument:
sort file
Sort has the uniq functionality built in:
sort -u < file
> I actually had this problem a couple of weeks ago when I
> discovered that my son's .Xsession file was 26 GB and had
> filled the disk partition (!). Apparently some games he was
> playing were spewing out a lot of errors, and I wanted to find
> out which ones were at fault.
>
> Basically
>
>It never occurred to me to use the Python dict/set approach. Now I
>wonder if it would've worked better somehow. Of course my file was
>26,000 X larger than the one in this problem, and definitely would
>not fit in memory. I suspect that there were as many as a million
>duplicates for some messages.
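One way the set approach can be stretched to a file that size: store a
fixed-size digest of each line instead of the line itself, so memory
grows only with the number of distinct lines. A hedged sketch (md5
collisions are possible in principle, though astronomically unlikely):

    import hashlib

    # 16 bytes per distinct line: a million distinct lines costs on the
    # order of tens of megabytes, no matter how large the file is.
    def count_distinct_hashed(path):
        seen = set()
        with open(path, "rb") as f:
            for line in f:
                seen.add(hashlib.md5(line).digest())
        return len(seen)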
Fredrik Lundh wrote:
>a for loop inside square brackets is a "list comprehension", and the
>result is a list. if you use a list comprehension inside a function
>call, the full list is built *before* the function is called. in this
>case, this would mean that the entire file would be read into memory
>before the set is built.
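The difference is easy to see side by side (fn is assumed to name the
log file):

    # list comprehension: the whole list of stripped lines is built in
    # memory *before* set() ever sees it
    n = len(set([s.strip() for s in open(fn)]))

    # generator expression: lines are stripped and fed to set() one at
    # a time, so only the distinct lines are held at once
    n = len(set(s.strip() for s in open(fn)))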
r.e.s. wrote:
> BTW, the first thing I tried was Fredrik Lundh's program:
>
> def number_distinct(fn):
> return len(set(s.strip() for s in open(fn)))
>
> which worked without the square brackets. Interesting that
> omitting them doesn't seem to matter.
a for loop inside square brackets is a "list comprehension", and the
result is a list.
A generator expression can "share" the parenthesis of a function call.
The syntax is explained in PEP 289, which is also in "What's new" in
the Python 2.4 docs.
Nice line of code!
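The PEP 289 rule in brief: a generator expression that is the sole
argument of a call can reuse the call's parentheses, but it needs its
own pair as soon as there are other arguments:

    nums = [3, 1, 2]
    total = sum(n * n for n in nums)        # sole argument: no extra parens
    ordered = sorted((n * n for n in nums), reverse=True)  # extra argument: parens required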
"Tim Chase" <[EMAIL PROTECTED]> wrote ...
> 2) use a python set:
>
> s = set()
> for line in open("file.in"):
>     s.add(line.strip())
> return len(s)
>
> 3) compact #2:
>
> return len(set([line.strip() for line in file("file.in")]))
>
> or, if stripping the lines isn't a concern, it can just be
> return len(set(file("file.in")))
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
A few ideas:
1) the shell way:
bash$ sort file.in | uniq | wc -l
This doesn't strip whitespace...a little sed magic would
strip off whitespace for you:
bash$ sed
"r.e.s." <[EMAIL PROTECTED]> writes:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
I'd generalise it by allowing the caller to pass any iterable set of
items. A file handle can be iterated this way, but so can an ordinary
list, a generator, or any other iterable.
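A sketch of that generalisation; the body is a guess in the spirit of
the thread, not Paul Rubin's actual code:

    # Accept any iterable of strings rather than a file name.
    def number_distinct(items):
        return len(set(item.strip() for item in items))

    number_distinct(open("file.in"))      # a file handle is iterable...
    number_distinct(["a", "b ", "a"])     # ...and so is a plain list (-> 2)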
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.
On my PC, this little program just goes to never-never land:
def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x <> '':              # Python 2 inequality operator
        if x not in L:          # linear scan of L for every input line
            L.append(x)
        x = f.readline().strip()
    return len(L)
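Why this goes to never-never land: "x not in L" scans the whole list,
so n distinct lines cost on the order of n*n/2 comparisons. A quick
back-of-envelope:

    n = 1000000
    print(n * n // 2)   # 500,000,000,000 comparisons for the list version
    # a set's hash-based membership test keeps the scan roughly linear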