Christos TZOTZIOY Georgiou wrote:
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that [EMAIL PROTECTED]
I challenge you to a benchmark :-)
Well, the numbers I provided above are almost meaningless with such a
small set (and they could easily be reversed; I just kept the first
run, which happened to favour me :). Do you really believe that sorting
three files and then scanning their merged output to count duplicates
is faster than scanning two files (and doing lookups during the second
scan)?
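For the record, a minimal sketch of that second approach in Python,
assuming the /tmp/A and /tmp/B files generated in the session below
(the output filename is arbitrary). It is equivalent in spirit to the
grep run further down, though strictly grep -F matches substrings
unless you also pass -x, while this compares whole lines. Note the
builtin set type arrived in 2.4; on the 2.3 interpreter shown you
would reach for the sets module or dict keys instead:

b = set(open('/tmp/B'))             # first scan: hash every line of B
out = open('/tmp/only_in_A', 'w')
for line in open('/tmp/A'):         # second scan: one set lookup per line
    if line not in b:
        out.write(line)
out.close()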
$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = open('/usr/share/dict/words').readlines()
>>> len(x)
45378
>>> import random
>>> random.shuffle(x)
>>> open("/tmp/A", "w").writelines(x)
>>> random.shuffle(x)
>>> open("/tmp/B", "w").writelines(x[:1000])
$ time sort A B B | uniq -u >/dev/null
real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null
real 0m0.067s
user 0m0.064s
sys 0m0.003s
(Yes, I cheated by adding the -F flag, which matches fixed strings
rather than regular expressions :)
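For comparison, here is a rough sketch of what the sort-based pipeline
computes, in Python. B's lines are fed in twice, so after sorting, a
line that appears exactly once can only have come from A; like
`uniq -u`, this assumes A itself contains no duplicate lines:

lines = open('/tmp/A').readlines() + open('/tmp/B').readlines() * 2
lines.sort()                        # the expensive part: one big sort
unique = []
i = 0
while i < len(lines):
    j = i
    while j < len(lines) and lines[j] == lines[i]:
        j += 1                      # skip over a run of equal lines
    if j - i == 1:                  # run of length 1: in A only
        unique.append(lines[i])
    i = j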
Also, you have only 1000 entries in B!
Try it again with all entries in B also ;-)
Remember, the original poster had 100K entries!
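(For what it's worth, regenerating the test file with every entry is a
one-liner in the same session; a sketch, with the timings left to be
rerun:

random.shuffle(x)
open("/tmp/B", "w").writelines(x)   # all 45378 lines this time
)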
and finally destroys the original line
order (should it be important).
true
That's our final agreement :)
Note that the order is trivial to restore with the
"decorate-sort-undecorate" idiom.
--
Pádraig Brady - http://www.pixelbeat.org