Christos TZOTZIOY Georgiou wrote:
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that [EMAIL PROTECTED] might have written:
I challenge you to a benchmark :-)


Well, the numbers I provided above are almost meaningless with such a
small set (they could easily have come out reversed; I just kept the
first run, which happened to favour me :).  Do you really believe that
sorting three files and then scanning their merged output counting
duplicates is faster than scanning two files (and doing lookups during
the second scan)?
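For concreteness, a minimal Python sketch of the lookup approach, using the /tmp/A and /tmp/B files created in the session below (the function name is mine, not from the thread):

# Scan B once to build a set, then scan A once, keeping the lines
# that are not in the set: the "two scans with lookups" approach.
def lines_not_in_b(a_path, b_path):
    exclude = set(open(b_path))             # scan 1: load B for O(1) lookups
    return [line for line in open(a_path)   # scan 2: filter A
            if line not in exclude]

print(len(lines_not_in_b('/tmp/A', '/tmp/B')))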

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = open('/usr/share/dict/words').readlines()
>>> len(x)
45378
>>> import random
>>> random.shuffle(x)
>>> open("/tmp/A", "w").writelines(x)
>>> random.shuffle(x)
>>> open("/tmp/B", "w").writelines(x[:1000])

$ time sort A B B | uniq -u >/dev/null

real    0m0.311s
user    0m0.315s
sys     0m0.008s
$ time grep -Fvf B A >/dev/null

real    0m0.067s
user    0m0.064s
sys     0m0.003s

(Yes, I cheated by adding the -F flag, which makes grep treat the patterns as fixed strings rather than regular expressions :)
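The sort pipeline can be mimicked in Python as well; a rough sketch follows (variable names are mine). Listing B twice is the trick: every line of B then appears at least twice in the sorted merge, so keeping only the singletons leaves exactly the lines of A that are not in B (like uniq -u, this also drops any line duplicated within A itself):

import itertools

# Merge A with two copies of B, sort, and keep lines occurring exactly once.
lines = open('/tmp/A').readlines() + 2 * open('/tmp/B').readlines()
lines.sort()
only_in_a = [line for line, group in itertools.groupby(lines)
             if len(list(group)) == 1]
print(len(only_in_a))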

Also, you only have 1000 entries in B! Try it again with all the entries in B as well ;-) Remember, the original poster had 100K entries!
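Something along these lines would redo the comparison with a full-sized B (commands only; the timings would have to be measured again):

$ python -c "import random; x = open('/usr/share/dict/words').readlines(); random.shuffle(x); open('/tmp/B', 'w').writelines(x)"
$ time sort A B B | uniq -u >/dev/null
$ time grep -Fvf B A >/dev/null

With 45378 patterns instead of 1000, grep -F has a much larger fixed-string table to build and match against, so the gap between the two approaches may well narrow.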

and finally destroys original line order (should it be important).

true

That's our final agreement :)

Note the order is trivial to restore with a "decorate-sort-undecorate" idiom.
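A minimal sketch of that idiom (variable names are illustrative): tag each line with its original position before sorting, and sort on the tags afterwards to put everything back.

lines = open('/tmp/A').readlines()
decorated = [(line, i) for i, line in enumerate(lines)]  # decorate with position
decorated.sort()                  # sort by content; original order is now destroyed
# ... do the order-destroying processing here, e.g. drop duplicates ...
decorated = [(i, line) for line, i in decorated]
decorated.sort()                  # sort back by original position
restored = [line for i, line in decorated]               # undecorate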

--
Pádraig Brady - http://www.pixelbeat.org