On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that [EMAIL PROTECTED] might have written:
[sf] >>>>Essentially, want to do efficient grep, i..e from A remove those lines which >>>>are also present in file B. [EMAIL PROTECTED] >>>You could implement elegantly using the new sets feature >>>For reference here is the unix way to do it: >>> >>>sort a b b | uniq -u [Christos] >> No, like I just wrote in another post, he wants >> >> $ grep -vf B A >> >> I think that >> >> $ sort A B B | uniq -u >> >> can be abbreviated to >> >> $ sort -u A B B >> >> which is the union rather than the intersection of the files [EMAIL PROTECTED] >wrong. Notice the -u option to uniq. >http://marc.free.net.ph/message/20021101.043138.1bc24964.html I see your point. That's a new to me use of uniq, since I started using Unices long before GNU versions of the tools, but then, I might have missed the -u option. $ cat A aa ss dd ff gg hh jj kk $ cat B aa dd gg jj pp $ time sort A B B | uniq -u ff hh kk ss real 0m0.004s user 0m0.004s sys 0m0.004s $ time grep -vf B A ss ff hh kk real 0m0.003s user 0m0.000s sys 0m0.003s So I stand corrected that your solution does *not* give the union. >> wastes some time by considering B twice > >I challenge you to a benchmark :-) Well, the numbers I provided above are almost meaningless with such a small set (and they easily could be reverse, I just kept the convenient-to-me first run :). Do you really believe that sorting three files and then scanning their merged output counting duplicates is faster than scanning two files (and doing lookups during the second scan)? $ python Python 2.3.3 (#1, Aug 31 2004, 13:51:39) [GCC 3.3.3 (SuSE Linux)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> x=open('/usr/share/dict/words').readlines() >>> len(x) 45378 >>> import random >>> random.shuffle(x) >>> open("/tmp/A", "w").writelines(x) >>> random.shuffle(x) >>> open("/tmp/B", "w").writelines(x[:1000]) >>> $ time sort A B B | uniq -u >/dev/null real 0m0.311s user 0m0.315s sys 0m0.008s $ time grep -Fvf B A >/dev/null real 0m0.067s user 0m0.064s sys 0m0.003s (Yes, I cheated by adding the F (for no regular expressions) flag :) >> and finally destroys original line >> order (should it be important). > >true That's our final agreement :) -- TZOTZIOY, I speak England very best. "Be strict when sending and tolerant when receiving." (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list