Re: deduping

Dave Angel Mon, 21 Jun 2010 06:34:46 -0700

dirknbr wrote:

Hi


I have 2 files (done and outf), and I want to choose unique elements
from the 2nd column in outf which are not in done. This code works but
is not efficient, can you think of a quicker way? The a=1 is just a
redundant task obviously, I put it this way around because I think
'in' is quicker than 'not in' - is that true?

done_={}
for line in done:
    done_[line.strip()]=0

print len(done_)

universe={}
for line in outf:
    if line.split(',')[1].strip() in universe.keys():
        a=1
    else:
        if line.split(',')[1].strip() in done_.keys():
            a=1
        else:
            universe[line.split(',')[1].strip()]=0

Dirk

Where you have a=1, one would normally use the "pass" statement. But you're wrong that 'not in' is less efficient than 'in'. If there's a difference, it's probably negligible, and almost certainly less than the extra else clause you're forcing here.

When doing an 'in', do *not* use the keys() method, as you're replacing a fast lookup with a slow one, not to mention the time it takes to build the keys() list each time.
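To illustrate the cost, here is a rough timing sketch. It assumes Python 2 semantics, where keys() returns a freshly built list, and mimics that with an explicit list so it also runs on Python 3 (where keys() returns a view and membership is a hash lookup again). The names d, keys_list, t_dict and t_list are made up for the example, and the actual timings will vary by machine.

```python
import timeit

# A dict with 10000 keys, all mapped to 0, like done_ in the original post.
d = dict.fromkeys(range(10000), 0)

# Roughly what Python 2's d.keys() hands back: a plain list of the keys.
keys_list = list(d)

# Membership test against the dict itself: a single hash lookup.
t_dict = timeit.timeit(lambda: 9999 in d, number=200)

# Membership test against the list: a linear scan over every key.
t_list = timeit.timeit(lambda: 9999 in keys_list, number=200)

print("dict lookup: %.6f s, list scan: %.6f s" % (t_dict, t_list))
```

The list scan here does not even include the time to rebuild the list on every call, which the original loop also pays.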

In both these cases, you can use a set, rather than a dict. And there's no need to test whether the item is already in the set, just put it in again.
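A quick sketch of why the check is unnecessary: adding an element that is already in a set simply leaves the set unchanged. The sample values are invented for the example.

```python
seen = set()
for item in ["a", "b", "a", "c", "b"]:
    # No "if item in seen" test needed: re-adding an existing item is a no-op.
    seen.add(item)

print(len(seen))  # 3 distinct items remain
```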


Changing all that, you'll wind up with something like (untested)

done_set = set()
universe = set()
for line in done:
   done_set.add(line.strip())
for line in outf:
   item = line.split(',')[1].strip()
   if item not in done_set:
       universe.add(item)
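For anyone who wants to try the idea without the real files, here is a self-contained sketch of the same approach. The function name unseen_items and the sample data are hypothetical, and io.StringIO objects stand in for the open files done and outf; it assumes every line of outf has at least two comma-separated fields.

```python
import io

def unseen_items(done, outf):
    """Return the distinct second-column values of outf that are not in done."""
    # One stripped line of 'done' per set element.
    done_set = set(line.strip() for line in done)
    universe = set()
    for line in outf:
        # Second comma-separated field, with surrounding whitespace removed.
        item = line.split(',')[1].strip()
        if item not in done_set:
            universe.add(item)
    return universe

# Made-up stand-ins for the two files.
done = io.StringIO("apple\nbanana\n")
outf = io.StringIO("1, apple\n2, cherry\n3, cherry\n4, date\n")

result = unseen_items(done, outf)
print(sorted(result))
```

With this sample data, "apple" is dropped because it appears in done, and "cherry" is kept only once even though it occurs twice.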


DaveA

--
http://mail.python.org/mailman/listinfo/python-list
