[issue13653] reorder set.intersection parameters for better performance

Andrew Dalke Fri, 23 Dec 2011 04:58:22 -0800

Andrew Dalke <da...@dalkescientific.com> added the comment:

My belief is that the people who use set.intersection with more than two terms 
are 1) going to pass in a list of sets, and 2) don't care about the specific 
order.


To check the validity of my belief, I did a Google Code Search to find cases of 
people using set intersection in Python. I searched for "set\.intersection\(\*" 
and "\.intersection\(.*\, lang:^python$", among others.

I am sad to report that the most common way to compute set.intersection(*list) 
is by using reduce, like:

    possible = (set(index[c]) for c in set(otp))
    possible = reduce(lambda a, b: a.intersection(b), possible)


That comes from:
  git://github.com/Kami/python-yubico-client.git /yubico/modhex.py
and similar uses are in:
  git://github.com/sburns/PyCap.git /redcap/rc.py
  http://hltdi-l3.googlecode.com/hg//xdg/languages/morpho/fst.py
  http://dsniff.googlecode.com/svn/trunk/dsniff/lib/fcap.py


As well as in the Rosetta Code example for a simple inverted index, at:
  http://rosettacode.org/wiki/Inverted_index#Python

This was also implemented more verbosely in:

http://eats.googlecode.com/svn/trunk/server/eats/views/main.py
    intersected_set = sets[0]
    for i in range(1, len(sets)):
        intersected_set = intersected_set.intersection(sets[i])

and 

http://iocbio.googlecode.com/svn/trunk/iocbio/microscope/cluster_tools.py
        s = set (range (len (data[0])))
        for d in zip(*data):
            s = s.intersection(set(find_outliers(d, zoffset=zoffset)))
        return sorted(s)

In other words, 7 codebases use manual pairwise reduction rather than use the 
equivalent code in Python. (I have not checked for which are due to backwards 
compatibility requirements.)

On the other hand, if someone really wants to have a specific intersection 
order, this shows that it's very easy to write.


I found 4 other code bases where set intersection was used for something other 
than binary intersection, and used the built-in intersection().



git://github.com/valda/wryebash.git/experimental/bait/bait/presenter/impl/filters.py
    def get_visible_node_ids(self, filterId):
        if filterId in self.idMask:
            visibleNodeIdSets = [f.get_visible_node_ids(filterId) for f in 
self._filters]
            return set.intersection(*[v for v in visibleNodeIdSets if v is not 
None])
        return None



http://wallproxy.googlecode.com/svn/trunk/local/proxy.py
                if threads[ct].intersection(*threads.itervalues()):
                    raise ValueError('All threads failed')
(here, threads' values contain sets)



git://github.com/argriffing/xgcode.git/20100623a.py
    header_sets = [set(x) for x in header_list]
    header_intersection = set.intersection(*header_sets)




http://pyvenn.googlecode.com/hg//venn.py
            to_exclude = set()
            for ii in xrange(0, len(self.sets)):
                if (i & (2**ii)):
                    sets_to_intersect.append(sets_by_power_of_two[i & (2**ii)])
                else:
                    to_exclude = to_exclude.union(sets_by_power_of_two[(2**ii)])
            final = set.intersection(*sets_to_intersect) - to_exclude



These all find the intersection of sets (not iterators), and the order of 
evaluation does not appear like it will affect the result.

I do not know though if there will be a performance advantage in these cases to 
reordering. I do know that in my code, and any inverted index, there is an 
advantage.

And I do know that the current CPython implementation has bad worst-case 
performance.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13653>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13653] reorder set.intersection parameters for better performance

Reply via email to