On Nov 16, 2007 12:42 PM, Robin Becker <[EMAIL PROTECTED]> wrote: > I'm trying to get my head round a timing puzzle I find whilst optimizing A > Kuchling's version of the Knuth line splitting algorithm from the oedipus > project. The puzzle is as follows in highly abstract form (here > active_nodes is a list of objects kept in a sorted order, but there are no > special methods on the objects for comparison etc) > > > for i in xrange(m): > ....... > for A in active_nodes[:]: > ...... > if cond: > active_nodes.remove(A) > ...... > ..... > > is slow my test takes 2.7 seconds; profiling reveals we're calling A.remove > thousands of times and it's taking a lot of time. There appear to be no other > usages of active_nodes in the for A loop. > > > On a hunch I changed the above to > > removed = [] > aremoved = removed.append > for i in xrange(m): > ....... > for iA,A in enumerate(active_nodes): > ...... > if cond: > aremoved(iA) > ...... > if removed: > for iA in reversed(removed): > del active_nodes[iA] > del removed[:] > ...... > > this latter is 5 times faster, but I can't really see why. My only guess is > that > appending to the removed list is relatively easy compared with deletes and > the > reverse order delete is somehow moving less data about. I think in the > original > Knuth the deletion would have been from some real kind of list and O(1) > whereas > here deletions are O(n/2) on average. On the other hand we have to keep this > list in sorted order. What data structure should I be using? I should add > that I > tried using a __cmp__ method to assist in doing the sorted insert, but that > interfered with the simple active_nodes.remove. > --
remove() does a linear search to find the item to remove, so it's O(n) + the actual deletion. Append() is amortized O(1) (overcommit). If you delete by index instead: for idx, node in active_nodes: if cond: del active_nodes[idx] You should see performance comparable to your second option. If you can trade memory for performance, discarding the old active_nodes might be even faster: active_nodes = [node for node in active_nodes if node not in deleted_nodes], where deleted_nodes is a set. Normal micro-optimization techniques apply here too, like looking up the remove() method ahead of time and so on. > Robin Becker > > -- > http://mail.python.org/mailman/listinfo/python-list > -- http://mail.python.org/mailman/listinfo/python-list