Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Addenda [reposted to fix a premature submission].

Top use cases for sets (roughly in order of commonality):

1. Eliminate duplicates from an iterable input and loop over the result.
2. Store elements in a set just once but do many membership tests.
3. Perform data analysis on multiple sets using union, intersection, 
difference, and isdisjoint.
4. Remove elements from a set, one at a time, until it is empty.
5. Algorithms and that alternately add and remove different elements over time 
(toposort, NFA, DFA, etc).  

The first three are all penalized by an extra inner loop test and the loss of 
the register to track the freeslot.  Those use cases get no offsetting benefit.

Case 4 doesn't exercise the dummy reuse at all.  So, it is unaffected.

The last is approximately a net wash.  It pays the inner loop price, suffers 
the loss of a register, and incurs an extra test outside the loop, but it 
occasionally gets lucky and reuses a freeslot. The benefit for reuse is that it 
is delays the inevitable resize which would have reclaimed the dummy entries 
earlier. (The comparable use case for dicts is LRU/LFU caches where new entries 
are added at the same rate as old entries are removed.  That use case also 
showed a net wash when freeslot tracking was removed from dicts).

Not on the list at all is the case of a large set, where exactly the same 
element is added and removed in a tight loop.  The payoff for this case is that 
the O(n) resize step never occurs; however, this case is so rare that there is 
no reason to preference it over the common cases.

If the addition and removal of the same element happens only a few times, with 
no other set updates, the performance is about the same as case 5.

However, if there are many such updates and the set is large (as in the OP's 
code sample), the add() operation becomes quadratic because the chain of dummy 
or used entries grows larger and larger with each reinsertion.  (FWIW, dicts 
also face the same issue and some additional ones related to maintaining 
insertion order).  The question is whether this unhappy case is worth burdening 
all of the normal use cases.  If this turns out to be a real issue in practice, 
we can reopen this; otherwise, let's stick with what we have.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43198>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to