On Wed, 18 Nov 2015 01:27 am, Nicolas Évrard wrote: > Hello, > > I saw the following retweet by Raymond Hettinger in this morning: > > https://twitter.com/sanityinc/status/666485814214287360 > > Programming tip: many of those arrays and hashes in your code > should actually be sets. Match data structures to data > constraints!
This is why I hold Twitter in contempt. What Raymond says could be a good tip, or a load of old hogwash, and there is no way of knowing because you can't explain much in 130 characters. If it were a blog post, he could explain what he means. But its a tweet, so he can't. Of course you should match the data structure to your data constraints, but what does that mean in practice? *Which* arrays and hashes should be sets? How do you know which should be changed to sets? > I saw just in time because in a review I wrote something like this: > > if operator not in ('where', 'not where') > > and my colleague proposed that I should use a list instead of a tuple. > But reading the mentioned tweet I tend to think that a set would be a > better choice. Exactly. Raymond's tweet only serves to muddy the water and add more confusion. Before, you thought you knew the right answer: use a tuple. Then, your colleague proposed a list, adding some uncertainty and confusion: which is better, a list or a tuple? And now Raymond's tweet has *increased* the uncertainty: should you use a set? > What are your opinion on this issue (I realize it's not something of > the utmost importance but rather a "philosophical" question). I would say that there is no doubt that in Python 2, using a tuple is far and away the best solution for the situation you show above. There are three factors to weigh up: (1) how easy is it to create the data structure? (2) how much space does the data structure use? (3) how fast is searching the data structure? How easy is it to create the data structure? In Python 2, lists and tuples are effectively just as easy to type, but sets not so much: ('where', 'not where') ['where', 'not where'] set(['where', 'not where']) Not only are sets longer to type, but they are created at runtime. Compare the byte-code compiled for each expression: py> from dis import dis py> a = compile("('where', 'not where')", "", "single") py> b = compile("['where', 'not where']", "", "single") py> c = compile("set(['where', 'not where'])", "", "single") py> dis(a) 1 0 LOAD_CONST 3 (('where', 'not where')) 3 PRINT_EXPR 4 LOAD_CONST 2 (None) 7 RETURN_VALUE py> dis(b) 1 0 LOAD_CONST 0 ('where') 3 LOAD_CONST 1 ('not where') 6 BUILD_LIST 2 9 PRINT_EXPR 10 LOAD_CONST 2 (None) 13 RETURN_VALUE py> dis(c) 1 0 LOAD_NAME 0 (set) 3 LOAD_CONST 0 ('where') 6 LOAD_CONST 1 ('not where') 9 BUILD_LIST 2 12 CALL_FUNCTION 1 15 PRINT_EXPR 16 LOAD_CONST 2 (None) 19 RETURN_VALUE Python can optimise the creation of the tuple to compile-time, but it has to create the list at run time. It also creates the set at run time, but it also needs to do a name lookup. All this takes time. How much space does each data structure take? Again, a tuple easily wins over the others: py> import sys py> sys.getsizeof(('where', 'not where')) 32 py> sys.getsizeof(['where', 'not where']) 40 py> sys.getsizeof(set(['where', 'not where'])) 112 The tuple is smaller than the list, and MUCH smaller than the set. (The exact sizes you see will depend on the precise version and platform you are using, but I am confident that the tuple will win.) Last but not least, which is fastest? For that, we can use the timeit module, running it as a script from the shell: [steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in ('where', 'not where')" 10000000 loops, best of 3: 0.122 usec per loop [steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in ['where', 'not where']" 10000000 loops, best of 3: 0.119 usec per loop [steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in set(['where', 'not where'])" 1000000 loops, best of 3: 0.728 usec per loop The difference between 0.122 microseconds and 0.119 microseconds is too close to call. We can say that the tuple and the list are equally fast, and the set much, much slower. But... we can speed the set up, by only creating it once: [steve@ando ~]$ python -m timeit -s "x = 'not where'" -s "S = set(['where', 'not where'])" "x in S" 10000000 loops, best of 3: 0.101 usec per loop That's better! But only *marginally* faster than the tuple, and to get the speed increase you have to factor out the set into a constant. So for Python 2, I would say that tuples are the clean winner, at least for the case where you only have two items to check. If you have more than two, sets start becoming more attractive. How about Python 3? Python 3 has the advantage of using set literals. Let's test the speed: [steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in ('where', 'not where')" 10000000 loops, best of 3: 0.11 usec per loop [steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in ['where', 'not where']" 10000000 loops, best of 3: 0.113 usec per loop [steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in {'where', 'not where'}" 10000000 loops, best of 3: 0.0907 usec per loop This makes sets much more attractive in Python 3. -- Steven -- https://mail.python.org/mailman/listinfo/python-list