In a post about CPython's GIL, Steven D'Aprano pointed to Armin Ronacher's criticism of the internal type slots used for dunder methods.

> http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see/

I found the following interesting.
"
Since we have an __add__ method the interpreter will set this up in a slot. So how fast is it? When we do a + b we will use the slots, so here is what it times it as:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.256 usec per loop

If we do however a.__add__(b) we bypass the slot system. Instead the interpreter is looking in the instance dictionary (where it will not find anything) and then looks in the type's dictionary where it will find the method. Here is where that clocks in at:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.158 usec per loop

Can you believe it: the version without slots is actually faster. What magic is that? I'm not entirely sure what the reason for this is,"

Curious myself, I repeated the result on my Win7 machine and got almost the same numbers.

>>> class A:
        def __add__(self, other): return 2
>>> import timeit
>>> timeit.repeat('a + b', 'from __main__ import A; a=A(); b=A()')
[0.26080520927348516, 0.24120280310165754, 0.2412111032140274]
>>> timeit.repeat('a.__add__(b)', 'from __main__ import A; a=A(); b=A()')
[0.17656398710346366, 0.15274235713354756, 0.1528444177747872]
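The session above can also be run as a standalone script (a sketch; absolute numbers depend on the machine and CPython version, and on newer interpreters the ordering may even flip):

```python
# Standalone reproduction of the two timings above.
import timeit

class A:
    def __add__(self, other):
        return 2

a, b = A(), A()

# min() of the repeats is the conventional low-noise estimate.
slot = min(timeit.repeat('a + b', globals=globals(),
                         number=100_000, repeat=3))
wrapper = min(timeit.repeat('a.__add__(b)', globals=globals(),
                            number=100_000, repeat=3))
print(f'a + b:        {slot:.4f}s')
print(f'a.__add__(b): {wrapper:.4f}s')
```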

First I looked at the byte code.

>>> from dis import dis
>>> dis('a+b')
  1           0 LOAD_NAME                0 (a)
              3 LOAD_NAME                1 (b)
              6 BINARY_ADD
              7 RETURN_VALUE
>>> dis('a.__add__(b)')
  1           0 LOAD_NAME                0 (a)
              3 LOAD_ATTR                1 (__add__)
              6 LOAD_NAME                2 (b)
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 RETURN_VALUE

Next, the core of the BINARY_ADD code in Python/ceval.c:
            if (PyUnicode_CheckExact(left) &&
                     PyUnicode_CheckExact(right)) {
                sum = unicode_concatenate(left, right, f, next_instr);
                /* unicode_concatenate consumed the ref to v */
            }
            else {
                sum = PyNumber_Add(left, right);
            }

By the language definition, PyNumber_Add must check whether issubclass(type(b), type(a)). If so (and type(b) overrides __radd__), it tries b.__radd__(a) first; otherwise it tries a.__add__(b). So BINARY_ADD has extra overhead before it ever calls __add__. Is that enough to explain the .09 microsecond difference between .25 and .16 microseconds?
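Here is an illustrative pure-Python model of that dispatch order (a simplification of what PyNumber_Add does in C; the name binary_add and the structure are mine, and real CPython works through type slots, not attribute lookup):

```python
def binary_add(a, b):
    # Sketch of PyNumber_Add's dispatch order (simplified).
    ta, tb = type(a), type(b)
    # A proper subclass of type(a) that overrides __radd__ gets
    # first crack at the operation.
    sub_first = (tb is not ta and issubclass(tb, ta)
                 and '__radd__' in vars(tb))
    if sub_first:
        r = b.__radd__(a)
        if r is not NotImplemented:
            return r
    if hasattr(ta, '__add__'):
        r = a.__add__(b)          # real CPython looks only on the type
        if r is not NotImplemented:
            return r
    if not sub_first and hasattr(tb, '__radd__'):
        r = b.__radd__(a)
        if r is not NotImplemented:
            return r
    raise TypeError('unsupported operand type(s) for +: '
                    f'{ta.__name__!r} and {tb.__name__!r}')
```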

Let's try some builtins...

>>> timeit.repeat('1+1')
[0.04067762117549266, 0.019206152658126363, 0.018796680446902643]
>>> timeit.repeat('1.0+1.0')
[0.032686457413774406, 0.023207729064779414, 0.018793606331200863]
>>> timeit.repeat('(1.0+1j) + (1.0-1j)')
[0.037775348543391374, 0.01876409482042618, 0.018812358436889554]

>>> timeit.repeat("''+''")
[0.04073695160855095, 0.018977745861775475, 0.018800676797354754]
>>> timeit.repeat("'a'+'b'")
[0.04066932106320564, 0.01896145304840502, 0.01879268409652468]

>>> timeit.repeat('1 .__add__(1)')
[0.16622020259652004, 0.15244908649577837, 0.15047857833215517]
>>> timeit.repeat("''.__add__('')")
[0.17265801569533323, 0.1535966538865523, 0.15308880997304186]

For the common case of adding builtin numbers and empty strings, the binary operation is about 8 times as fast as the dict lookup and function call. For empty lists, the ratio is about 3:

>>> timeit.repeat('[]+[]')
[0.09728684696551682, 0.08233527043626054, 0.08230698857164498]
>>> timeit.repeat('[].__add__([])')
[0.22780949582033827, 0.2060266193825555, 0.2060967092206738]
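The ratios can be computed directly; this sketch uses variables rather than literals because the compiler constant-folds expressions like 1+1 into a single constant load, which would time only loop overhead (values vary by machine and Python version):

```python
import timeit

def best(stmt, env):
    # min of several repeats filters out scheduling noise
    return min(timeit.repeat(stmt, globals=env, number=100_000, repeat=3))

# Variables defeat constant folding of literal arithmetic.
n, m = 1, 1
int_ratio = best('n.__add__(m)', globals()) / best('n + m', globals())
list_ratio = best('[].__add__([])', globals()) / best('[] + []', globals())
print(f'int ratio:  {int_ratio:.1f}')
print(f'list ratio: {list_ratio:.1f}')
```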

Conclusions:
1. Python-level function calls to C wrappers of C functions are as slow as calls to Python functions (which I already knew to be relatively slow).

2. Taking into account the interpreter's internal binary operations on builtin Python objects, I suspect almost everyone benefits from the slot optimization.

3. The total BINARY_ADD + function call time for strings and numbers, about .02 microseconds, is much less than the .09 difference and cannot account for it.

4. There might be some avoidable overhead within PyNumber_Add that only affects custom-class instances (but I am done for tonight ;-).
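For conclusion 1, the wrapper objects involved can be inspected directly in CPython; the dunder attributes of builtins are C slot functions exposed through Python-level callable wrappers:

```python
# Looked up on the type, __add__ is a slot-wrapper descriptor...
print(type(int.__add__).__name__)    # prints wrapper_descriptor
# ...and bound to an instance it becomes a method-wrapper.
print(type((1).__add__).__name__)    # prints method-wrapper

# Calling it goes through the generic call machinery, while 1 + 1
# dispatches directly on the nb_add slot in C.
print((1).__add__(1))                # prints 2
```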

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list