In a post about CPython's GIL, Steven D'Aprano pointed to Armin
Ronacher's criticism of the internal type slots used for dunder methods.
> http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see/
I found the following interesting.
"
Since we have an __add__ method the interpreter will set this up in a
slot. So how fast is it? When we do a + b we will use the slots, so here
is what it times it as:
$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.256 usec per loop
If we do however a.__add__(b) we bypass the slot system. Instead the
interpreter is looking in the instance dictionary (where it will not
find anything) and then looks in the type's dictionary where it will
find the method. Here is where that clocks in at:
$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.158 usec per loop
Can you believe it: the version without slots is actually faster. What
magic is that? I'm not entirely sure what the reason for this is,"
Curious myself, I repeated the result on my Win7 machine and got almost
the same numbers.
>>> class A:
        def __add__(self, other): return 2
>>> timeit.repeat('a + b', 'from __main__ import A; a=A(); b=A()')
[0.26080520927348516, 0.24120280310165754, 0.2412111032140274]
>>> timeit.repeat('a.__add__(b)', 'from __main__ import A; a=A(); b=A()')
[0.17656398710346366, 0.15274235713354756, 0.1528444177747872]
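As a side check (my own sketch, not from either post): the implicit a + b consults the slot on the type only, so shadowing __add__ in the instance dict changes the explicit call but not the operator.

```python
# Sketch: implicit "a + b" uses the type slot, while explicit
# "a.__add__(b)" does a normal attribute lookup that starts in
# the instance dict.
class A:
    def __add__(self, other):
        return 2

a, b = A(), A()
a.__add__ = lambda other: 99   # shadow the dunder on the instance

print(a + b)         # still 2: the operator ignores the instance dict
print(a.__add__(b))  # 99: the explicit lookup finds the instance attribute
```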
First I looked at the byte code.
>>> dis('a+b')
1 0 LOAD_NAME 0 (a)
3 LOAD_NAME 1 (b)
6 BINARY_ADD
7 RETURN_VALUE
>>> dis('a.__add__(b)')
1 0 LOAD_NAME 0 (a)
3 LOAD_ATTR 1 (__add__)
6 LOAD_NAME 2 (b)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 RETURN_VALUE
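(A version-independent cross-check of my own, since the exact opcode names change between releases: dis.get_instructions shows the operator form needs no attribute load and no explicit call instruction.)

```python
# Compare the instruction streams for the two spellings; the
# __add__ call needs an attribute lookup plus a call on top of
# the name loads, so its stream is longer.
import dis

for expr in ('a + b', 'a.__add__(b)'):
    names = [i.opname for i in dis.get_instructions(expr)]
    print(expr, '->', names)
```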
Next, the core of the BINARY_ADD code in Python/ceval.c:
if (PyUnicode_CheckExact(left) &&
PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(left, right, f, next_instr);
/* unicode_concatenate consumed the ref to v */
}
else {
    sum = PyNumber_Add(left, right);
}
By the language definition, PyNumber_Add must first check whether
issubclass(type(b), type(a)). If so, it tries b.__radd__(a). Otherwise
it tries a.__add__(b). So BINARY_ADD has extra overhead before it calls
__add__. Is that enough to explain the .09 microsecond difference
between .25 and .16 microseconds?
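A rough Python-level sketch of that dispatch rule (my own simplification; the real logic is C code in Objects/abstract.c working on type slots, and it also checks that the right-hand subclass actually overrides the reflected method):

```python
# Simplified model of the binary-add dispatch: a proper subclass on
# the right gets first shot at __radd__, then __add__ on the left,
# then __radd__ on the right as a fallback.
def binary_add(a, b):
    ta, tb = type(a), type(b)
    tried_radd = False
    if tb is not ta and issubclass(tb, ta):
        radd = getattr(tb, '__radd__', None)
        # Only jump the queue if the subclass overrides __radd__.
        if radd is not None and radd is not getattr(ta, '__radd__', None):
            tried_radd = True
            result = radd(b, a)
            if result is not NotImplemented:
                return result
    add = getattr(ta, '__add__', None)
    if add is not None:
        result = add(a, b)
        if result is not NotImplemented:
            return result
    if tb is not ta and not tried_radd:
        radd = getattr(tb, '__radd__', None)
        if radd is not None:
            result = radd(b, a)
            if result is not NotImplemented:
                return result
    raise TypeError(f"unsupported operand type(s) for +: "
                    f"{ta.__name__!r} and {tb.__name__!r}")

print(binary_add(1, 2))      # 3
print(binary_add('a', 'b'))  # ab
```

All of these checks run before the custom __add__ is ever reached, which is the extra work the operator path pays for.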
Let's try some builtins...
>>> timeit.repeat('1+1')
[0.04067762117549266, 0.019206152658126363, 0.018796680446902643]
>>> timeit.repeat('1.0+1.0')
[0.032686457413774406, 0.023207729064779414, 0.018793606331200863]
>>> timeit.repeat('(1.0+1j) + (1.0-1j)')
[0.037775348543391374, 0.01876409482042618, 0.018812358436889554]
>>> timeit.repeat("''+''")
[0.04073695160855095, 0.018977745861775475, 0.018800676797354754]
>>> timeit.repeat("'a'+'b'")
[0.04066932106320564, 0.01896145304840502, 0.01879268409652468]
>>> timeit.repeat('1 .__add__(1)')
[0.16622020259652004, 0.15244908649577837, 0.15047857833215517]
>>> timeit.repeat("''.__add__('')")
[0.17265801569533323, 0.1535966538865523, 0.15308880997304186]
For the common case of adding builtin numbers or empty strings, the
binary operation is about 8 times as fast as the dict lookup and
function call. For empty lists, the ratio is about 3:
>>> timeit.repeat('[]+[]')
[0.09728684696551682, 0.08233527043626054, 0.08230698857164498]
>>> timeit.repeat('[].__add__([])')
[0.22780949582033827, 0.2060266193825555, 0.2060967092206738]
Conclusions:
1. Python-level function calls to C wrappers of C functions are as slow
as calls to Python functions (which I already knew to be relatively slow).
2. Taking into account the interpreter's internal binary operations on
builtin Python objects, I suspect nearly everyone benefits from the slot
optimization.
3. The total BINARY_ADD + function call time for strings and numbers,
about .02 microseconds, is much less than the .09 microsecond difference
and cannot account for it.
4. There might be some avoidable overhead within PyNumber_Add that only
affects custom-class instances (but I am done for tonight ;-).
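Conclusion 1 is easy to re-check on your own machine (a sketch of mine; g is a made-up stand-in Python function, and absolute numbers will of course vary):

```python
# Compare a bound C slot wrapper called from Python with a plain
# Python function doing the same work; on CPython both pay the
# general call overhead.
import timeit

setup = "add1 = (1).__add__\ndef g(x): return x + 1"
t_wrapper = min(timeit.repeat('add1(1)', setup, number=1_000_000, repeat=3))
t_python  = min(timeit.repeat('g(1)', setup, number=1_000_000, repeat=3))
print(f"C wrapper: {t_wrapper:.3f}s   Python function: {t_python:.3f}s")
```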
--
Terry Jan Reedy
--
https://mail.python.org/mailman/listinfo/python-list