In a post about CPython's GIL, Steven D'Aprano pointed to Armin
Ronacher's criticism of the internal type slots used for dunder methods.
> http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see/
I found the following interesting.
"
Since we have an __add__ method the interpreter will set this up in a
slot. So how fast is it? When we do a + b we will use the slots, so here
is what it times it as:
$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.256 usec per loop
If we do however a.__add__(b) we bypass the slot system. Instead the
interpreter is looking in the instance dictionary (where it will not
find anything) and then looks in the type's dictionary where it will
find the method. Here is where that clocks in at:
$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.158 usec per loop
Can you believe it: the version without slots is actually faster. What
magic is that? I'm not entirely sure what the reason for this is,"
Curious myself, I repeated the result on my Win7 machine and got almost
the same numbers.
>>> class A:
        def __add__(self, other): return 2
>>> timeit.repeat('a + b', 'from __main__ import A; a=A(); b=A()')
[0.26080520927348516, 0.24120280310165754, 0.2412111032140274]
>>> timeit.repeat('a.__add__(b)', 'from __main__ import A; a=A(); b=A()')
[0.17656398710346366, 0.15274235713354756, 0.1528444177747872]
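As a side check (my own sketch, not from either post): the implicit a + b consults the slot on the type only, so shadowing __add__ in the instance dict changes the explicit call but not the operator.

```python
# Sketch: implicit "a + b" uses the type slot, while explicit
# "a.__add__(b)" does a normal attribute lookup that starts in
# the instance dict.
class A:
    def __add__(self, other):
        return 2

a, b = A(), A()
a.__add__ = lambda other: 99   # shadow the dunder on the instance

print(a + b)         # still 2: the operator ignores the instance dict
print(a.__add__(b))  # 99: the explicit lookup finds the instance attribute
```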
First I looked at the byte code.
>>> dis('a+b')
1 0 LOAD_NAME 0 (a)
3 LOAD_NAME 1 (b)
6 BINARY_ADD
7 RETURN_VALUE
>>> dis('a.__add__(b)')
1 0 LOAD_NAME 0 (a)
3 LOAD_ATTR 1 (__add__)
6 LOAD_NAME 2 (b)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 RETURN_VALUE
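(A version-independent cross-check of my own, since the exact opcode names change between releases: dis.get_instructions shows the operator form needs no attribute load and no explicit call instruction.)

```python
# Compare the instruction streams for the two spellings; the
# __add__ call needs an attribute lookup plus a call on top of
# the name loads, so its stream is longer.
import dis

for expr in ('a + b', 'a.__add__(b)'):
    names = [i.opname for i in dis.get_instructions(expr)]
    print(expr, '->', names)
```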
Next, the core of the BINARY_ADD code in Python/ceval.c:
if (PyUnicode_CheckExact(left) &&
PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(left, right, f, next_instr);
/* unicode_concatenate consumed the ref to v */
}
else {
    sum = PyNumber_Add(left, right);
}
By the language definition, PyNumber_Add must first check whether
issubclass(type(b), type(a)). If so, it tries b.__radd__(a). Otherwise
it tries a.__add__(b). So BINARY_ADD has extra overhead before it calls
__add__. Is that enough to explain the .09 microsecond difference
between .25 and .16 microseconds?
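A rough Python-level sketch of that dispatch rule (my own simplification; the real logic is C code in Objects/abstract.c working on type slots, and it also checks that the right-hand subclass actually overrides the reflected method):

```python
# Simplified model of the binary-add dispatch: a proper subclass on
# the right gets first shot at __radd__, then __add__ on the left,
# then __radd__ on the right as a fallback.
def binary_add(a, b):
    ta, tb = type(a), type(b)
    tried_radd = False
    if tb is not ta and issubclass(tb, ta):
        radd = getattr(tb, '__radd__', None)
        # Only jump the queue if the subclass overrides __radd__.
        if radd is not None and radd is not getattr(ta, '__radd__', None):
            tried_radd = True
            result = radd(b, a)
            if result is not NotImplemented:
                return result
    add = getattr(ta, '__add__', None)
    if add is not None:
        result = add(a, b)
        if result is not NotImplemented:
            return result
    if tb is not ta and not tried_radd:
        radd = getattr(tb, '__radd__', None)
        if radd is not None:
            result = radd(b, a)
            if result is not NotImplemented:
                return result
    raise TypeError(f"unsupported operand type(s) for +: "
                    f"{ta.__name__!r} and {tb.__name__!r}")

print(binary_add(1, 2))      # 3
print(binary_add('a', 'b'))  # ab
```

All of these checks run before the custom __add__ is ever reached, which is the extra work the operator path pays for.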
Let's try some builtins...
>>> timeit.repeat('1+1')
[0.04067762117549266, 0.019206152658126363, 0.018796680446902643]
>>> timeit.repeat('1.0+1.0')
[0.032686457413774406, 0.023207729064779414, 0.018793606331200863]
>>> timeit.repeat('(1.0+1j) + (1.0-1j)')
[0.037775348543391374, 0.01876409482042618, 0.018812358436889554]
>>> timeit.repeat("''+''")
[0.04073695160855095, 0.018977745861775475, 0.018800676797354754]
>>> timeit.repeat("'a'+'b'")
[0.04066932106320564, 0.01896145304840502, 0.01879268409652468]
>>> timeit.repeat('1 .__add__(1)')
[0.16622020259652004, 0.15244908649577837, 0.15047857833215517]
>>> timeit.repeat("''.__add__('')")
[0.17265801569533323, 0.1535966538865523, 0.15308880997304186]
For the common case of adding builtin numbers or empty strings, the
binary operation is about 8 times as fast as the dict lookup and
function call. For empty lists, the ratio is about 3:
>>> timeit.repeat('[]+[]')
[0.09728684696551682, 0.08233527043626054, 0.08230698857164498]
>>> timeit.repeat('[].__add__([])')
[0.22780949582033827, 0.2060266193825555, 0.2060967092206738]
Conclusions:
1. Python-level function calls to C wrappers of C functions are as slow
as calls to Python functions (which I already knew to be relatively slow).
2. Taking into account the interpreter's internal binary operations on
builtin Python objects, I suspect nearly everyone benefits from the slot
optimization.
3. The total BINARY_ADD + function call time for strings and numbers,
about .02 microseconds, is much less than the .09 microsecond difference
and cannot account for it.
4. There might be some avoidable overhead within PyNumber_Add that only
affects custom-class instances (but I am done for tonight ;-).
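Conclusion 1 is easy to re-check on your own machine (a sketch of mine; g is a made-up stand-in Python function, and absolute numbers will of course vary):

```python
# Compare a bound C slot wrapper called from Python with a plain
# Python function doing the same work; on CPython both pay the
# general call overhead.
import timeit

setup = "add1 = (1).__add__\ndef g(x): return x + 1"
t_wrapper = min(timeit.repeat('add1(1)', setup, number=1_000_000, repeat=3))
t_python  = min(timeit.repeat('g(1)', setup, number=1_000_000, repeat=3))
print(f"C wrapper: {t_wrapper:.3f}s   Python function: {t_python:.3f}s")
```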
--
Terry Jan Reedy
--
https://mail.python.org/mailman/listinfo/python-list