STINNER Victor <vstin...@python.org> added the comment:
The performance issue was noticed by Raymond Hettinger, who ran a microbenchmark on tuplegetter_descr_get() comparing Python 3.8 and Python 3.9:
https://mail.python.org/archives/list/python-...@python.org/message/Q3YHYIKNUQH34FDEJRSLUP2MTYELFWY3/

INADA-san confirms that the performance regression was introduced by commit 45ec5b99aefa54552947049086e87ec01bc2fc9a (bpo-40170), which changed the PyType_HasFeature() implementation to always call PyType_GetFlags() as a function rather than reading the PyTypeObject.tp_flags member directly:
https://mail.python.org/archives/list/python-...@python.org/message/FOKJXG2SYMXCHYPGUZWVYMHLDR42BYFB/

On Fedora 32, there is no performance difference because binaries are built with GCC using LTO and PGO: the PyType_GetFlags() function call is inlined by GCC 10.

I built Python with clang 11.0.3 on macOS 10.15.4, and I confirm that LTO+PGO allows the PyType_GetFlags() function call to be inlined in tuplegetter_descr_get().

Using "./configure && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1001c46ad] <+29>: callq 0x10009c720 ; PyType_GetFlags at typeobject.c:2338
python.exe[0x1001c46b2] <+34>: testl $0x4000000, %eax ; imm = 0x4000000
(...)
---

Using "./configure --with-lto --enable-optimizations && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1002a9542] <+18>: movq 0x10(%rbx), %rdx
python.exe[0x1002a9546] <+22>: movq 0x8(%rsi), %rax
python.exe[0x1002a954a] <+26>: testb $0x4, 0xab(%rax)
python.exe[0x1002a9551] <+33>: je 0x1002a956f ; <+63>
(...)
---

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41181>
_______________________________________
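
To compare builds, a microbenchmark along the following lines could be run against each interpreter. This is only a minimal sketch, not the benchmark Raymond Hettinger actually ran: the Point type and the bench_field_access() helper are hypothetical names chosen for illustration. It relies on the fact that, in CPython 3.8 and later, reading a namedtuple field by name goes through tuplegetter_descr_get().

```python
import timeit
from collections import namedtuple

# Hypothetical example type: in CPython 3.8+, each field of a namedtuple
# is a C-level _tuplegetter descriptor, so "p.x" ends up in
# tuplegetter_descr_get().
Point = namedtuple("Point", ["x", "y"])


def bench_field_access(number=1_000_000, repeat=5):
    """Return the best observed time, in seconds, for one "p.x" access."""
    p = Point(1, 2)
    timer = timeit.Timer("p.x", globals={"p": p})
    # Take the minimum over several runs to reduce noise from the system.
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    per_access = bench_field_access()
    print(f"{per_access * 1e9:.1f} ns per p.x access")
```

Running the same script with a "./configure && make" build and with a "./configure --with-lto --enable-optimizations && make" build should give a rough idea of the cost of the non-inlined PyType_GetFlags() call. Absolute numbers depend on the machine and compiler.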