Mark Dickinson <dicki...@gmail.com> added the comment:

It finally occurred to me that what might be killing 32-bit performance is the divisions, rather than the multiplications.

To test this, here's a version of 30bit_longdigit17.patch that replaces just *two* of the divisions in Objects/longobject.c with the appropriate x86 divl assembler instruction. The result for pydigits is an astonishing 10% speedup over the non-asm patch! Results of running

    python pydigits_bestof.py 2000

on 32-bit OS X 10.5.6/Core 2 Duo:

unpatched py3k
--------------
Best Time: 2212.6 ms

30bit_longdigit17.patch
-----------------------
Best Time: 2283.9 ms (-3.1% relative to py3k)

30bit_longdigit17+asm.patch
---------------------------
Best Time: 2085.7 ms (+6.1% relative to py3k)

The problem is that (e.g. in the main loop of x_divrem) we're doing a 64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit remainder. From the analysis of the algorithm, *we* know that the quotient will always fit into 32 bits, so that, e.g. on x86, a single divl instruction is appropriate. But unless the compiler is extraordinarily clever it doesn't know this, so it emits an expensive library call to a function that probably involves multiple divisions and/or some branches, and that produces the full 64-bit quotient. On 32-bit PowerPC things are even worse, since there isn't even a 64-by-32 bit divide instruction, only a 32-bit by 32-bit division.

So I could still be persuaded that 30-bit digits should only be enabled by default on 64-bit machines...

Added file: http://bugs.python.org/file13149/30bit_longdigit17+asm.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4258>
_______________________________________