Hi @ll, I don't use GCC, so I don't know whether there's a benchmark for __udivmodti4() and/or __udivmoddi4() for AMD64 and i386 processors.
If you have one: get my "slow" __udivmodti4() from <https://skanthak.homepage.t-online.de/integer.html#as-1> and run the benchmark, then get my fast __udivmodti4() from <https://skanthak.homepage.t-online.de/integer.html#as-2> and repeat. The "slow" __udivmodti4() should be slightly faster than your current implementation for AMD64, while the fast one should be almost an order of magnitude faster; <https://skanthak.homepage.t-online.de/integer.html#summary> shows my numbers.

While you're there, also benchmark:

- __udivmoddi4() from <https://skanthak.homepage.t-online.de/integer.html#as-3>
- __umoddi3() from <https://skanthak.homepage.t-online.de/integer.html#as-4>
- __moddi3() from <https://skanthak.homepage.t-online.de/integer.html#as-5>
- (after trivial editing) __udivdi3() from <https://skanthak.homepage.t-online.de/integer.html#ml-1>
- (after trivial editing) __divdi3() from <https://skanthak.homepage.t-online.de/integer.html#ml-2>

regards
Stefan
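
In case no ready-made benchmark exists, a minimal timing sketch in C could look like the following. It is only an illustration, not an established GCC benchmark: the helper names time_udivmodti4(), random_u128() and next_random() are hypothetical, and it assumes a 64-bit target where the compiler supports unsigned __int128 and the runtime library (libgcc or compiler-rt) exports __udivmodti4(). To time a replacement routine, its object file would presumably have to be linked ahead of the runtime library. Note that the measured time includes the cost of generating the random operands.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef unsigned __int128 u128;

/* Assumed to be exported by libgcc or compiler-rt on 64-bit targets. */
extern u128 __udivmodti4(u128 dividend, u128 divisor, u128 *remainder);

/* xorshift64 pseudo-random generator; the state must never be zero. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Build a 128-bit value from two 64-bit random words. */
static u128 random_u128(uint64_t *state)
{
    u128 high = next_random(state);
    return (high << 64) | next_random(state);
}

/*
 * Hypothetical helper: performs `count` divisions with pseudo-random
 * operands and returns the elapsed CPU time in seconds. The divisor is
 * shifted right by a random amount so that divisors of many different
 * widths are exercised, and its lowest bit is set so it is never zero.
 * A checksum of all results is stored through `checksum` so the calls
 * cannot be optimised away.
 */
static double time_udivmodti4(unsigned long count, uint64_t seed,
                              uint64_t *checksum)
{
    uint64_t state = seed ? seed : 1;
    uint64_t sum = 0;
    clock_t start, stop;

    assert(checksum != NULL);

    start = clock();
    for (unsigned long i = 0; i < count; i++) {
        u128 dividend = random_u128(&state);
        u128 divisor = random_u128(&state);
        u128 remainder;

        divisor = (divisor >> (next_random(&state) % 128)) | 1;
        u128 quotient = __udivmodti4(dividend, divisor, &remainder);
        sum ^= (uint64_t)quotient ^ (uint64_t)remainder;
    }
    stop = clock();

    *checksum = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_udivmodti4() with the same count and seed once per implementation gives comparable timings, and equal checksums indicate that both implementations returned the same quotients and remainders.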