Jeff Law <l...@redhat.com> wrote: > On 11/10/20 10:21 AM, Stefan Kanthak wrote: > >>> So with all that in mind, I installed everything except the bits which >>> have the LIBGCC2_BAD_CODE ifdefs after testing on the various crosses. >>> If you could remove the ifdefs on the abs/mult changes and resubmit them >>> it'd be appreciated. >> Done. > THanks. I'm doing some testing on the abs changes right now. They look > pretty reasonable, though they will tend to generate worse code on > targets that don't handle overflow arithmetic and testing all that well.
OTOH the changes yield better code on targets which have a proper overflow handling, and may benefit from eventual improvements in the compiler/code generator itself on all targets. > H8 would be an example that generates more efficient code with the > original implementation. But I don't think the H8 is terribly > representative of embedded ports these days, though it's likely to get > better at precisely these kinds of scenarios in the near future as > certain modernization patches continue landing. > > Something like visium is much better for evaluation as it's got a modern > structure which includes reasonable exposure of carry/overflow. Not > surprisingly the visium port generates better code with your improved > abs routines. > > > > > >> libgcc2.patch >> >> --- -/libgcc/libgcc2.c >> +++ +/libgcc/libgcc2.c >> >> #ifdef L_mulvdi3 >> DWtype >> -__mulvDI3 (DWtype u, DWtype v) >> +__mulvDI3 (DWtype a, DWtype b) > [ ... ] > You've essentially collapsed the function into: > > DWtype > __mulvDI3 (DWtype a, DWtype b) > { > DWtype t; > const DWunion u = {.ll = a}; > const DWunion v = {.ll = b}; > DWunion w = {.ll = __umulsidi3 (u.s.low, v.s.low)}; > > if (__builtin_mul_overflow (u.s.low, v.s.high, &t) > || __builtin_add_overflow (t, w.s.high, &w.s.high) > || __builtin_mul_overflow (u.s.high, v.s.low, &t) > || __builtin_add_overflow (t, w.s.high, &w.s.high)) > abort (); > > return w.ll > } > > The first thing to note is that I believe that using "DWtype" for "t" is > going to inhibit the overflow check for the __builtin_mul_overflow calls > that store their result into "t". Your inputs are both 32bits and the > output is 64 bits. Multiplying 2 32bit numbers will not overflow a > 64bit result. As a result those overflow checks are actually removed in > the code we generate for mulvDI3, defeating the entire purpose of using > mulvdi3 instead of muldi3. ARGH, my fault(s). I wanted to show an alternative, but less efficient implementation which I guarded with the #ifdef, but garbled the main implementation. > Second while the result of multiplying u.s.high with v.s.high can never > change the result in w.ll, it does affect the overall overflow status > that we need to compute. > > I think (but have not yet verified) that to fix these problems we need > to change the type of "t" to SItype and add a check like: > > if (u.s.high && v.s.high) > abort (); Correct in both cases! See <https://skanthak.homepage.t-online.de/gcc.html#case17> for my original code. I rewrote it to minimise the changes to the current implementation and introduced 2 silly errors on the way. > Also note that your approach always does 3 multiplies, which can be very > expensive on some architectures. The existing version in libgcc2.c will > often just do one or two multiplies. So while your implementation looks > a lot simpler, I suspect its often much slower. And on targets without > 32bit multiplication support, it's probably horribly bad. All (current) processors I know have super-scalar architecture and a hardware multiplier, they'll execute the 3 multiplies in parallel. In my tests on i386 and AMD64 (Core2, Skylake, Ryzen/EPYC), the code generated for <https://skanthak.homepage.t-online.de/gcc.html#case17> as well as the (almost) branch-free code shown below and in <https://skanthak.homepage.t-online.de/gcc.html#case13> runs 10% to 25% faster than the __mulvDI3 routine from libgcc: the many conditional branches of your current implementation impair performance more than 3 multiplies! > My inclination is to leave the overflow checking double-word multiplier > as-is. See but <https://gcc.gnu.org/pipermail/gcc/2020-October/234048.html> ff. > Though I guess you could keep the same structure as the existing > implementation which tries to avoid unnecessary multiplies and still use > the __builtin_{add,mul}_overflow to simplify the code a bit less > aggressively. Tertium datur: take a look at the __udivmodDI4 routine. It has separate code paths for targets without hardware divider, and also for targets where the hardware divider needs a normalized dividend. I therefore propose to add separate code paths for targets with and without hardware multiplier for the __mulvDI3 routine too, guarded by a preprocessor macro which tells whether a target has a hardware multiplier. regards Stefan --- umulvdi3.s --- # Copyright © 2004-2020, Stefan Kanthak <stefan.kant...@nexgo.de> .arch generic32 .code32 .intel_syntax noprefix .text # [esp+16] = high dword of multiplier # [esp+12] = low dword of multiplier # [esp+8] = high dword of multiplicand # [esp+4] = low dword of multiplicand __umulvdi3: push ebx xor ebx, ebx # ebx = 0 mov ecx, [esp+12] # ecx = high dword of multiplicand cmp ebx, ecx sbb edx, edx # edx = (high dword of multiplicand == 0) ? 0 : -1 # = (multiplicand < 2**32) ? 0 : -1 mov eax, [esp+20] # eax = high dword of multiplier cmp ebx, eax sbb ebx, ebx # ebx = (high dword of multiplier == 0) ? 0 : -1 # = (multiplier < 2**32) ? 0 : -1 and ebx, edx # ebx = (multiplicand < 2**32) # & (multiplier < 2**32) ? 0 : -1 # = (product < 2**64) ? 0 : -1 mov edx, [esp+8] # edx = low dword of multiplicand mul edx # edx:eax = high dword of multiplier # * low dword of multiplicand adc ebx, ebx # ebx = (product < 2**64) ? 0 : * xchg ecx, eax # ecx = high dword of multiplier # * low dword of multiplicand, # eax = high dword of multiplicand mov edx, [esp+16] # edx = low dword of multiplier mul edx # edx:eax = high dword of multiplicand # * low dword of multiplier adc ebx, ebx # ebx = (product < 2**64) ? 0 : * add ecx, eax # ecx = high dword of multiplier # * low dword of multiplicand # + high dword of multiplicand # * low dword of multiplier # adc ebx, ebx # ebx = (product < 2**64) ? 0 : * mov eax, [esp+8] # eax = low dword of multiplicand mov edx, [esp+16] # edx = low dword of multiplier mul edx # edx:eax = low dword of multiplicand # * low dword of multiplier add edx, ecx # edx:eax = product % 2**64 adc ebx, ebx # ebx = (product < 2**64) ? 0 : * jnz .Loverflow # product >= 2**64? pop ebx ret .Loverflow: ud2 .size __umulvdi3, .-__umulvdi3 .type __umulvdi3, @function .global __umulvdi3 .end --- EOF --- --- mulvdi3.s --- # Copyright © 2004-2020, Stefan Kanthak <stefan.kant...@nexgo.de> .arch generic32 .code32 .intel_syntax noprefix .text # [esp+16] = high dword of multiplier # [esp+12] = low dword of multiplier # [esp+8] = high dword of multiplicand # [esp+4] = low dword of multiplicand __mulvdi3: push ebx mov eax, [esp+20] # eax = high dword of multiplier mov ecx, [esp+16] # ecx = low dword of multiplier cdq # edx = (multiplier < 0) ? -1 : 0 mov ebx, edx # ebx = (multiplier < 0) ? -1 : 0 xor ecx, edx xor eax, edx # eax:ecx = (multiplier < 0) ? ~multiplier : multiplier sub ecx, edx sbb eax, edx # eax:ecx = (multiplier < 0) ? -multiplier : multiplier # = |multiplier| mov [esp+16], ecx mov [esp+20], eax # multiplier = |multiplier| mov eax, [esp+12] # eax = high dword of multiplicand mov ecx, [esp+8] # ecx = low dword of multiplicand cdq # edx = (multiplicand < 0) ? -1 : 0 xor ebx, edx # ebx = (multiplier < 0) # ^ (multiplicand < 0) ? -1 : 0 # = (product < 0) ? -1 : 0 xor ecx, edx xor eax, edx # eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand sub ecx, edx sbb eax, edx # eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand # = |multiplicand| mov [esp+8], ecx mov [esp+12], eax # multiplicand = |multiplicand| push ebx # [esp] = (product < 0) ? -1 : 0 xor edx, edx # edx = 0 # mov eax, [esp+16] # eax = high dword of |multiplicand| cmp edx, eax sbb ebx, ebx # ebx = (high dword of |multiplicand| == 0) ? 0 : -1 # = (|multiplicand| < 2**32) ? 0 : -1 mov ecx, [esp+24] # ecx = high dword of |multiplier| cmp edx, ecx sbb edx, edx # edx = (high dword of |multiplier| == 0) ? 0 : -1 # = (|multiplier| < 2**32) ? 0 : -1 and ebx, edx # ebx = (|multiplicand| < 2**32) # & (|multiplier| < 2**32) ? 0 : -1 # = (|product| < 2**64) ? 0 : -1 .ifdef INTO shr ebx, 1 into mov ebx, [esp+20] # ebx = low dword of |multiplier| mul ebx # edx:eax = high dword of |multiplicand| # * low dword of |multiplier| into xchg ecx, eax # ecx = high dword of |multiplicand| # * low dword of |multiplier|, # eax = high dword of |multiplier| mov edx, [esp+12] # edx = low dword of |multiplicand| mul edx # edx:eax = high dword of |multiplier| # * low dword of |multiplicand| into add ecx, eax # ecx = high dword of |multiplicand| # * low dword of |multiplier| # + high dword of |multiplier| # * low dword of |multiplicand| # sbb eax, eax # eax = (|product| < 2**64) ? 0 : -1 # shr eax, 1 # into mov eax, [esp+12] # eax = low dword of |multiplicand| mul ebx # edx:eax = low dword of |multiplicand| # * low dword of |multiplier| add edx, ecx # edx:eax = |product % 2**64| # = |product| % 2**64 sbb ecx, ecx # ecx = (|product| < 2**64) ? 0 : -1 shr ecx, 1 into pop ebx # ebx = (product < 0) ? -1 : 0 add eax, ebx adc edx, ebx # edx:eax = (product < 0) ? ~product % 2**64 : product % 2**64 mov ecx, edx shr ecx, 1 into xor eax, ebx xor edx, ebx # edx:eax = product % 2**64 pop ebx ret .else mov edx, [esp+20] # edx = low dword of |multiplier| mul edx # edx:eax = high dword of |multiplicand| # * low dword of |multiplier| adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : * xchg ecx, eax # ecx = high dword of |multiplicand| # * low dword of |multiplier|, # eax = high dword of |multiplier| mov edx, [esp+12] # edx = low dword of |multiplicand| mul edx # edx:eax = high dword of |multiplier| # * low dword of |multiplicand| adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : * add ecx, eax # ecx = high dword of |multiplicand| # * low dword of |multiplier| # + high dword of |multiplier| # * low dword of |multiplicand| # adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : * mov eax, [esp+12] # eax = low dword of |multiplicand| mov edx, [esp+20] # edx = low dword of |multiplier| mul edx # edx:eax = low dword of |multiplicand| # * low dword of |multiplier| add edx, ecx # edx:eax = |product % 2**64| # = |product| % 2**64 adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : * pop ecx # ecx = (product < 0) ? -1 : 0 xor eax, ecx xor edx, ecx # edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64 sub eax, ecx sbb edx, ecx # edx:eax = product % 2**64 xor ecx, edx # ecx = (multiplicand < 0) # ^ (multiplier < 0) # ^ (product < 0) ? negative : positive add ecx, ecx adc ebx, ebx # ebx = (-2**63 <= product < 2**63) ? 0 : * jnz .Loverflow # product < -2**63? # product >= 2**63? pop ebx ret .Loverflow: ud2 .endif # INTO .size __mulvdi3, .-__mulvdi3 .type __mulvdi3, @function .global __mulvdi3 .end --- EOF ---