On 11/25/20 6:18 AM, Stefan Kanthak wrote:
> Jeff Law <l...@redhat.com> wrote:
>
>> On 11/10/20 10:21 AM, Stefan Kanthak wrote:
>>
>>>> So with all that in mind, I installed everything except the bits which
>>>> have the LIBGCC2_BAD_CODE ifdefs after testing on the various crosses.
>>>> If you could remove the ifdefs on the abs/mult changes and resubmit them
>>>> it'd be appreciated.
>>> Done.
>> Thanks. I'm doing some testing on the abs changes right now. They look
>> pretty reasonable, though they will tend to generate worse code on
>> targets that don't handle overflow arithmetic and testing all that well.
> OTOH the changes yield better code on targets which have a proper
> overflow handling, and may benefit from eventual improvements in the
> compiler/code generator itself on all targets.
I mentioned it mostly because I wanted others to be aware that there are
targets where the abs changes may generate slightly worse code and that
resolution is (IMHO) mostly a matter of improving overflow handling in
the target.  These issues are small enough that I don't think they
should hinder the abs changes moving forward.
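
For readers unfamiliar with the overflow builtins, here is a minimal sketch of an overflow-checked abs built on __builtin_sub_overflow. It is not the submitted patch and not libgcc's __absvSI2: the name checked_abs and the "return a flag" convention are assumptions made for illustration (the libgcc routines abort on overflow). How well the builtin expands depends on the target's overflow support, which is the code-quality concern raised above.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper (illustration only, not libgcc code).
   Stores |a| in *res and returns true if |a| is not representable,
   which for two's complement int only happens for INT_MIN.  */
static bool
checked_abs (int a, int *res)
{
  if (a >= 0)
    {
      *res = a;
      return false;
    }
  /* Negate as 0 - a; the builtin reports signed overflow.  */
  return __builtin_sub_overflow (0, a, res);
}
```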


>
>> Also note that your approach always does 3 multiplies, which can be very
>> expensive on some architectures. The existing version in libgcc2.c will
>> often just do one or two multiplies. So while your implementation looks
>> a lot simpler, I suspect it's often much slower. And on targets without
>> 32-bit multiplication support, it's probably horribly bad.
> All (current) processors I know have super-scalar architecture and a
> hardware multiplier, they'll execute the 3 multiplies in parallel.
> In my tests on i386 and AMD64 (Core2, Skylake, Ryzen/EPYC), the code
> generated for <https://skanthak.homepage.t-online.de/gcc.html#case17>
> as well as the (almost) branch-free code shown below and in
> <https://skanthak.homepage.t-online.de/gcc.html#case13> runs 10% to 25%
> faster than the __mulvDI3 routine from libgcc: the many conditional
> branches of your current implementation impair performance more than 3
> multiplies!
All the world is not an x86.  GCC supports over 30 distinct processor
types, many of which target the embedded world.   Those chips often have
limited multiply capabilities and they're often quite slow with
no/minimal pipelining and no superscalar or out of order capabilities.

The fact that it runs faster on x86 is good, but we have to think more
broadly.  As it stands right now I'm not going to put the
multiply changes in.  If you wanted to rework them so they're less
costly on the embedded targets, then that would be helpful.


>
>> My inclination is to leave the overflow checking double-word multiplier
>> as-is.
> But see <https://gcc.gnu.org/pipermail/gcc/2020-October/234048.html> ff.
Already read and considered it. 
>
>> Though I guess you could keep the same structure as the existing
>> implementation which tries to avoid unnecessary multiplies and still use
>> the __builtin_{add,mul}_overflow to simplify the code a bit less
>> aggressively.
> Tertium datur: take a look at the __udivmodDI4 routine.
> It has separate code paths for targets without hardware divider, and
> also for targets where the hardware divider needs a normalized dividend.
> I therefore propose to add separate code paths for targets with and
> without hardware multiplier for the __mulvDI3 routine too, guarded by a
> preprocessor macro which tells whether a target has a hardware multiplier.
I don't think there is a way to indicate that a hardware multiplier is
available, or what capabilities it has.  Some targets just have a 16x16
multiplier, with or without widening variants, and that can depend on
precisely which revision of the chip you're targeting, which in turn can
change based on compiler flags.  I would oppose a change that adds
something like TARGET_HAS_NO_HW_DIVIDE.  That's a wart and one I don't
want to see spread further.

Instead keep the tests that detect the special cases that don't need as
many multiplies and use the overflow builtins within that implementation
framework.  In cases where we can use the operands directly, that's
helpful as going through the struct/union likely leads to unnecessary
register shuffling.
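
As a rough sketch of that shape (not the actual libgcc2.c code): below, int64_t stands in for the double-word type and int32_t for a word. The name checked_mul64 and the "return a flag" convention are assumptions for illustration; __mulvDI3 itself aborts on overflow and distinguishes more special cases than this. The cheap test catches the case that needs only one widening multiply, and everything else goes to the builtin on the operands directly, without a struct/union.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper (illustration only, not libgcc's __mulvDI3).
   Stores a * b in *res and returns true on signed overflow.  */
static bool
checked_mul64 (int64_t a, int64_t b, int64_t *res)
{
  /* Special case: both operands fit in one 32-bit word, so a single
     32x32->64 widening multiply is enough and cannot overflow.
     The narrowing casts rely on GCC's modulo conversion behavior.  */
  if (a == (int32_t) a && b == (int32_t) b)
    {
      *res = (int64_t) (int32_t) a * (int32_t) b;
      return false;
    }

  /* General case: let the builtin do the full checked multiply.  */
  return __builtin_mul_overflow (a, b, res);
}
```

The real routine would add the "one operand fits in a word" cases in the same way before falling back to the general path.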

jeff
