On 8/29/23 01:41, Richard Biener wrote:
> _1 = a * b;
> _2 = .FMA (c, d, _1);
> acc_1 = acc_0 + _2;
> How can we execute the multiply and the FMA in parallel? They
> depend on each other. Or can the uarch handle the dependence on
> the add operand in some better way, but only when it comes from a
> multiplication and not an FMA? (I'd doubt it involves that much
> complexity.)
I've worked on an architecture that could almost do that. The ops
didn't run in parallel, but instead serially as "chained" FP ops.
Essentially, in cases where you could chain them, they became a single
instruction. These were fully pipelined, so a new one could issue every
cycle. Latency was 1c faster than if you'd issued the ops as distinct
instructions.
More importantly, by combining the two FP ops into a single instruction
you could issue more FP ops/cycle, which significantly helps many FP
codes. It's safe to assume this required notable additional FP
hardware, but it's something we already had in the design for other
purposes.
I keep hoping that architecture becomes public. There were some other
really interesting features in the design that could be incorporated
into other designs with minimal hardware cost.
> Can you explain in more detail how the uarch executes one vs. the
> other case?
Probably can't say more than I already have.
Anyway, given that the architecture in question is still private and no
longer something I have to champion, if you want to move forward with
the patch, I won't object.
jeff