On 8/29/23 01:41, Richard Biener wrote:

     _1 = a * b;
     _2 = .FMA (c, d, _1);
     acc_1 = acc_0 + _2;

How can we execute the multiply and the FMA in parallel?  They
depend on each other.  Or is it that the uarch can handle the
dependence on the add operand, but only when it comes from a
multiplication and not an FMA, in some better way?  (I'd doubt
that much complexity)
I've worked on an architecture that could almost do that. The ops didn't run in parallel; instead they ran serially as "chained" FP ops.

Essentially, in cases where you could chain them, they became a single instruction. These were fully pipelined, issuing every cycle. Latency was 1c faster than if you'd issued the ops as distinct instructions. More importantly, by combining the two FP ops into a single instruction you could issue more FP ops per cycle, which significantly helps many FP codes. It's safe to assume this required notable additional FP hardware, but it's something we already had in the design for other purposes.

I keep hoping that architecture becomes public. There were some other really interesting features in the design that could be incorporated into other designs with minimal hardware cost.

Can you explain in more detail how the uarch executes one vs. the
other case?
Probably can't say more than I already have.

Anyway, given that the architecture in question is still private and no longer something I have to champion, if you want to move forward with the patch, I won't object.

jeff
