On 8/29/23 01:41, Richard Biener wrote:
> _1 = a * b;
> _2 = .FMA (c, d, _1);
> acc_1 = acc_0 + _2;
> How can we execute the multiply and the FMA in parallel? They
> depend on each other. Or can the uarch handle the dependence on
> the add operand in some better way, but only when it comes from a
> multiplication and not an FMA? (I'd doubt it involves that much
> complexity.)
I've worked on an architecture that could almost do that. The ops
didn't run in parallel, but instead serially as "chained" FP ops.
Essentially, in cases where you could chain them, they became a single
instruction. These were fully pipelined, so a new one could issue every
cycle. Latency was 1c faster than if you'd issued the ops as distinct
instructions.
More importantly, by combining the two FP ops into a single instruction
you could issue more FP ops/cycle, which significantly helps many FP
codes. It's safe to assume this required notable additional FP
hardware, but it's something we already had in the design for other
purposes.
I keep hoping that architecture becomes public. There were some other
really interesting features in the design that could be incorporated
into other designs with minimal hardware cost.
> Can you explain in more detail how the uarch executes one vs. the
> other case?
Probably can't say more than I already have.
Anyway, given that the architecture in question is still private and no
longer something I have to champion, if you want to move forward with
the patch, I won't object.
jeff