> Note that since this isn't a regression it'll need to wait for gcc-16 
> development to open before the patch can go forward.

> Pan Li sent a similar patch for vadd.vv/vadd.vx I think in November and I 
> believe he intended to continue when stage 1 opens.

Yes, thanks Robin and Jeff, I will re-send the patch of vadd.vv/vx after stage 
1 open, and then all other
similar cases.

Pan

-----Original Message-----
From: Jeff Law <jeffreya...@gmail.com> 
Sent: Sunday, March 30, 2025 8:31 AM
To: Robin Dapp <rdapp....@gmail.com>; Paul-Antoine Arras <par...@baylibre.com>; 
gcc-patches@gcc.gnu.org; Li, Pan2 <pan2...@intel.com>
Subject: Re: [PATCH] RISC-V: Add pattern for vector-scalar multiply-add/sub 
[PR119100]



On 3/27/25 1:39 PM, Robin Dapp wrote:
> Hi Paul-Antoine,
> 
>> This pattern enables the combine pass to merge a vec_duplicate into a 
>> plus-mult
>> or minus-mult RTL instruction.
>>
>> Before this patch, we have two instructions, e.g.:
>>   vfmv.v.f        v6,fa0
>>   vfmadd.vv       v9,v6,v7
>>
>> After, we get only one:
>>   vfmadd.vf       v9,fa0,v7
>>
>> On SPEC2017's 503.bwaves_r, depending on the workload, the reduction 
>> in dynamic
>> instruction count varies from -4.66% to -4.75%.
> 
> The general issue with this kind of optimization (we have discussed it a 
> few times already) is that, depending on the uarch, we want the local 
> combine optimization that you show but not the fwprop/late-combine one 
> where we propagate a vector broadcast into a loop.
> 
> So IMHO in order to continue with this and similar patterns we need at 
> least accompanying rtx_cost handling that would allow us to tune per uarch.
> 
> Pan Li sent a similar patch for vadd.vv/vadd.vx I think in November and 
> I believe he intended to continue when stage 1 opens.
> 
> An outstanding question is how to distinguish the combine case from the 
> late-combine case.  I haven't yet thought about that in detail.
The other thing we should consider is that we can certainly theorize 
that this kind of register file crossing case can have an extra penalty 
(it traditionally does), we don't have actual evidence that it's causing 
a problem on any RISC-V designs.

So may be the way to go is add a field to the uarch tuning structure 
indicating the additional cost (if any) of a register file crossing 
vector op of this nature.  Then query that in riscv_rtx_costs or 
whatever our rtx_cost function is named.

Default that additional cost to zero initially.  Then uarch experts can 
fill in the appropriate value.  Yea, such a simplistic approach wouldn't 
handle cases like ours where you really need nearby context to be sure, 
but I don't think we want to over-engineer this solution too badly right 
now.

Note that since this isn't a regression it'll need to wait for gcc-16 
development to open before the patch can go forward.

Thanks!
JEff


ps.  I know Baylibre's remit was to test dynamic icounts and there were 
good reasons for that.  So don't worry about not having run it on 
design.  If you happen to still have executables, pass them along 
privately, I can run them on a BPI.  ROMS is a few hours of runtime, but 
that's not a big deal.


Reply via email to