On Fri, Jul 19, 2024 at 11:08 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
> On 7/19/24 2:55 AM, demin.han wrote:
> > Currently, some binops of vector vs double scalar under RV32 can't
> > be translated to vf but vfmv+vxx.vv.
> >
> > The cause is that vec_duplicate is also expanded to broadcast for
> > double mode under RV32. last-combine can't process expanded broadcast.
> >
> > gcc/ChangeLog:
> >
> > * config/riscv/vector.md: Add !FLOAT_MODE_P constraint
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/riscv/rvv/autovec/binop/vadd-rv32gcv-nofm.c: Fix test
> > * gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv-nofm.c: Ditto
> > * gcc.target/riscv/rvv/autovec/binop/vmul-rv32gcv-nofm.c: Ditto
> > * gcc.target/riscv/rvv/autovec/binop/vsub-rv32gcv-nofm.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_copysign-rv32gcv.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fadd-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fadd-2.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fadd-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fadd-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-5.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-6.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmax-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmax-2.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmax-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmax-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmin-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmin-2.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmin-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmin-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-5.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-6.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmul-1.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmul-2.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmul-3.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmul-4.c: Ditto
> > * gcc.target/riscv/rvv/autovec/cond/cond_fmul-5.c: Ditto
> It looks like vadd-rv32gcv-nofm still isn't quite right according to the
> pre-commit testing:
>
> https://github.com/ewlu/gcc-precommit-ci/issues/1931#issuecomment-2238752679
>
> OK once that's fixed. No need to wait for another review cycle.
>
> And a note. We need to be careful as some uarchs may pay a penalty when
> the vector unit needs to get an operand from the GP or FP register
> files. So there could well be cases where using .vf or .vx forms is
> slower. Consider these two scenarios.
>
> First, we broadcast from the GP/FP across a vector register outside a
> loop, then use a .vv form in the loop.
>
> Second, we use a .vf or .vx form in the loop instead without any broadcast.
>
> In the former case we only pay the penalty for crossing register files
> once. In the second case we'd pay it for every iteration of the loop.
>
> Given this is going to be uarch sensitive, I don't mind biasing towards
> the .vx/.vf forms right now, but we may need to add some costing models
> to this in the future as we can test on a wider variety of uarchs.

Just wanted to chime in to say that this should indeed be a tuning
decision, but our mental model should bias us in favor of the .vf/.vx
forms when we don't have any additional information.

It's a safe assumption that, for all uarches, it's better to use a
.vf/.vx form if the scalar operand is used only once. If the scalar is
loop-invariant, then it's definitely uarch-dependent as to whether a
hoisted splat is preferable to repeated use of .vf/.vx. (For SiFive's
in-order vector units, the splat is pure overhead; the .vf/.vx forms
are preferred. I know the same is not true of other uarches, though.)

There's an additional complicating factor: when the scalar operand
comes from memory, some uarches will prefer to use a strided load with
rs2=x0, rather than a scalar load followed by .vf/.vx, or a scalar load
followed by a splat. (For SiFive's in-order vector units, this
optimization is profitable when the load is a cache miss, and it's a
de-optimization otherwise. It isn't a case that's easy to tune for, so
thus far we've relegated it to hand-written code.)

>
> jeff
>
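
For concreteness, here is a minimal sketch of the kind of loop this
thread is about: a vector of doubles combined with a loop-invariant
double scalar. The function name and signature are made up for
illustration and are not taken from the patch or its tests. The
assembly in the comment only shows the two general shapes discussed
above (register numbers are arbitrary); it is not actual compiler
output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (illustrative name, not from the patch): add a
   loop-invariant double scalar to every element of an array. */
void vadd_scalar(double *restrict dst, const double *restrict a,
                 double b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL));

    /* Two lowerings a vectorizer might choose for the loop body:
     *
     *   (1) scalar-operand form; the FP register is read by the
     *       vector unit on every iteration:
     *         loop: vfadd.vf  v8, v8, fa0
     *
     *   (2) hoisted splat; the FP register is read once, before
     *       the loop, and the loop uses the .vv form:
     *               vfmv.v.f  v16, fa0
     *         loop: vfadd.vv  v8, v8, v16
     */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b;
}
```

Which shape comes out depends on the compiler version and tuning. One
way to check, assuming a RISC-V GCC with vector support, is to compile
with something like -O3 -march=rv32gcv -mabi=ilp32d and look at the
generated assembly.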