On Fri, Jul 19, 2024 at 11:08 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 7/19/24 2:55 AM, demin.han wrote:
> > Currently, some binops of vector vs double scalar under RV32 can't be
> > translated to vf and become vfmv+vxx.vv instead.
> >
> > The cause is that vec_duplicate is also expanded to broadcast for double
> > mode under RV32. last-combine can't process the expanded broadcast.
> >
> > gcc/ChangeLog:
> >
> >       * config/riscv/vector.md: Add !FLOAT_MODE_P constraint
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/riscv/rvv/autovec/binop/vadd-rv32gcv-nofm.c: Fix test
> >       * gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv-nofm.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/binop/vmul-rv32gcv-nofm.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/binop/vsub-rv32gcv-nofm.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_copysign-rv32gcv.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fadd-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fadd-2.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fadd-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fadd-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-5.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-6.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmax-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmax-2.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmax-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmax-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmin-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmin-2.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmin-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmin-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-5.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-6.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmul-1.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmul-2.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmul-3.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmul-4.c: Ditto
> >       * gcc.target/riscv/rvv/autovec/cond/cond_fmul-5.c: Ditto
> It looks like vadd-rv32gcv-nofm still isn't quite right according to the
> pre-commit testing:
>
> https://github.com/ewlu/gcc-precommit-ci/issues/1931#issuecomment-2238752679
>
>
> OK once that's fixed.  No need to wait for another review cycle.
>
> And a note.  We need to be careful as some uarchs may pay a penalty when
> the vector unit needs to get an operand from the GP or FP register
> files.  So there could well be cases where using .vf or .vx forms is
> slower.  Consider these two scenarios.
>
> First, we broadcast from the GP/FP across a vector register outside a
> loop, then use a .vv form in the loop.
>
> Second, we use a .vf or .vx form in the loop instead without any broadcast.
>
> In the former case we only pay the penalty for crossing register files
> once.  In the second case we'd pay it for every iteration of the loop.
>
> Given this is going to be uarch sensitive, I don't mind biasing towards
> the .vx/.vf forms right now, but we may need to add some costing models
> to this in the future as we can test on a wider variety of uarchs.

Just wanted to chime in to say that this should indeed be a tuning
decision, but our mental model should bias us in favor of the .vf/.vx
forms when we don't have any additional information.

It's a safe assumption that, for all uarches, it's better to use a
.vf/.vx form if the scalar operand is used only once.  If the scalar
is loop-invariant, then it's definitely uarch-dependent as to whether
a hoisted splat is preferable to repeated use of .vf/.vx.  (For
SiFive's in-order vector units, the splat is pure overhead; the
.vf/.vx forms are preferred.  I know the same is not true of other
uarches, though.)
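
To make the loop-invariant case concrete, here is a minimal C sketch.
The function name add_scalar and the register choices in the comments
are illustrative assumptions, not output from any particular compiler;
the comments only outline the two codegen shapes being compared.

```c
#include <stddef.h>

/* Hypothetical kernel: s is loop-invariant, so a vectorizer can either
 * splat it once or feed it to a .vf instruction on every iteration. */
void add_scalar(double *y, const double *x, double s, size_t n)
{
    /*
     * Shape A (hoisted splat, .vv form in the loop):
     *     vfmv.v.f  v2, fa0        # once, before the loop
     *   loop:
     *     vfadd.vv  v1, v1, v2     # no scalar operand inside the loop
     *
     * Shape B (.vf form in the loop, no splat):
     *   loop:
     *     vfadd.vf  v1, v1, fa0    # reads the FP register every iteration
     */
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] + s;
}
```

Shape A pays for the register-file crossing once but adds a splat and
ties up a vector register; shape B has no setup cost but touches the FP
register file on each iteration.  Which one wins is the uarch-dependent
part.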

There's an additional complicating factor: when the scalar operand
comes from memory, some uarches will prefer to use a strided load with
rs2=x0, rather than a scalar load followed by .vf/.vx, or a scalar
load followed by a splat.  (For SiFive's in-order vector units, this
optimization is profitable when the load is a cache miss, and it's a
de-optimization otherwise.  It isn't a case that's easy to tune for,
so thus far we've relegated it to hand-written code.)
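
As a rough sketch of the memory-operand case, the three alternatives
look like this for a simple kernel.  The function name add_from_mem and
the registers shown in the comments are assumptions made for
illustration, not compiler output.

```c
#include <stddef.h>

/* Hypothetical kernel: the scalar operand lives in memory at *p. */
void add_from_mem(double *y, const double *x, const double *p, size_t n)
{
    /* Read the scalar once so stores to y cannot change it. */
    const double s = *p;

    /*
     * Option 1 (scalar load, then .vf):
     *     fld       fa0, 0(a2)
     *     vfadd.vf  v1, v1, fa0
     *
     * Option 2 (scalar load, then splat, then .vv):
     *     fld       fa0, 0(a2)
     *     vfmv.v.f  v2, fa0
     *     vfadd.vv  v1, v1, v2
     *
     * Option 3 (strided load with rs2 = x0, then .vv):
     *     vlse64.v  v2, (a2), zero   # stride 0: every element loads *p
     *     vfadd.vv  v1, v1, v2
     */
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] + s;
}
```

Option 3 keeps the value out of the FP register file entirely, which is
why its payoff depends on how the load itself behaves.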


>
>
> jeff
>
