Hi, Richard.

I tried hard to handle this in the RISC-V backend. I found that the case with
-march=rv64gcv_zvl4096b cannot be fixed without counting the vec_to_scalar cost.

Is there a way to count the vec_to_scalar cost without changing this piece of
code in the middle-end?

      /* ???  Enable for loop costing as well.  */
      if (!loop_vinfo)
        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                          0, vect_epilogue);

Since it's stage 4, I guess we can't change this now.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2024-01-11 17:57
To: Robin Dapp
CC: juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp <rdapp....@gmail.com> wrote:
>
> On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> > Oh, I see. I think I got this wrong here.
> >
> > I should adjust cost for VEC_EXTRACT not VEC_SET.
> >
> > But it's odd: I didn't see the loop vectorizer scanning any scalar_to_vec
> > cost in vect.dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.
 
It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has
 
      /* ???  Enable for loop costing as well.  */
      if (!loop_vinfo)
        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                          0, vect_epilogue);
 
so live ops are not costed at all.  I would suggest trying to enable this
unconditionally.
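
A rough, self-contained illustration of what that guard does. This is a toy
model, not the actual vectorizer code: the types, the `is_loop` parameter
(standing in for `loop_vinfo != NULL`) and the `guarded` flag are made up for
the example; only the names vec_to_scalar and vect_epilogue come from the
snippet above.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for the vectorizer's cost kinds and cost locations.
// These are NOT GCC's real declarations.
enum cost_kind { vector_stmt, scalar_to_vec, vec_to_scalar };
enum cost_where { vect_prologue, vect_body, vect_epilogue };

struct cost_entry {
  int count;
  cost_kind kind;
  cost_where where;
};

// Hypothetical model of costing one vectorized live operation.
// guarded == true  mimics the `if (!loop_vinfo)` check quoted above.
// guarded == false mimics recording the cost unconditionally.
void cost_live_operation(std::vector<cost_entry> &cost_vec,
                         bool is_loop, bool guarded)
{
  if (guarded && is_loop)
    return;  // loop case: the live op is not costed at all
  cost_vec.push_back({1, vec_to_scalar, vect_epilogue});
}

// Sum the counts recorded for one cost kind.
int count_kind(const std::vector<cost_entry> &cost_vec, cost_kind kind)
{
  int total = 0;
  for (const cost_entry &e : cost_vec)
    if (e.kind == kind) {
      assert(e.count >= 0);
      total += e.count;
    }
  return total;
}
```

With the guard in place, a loop records no vec_to_scalar entry, so the target
cost hook never gets a chance to price the extract; without the guard, one
epilogue entry is recorded.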
 
> vmv.vx corresponds to scalar_to_vec and I'd say 3 seems a
> bit high when a regular vector instruction is "1".
> It should rather be dependent on the latency between register
> files.  We can't really say in general but I'd say "2" is not so bad.
>
> I would suggest adding special handling in builtin_vectorization_cost
> like:
>
> /* Add register-register latency.  */
> case scalar_to_vec:
>   return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
>
> and adjust register_move_cost accordingly.  Instead of using
> register_move_cost we could also use a cost structure directly.
> (E.g. like aarch64's regmove tuning structures.  Those don't
> contain VRs but for us it could make sense to add them).
>
> > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */
> With a cost of "3" we still vectorize for zvl512b and larger.
> Is that intended?  I don't really see why 512 should be vectorized
> but 256 not.  Disregarding that everything should be optimized
> away, 2 iterations for the whole loop with 256 bits doesn't
> seem that bad.
>
> Regards
>  Robin
>
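
For reference, the idea quoted above of adding the register-register latency
to scalar_to_vec could look roughly like the following standalone sketch. It
does not use the real RISC-V backend structures: `toy_vector_costs`,
`toy_regmove_costs`, the `gr_to_vr` field and `toy_vectorization_cost` are
hypothetical names, and the numbers in the usage note are purely illustrative.

```cpp
#include <cassert>

// Toy cost kinds; only the names mirror the ones discussed in this thread.
enum vect_cost_kind { vector_stmt, scalar_to_vec, vec_to_scalar };

// Hypothetical per-CPU base costs for vector operations.
struct toy_vector_costs {
  int vector_stmt_cost;
  int scalar_to_vec_cost;
  int vec_to_scalar_cost;
};

// Hypothetical register-move tuning: latency of a GPR -> vector register move.
struct toy_regmove_costs {
  int gr_to_vr;
};

// Return the cost of one operation of the given kind.
// scalar_to_vec pays its base cost plus the register-file crossing latency;
// every other kind uses its base cost unchanged.
int toy_vectorization_cost(vect_cost_kind kind,
                           const toy_vector_costs &costs,
                           const toy_regmove_costs &moves)
{
  assert(moves.gr_to_vr >= 0);
  switch (kind) {
    case scalar_to_vec:
      return costs.scalar_to_vec_cost + moves.gr_to_vr;
    case vec_to_scalar:
      return costs.vec_to_scalar_cost;
    default:
      return costs.vector_stmt_cost;
  }
}
```

With a base cost of 1 for every kind and a move latency of 1, scalar_to_vec
comes out at 2 while a regular vector statement stays at 1, which keeps the
latency a per-CPU tuning knob instead of hard-coding 3.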
 
