Hi, Richard.

I tried hard in the RISC-V backend, but I found that the case with -march=rv64gcv_zvl4096b cannot be fixed without counting the vec_to_scalar cost.
Is there a way to count the vec_to_scalar cost without this piece of code in the middle-end?

  /* ??? Enable for loop costing as well.  */
  if (!loop_vinfo)
    record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                      0, vect_epilogue);

Since it's stage 4, I guess we can't change this now.

juzhe.zh...@rivai.ai

From: Richard Biener
Date: 2024-01-11 17:57
To: Robin Dapp
CC: juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp <rdapp....@gmail.com> wrote:
>
> On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> > Oh. I see I think I have done wrong here.
> >
> > I should adjust cost for VEC_EXTRACT not VEC_SET.
> >
> > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> > cost in vect.dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.

It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has

  /* ??? Enable for loop costing as well.  */
  if (!loop_vinfo)
    record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                      0, vect_epilogue);

so live ops are not costed at all.  I would suggest to try unconditionally
enabling this?

> vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
> bit high when a regular vector instruction is "1".
> It should rather be dependent on the latency between register
> files.  We can't really say in general but I'd say "2" is not so bad.
>
> I would suggest adding special handling in builtin_vectorization_cost
> like:
>
>   /* Add register-register latency.  */
>   case scalar_to_vec:
>     return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
>
> and adjust register_move_cost accordingly.  Instead of using
> register_move_cost we could also use a cost structure directly.
> (E.g. like aarch64's regmove tuning structures.  Those don't
> contain VRs but for us it could make sense to add them).
>
> > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize
> > -fdump-tree-vect-details" } */
> With a cost of "3" we still vectorize for zvl512b and larger.
> Is that intended?  I don't really see why 512 should vectorized
> but 256 not.  Disregarding that everything should be optimized
> away, 2 iterations for the whole loop with 256 bits doesn't
> seem that bad.
>
> Regards
>  Robin
>