Thanks Richard for the comments.

> Enabling it via match.pd looks possible but also possibly sub-optimal
> for costing side on the vectorizer - supporting it directly in the
> vectorizer can be done later though.
Sure, will have a try in v2.

Pan

-----Original Message-----
From: Richard Biener <richard.guent...@gmail.com>
Sent: Thursday, October 17, 2024 3:13 PM
To: Li, Pan2 <pan2...@intel.com>
Cc: Richard Sandiford <richard.sandif...@arm.com>; gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; tamar.christ...@arm.com
Subject: Re: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store

On Thu, Oct 17, 2024 at 8:38 AM Li, Pan2 <pan2...@intel.com> wrote:
>
> It has been quite a while since the last discussion.
> I recalled these materials recently and had a try in the risc-v backend.
>
>   1 │ void foo (int * __restrict a, int * __restrict b, int stride, int n)
>   2 │ {
>   3 │   for (int i = 0; i < n; i++)
>   4 │     a[i*stride] = b[i*stride] + 100;
>   5 │ }
>
> We will have an expand similar to the below for VEC_SERIES_EXPR +
> MASK_LEN_GATHER_LOAD.  There will be 8 insns after expand, which is not
> applicable for try_combine (at most 4 insns), if my understanding is
> correct.
>
> Thus, is there any other approach instead of adding a new IFN?  If we
> need to add a new IFN, can we leverage match.pd to try to match the
> MASK_LEN_GATHER_LOAD (base, VEC_SERIES_EXPR, ...) pattern and then emit
> the new IFN, like the sat alu patterns do?

Adding an optab (and direct internal fn) is fine I guess - it should be
modeled after the gather optab, specifying that the vec_series is implicit
with the then scalar stride.

Enabling it via match.pd looks possible but also possibly sub-optimal
for costing side on the vectorizer - supporting it directly in the
vectorizer can be done later though.

Richard.

> Thanks a lot.
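[Editor's note: the "special case of the gather optab" relationship discussed above can be sketched with a scalar reference model. The helper names below are illustrative only, not GCC internals; the point is that a gather whose offset vector is a VEC_SERIES <0, stride> touches exactly the addresses base + i * stride that a single strided load would.]

```c
#include <assert.h>
#include <stddef.h>

/* Reference model of a gather load: out[i] = base[offset[i]],
   where the offset vector is supplied explicitly.  */
static void
ref_gather_load (int *out, const int *base, const ptrdiff_t *offset, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = base[offset[i]];
}

/* Reference model of a strided load: out[i] = base[i * stride].
   With offset[i] = i * stride (i.e. a VEC_SERIES <0, stride>) the two
   models touch the same addresses, which is why the strided IFN can be
   treated as a special case of the gather IFN with an implicit series.  */
static void
ref_strided_load (int *out, const int *base, ptrdiff_t stride, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = base[i * stride];
}
```

The difference is purely one of representation: the strided form carries a single scalar stride instead of materializing the series in a vector register, which is exactly the insn-count saving motivating the new optab.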
> 316 │ ;;   _58 = VEC_SERIES_EXPR <0, _57>;
> 317 │
> 318 │ (insn 17 16 18 (set (reg:DI 156 [ _56 ])
> 319 │         (ashiftrt:DI (reg:DI 141 [ _54 ])
> 320 │             (const_int 2 [0x2]))) -1
> 321 │      (expr_list:REG_EQUAL (div:DI (reg:DI 141 [ _54 ])
> 322 │             (const_int 4 [0x4]))
> 323 │         (nil)))
> 324 │
> 325 │ (insn 18 17 19 (set (reg:DI 158)
> 326 │         (unspec:DI [
> 327 │                 (const_int 32 [0x20])
> 328 │             ] UNSPEC_VLMAX)) -1
> 329 │      (nil))
> 330 │
> 331 │ (insn 19 18 20 (set (reg:RVVM1SI 157)
> 332 │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
> 333 │                     (const_vector:RVVMF32BI repeat [
> 334 │                             (const_int 1 [0x1])
> 335 │                         ])
> 336 │                     (reg:DI 158)
> 337 │                     (const_int 2 [0x2]) repeated x2
> 338 │                     (const_int 1 [0x1])
> 339 │                     (reg:SI 66 vl)
> 340 │                     (reg:SI 67 vtype)
> 341 │                 ] UNSPEC_VPREDICATE)
> 342 │             (vec_series:RVVM1SI (const_int 0 [0])
> 343 │                 (const_int 1 [0x1]))
> 344 │             (unspec:RVVM1SI [
> 345 │                     (reg:DI 0 zero)
> 346 │                 ] UNSPEC_VUNDEF))) -1
> 347 │      (nil))
> 348 │
> 349 │ (insn 20 19 21 (set (reg:DI 160)
> 350 │         (unspec:DI [
> 351 │                 (const_int 32 [0x20])
> 352 │             ] UNSPEC_VLMAX)) -1
> 353 │      (nil))
> 354 │
> 355 │ (insn 21 20 22 (set (reg:RVVM1SI 159)
> 356 │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
> 357 │                     (const_vector:RVVMF32BI repeat [
> 358 │                             (const_int 1 [0x1])
> 359 │                         ])
> 360 │                     (reg:DI 160)
> 361 │                     (const_int 2 [0x2]) repeated x2
> 362 │                     (const_int 1 [0x1])
> 363 │                     (reg:SI 66 vl)
> 364 │                     (reg:SI 67 vtype)
> 365 │                 ] UNSPEC_VPREDICATE)
> 366 │             (mult:RVVM1SI (vec_duplicate:RVVM1SI (subreg:SI (reg:DI 156 [ _56 ]) 0))
> 367 │                 (reg:RVVM1SI 157))
> 368 │             (unspec:RVVM1SI [
> 369 │                     (reg:DI 0 zero)
> 370 │                 ] UNSPEC_VUNDEF))) -1
> 371 │      (nil))
> ...
> 403 │ ;;   vect__5.16_61 = .MASK_LEN_GATHER_LOAD (vectp_b.14_59, _58, 4, { 0, ... }, { -1, ... }, _73, 0);
> 404 │
> 405 │ (insn 27 26 28 (set (reg:RVVM2DI 161)
> 406 │         (sign_extend:RVVM2DI (reg:RVVM1SI 145 [ _58 ]))) "strided_ld-st.c":4:22 -1
> 407 │      (nil))
> 408 │
> 409 │ (insn 28 27 29 (set (reg:RVVM2DI 162)
> 410 │         (ashift:RVVM2DI (reg:RVVM2DI 161)
> 411 │             (const_int 2 [0x2]))) "strided_ld-st.c":4:22 -1
> 412 │      (nil))
> 413 │
> 414 │ (insn 29 28 0 (set (reg:RVVM1SI 146 [ vect__5.16 ])
> 415 │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
> 416 │                     (const_vector:RVVMF32BI repeat [
> 417 │                             (const_int 1 [0x1])
> 418 │                         ])
> 419 │                     (reg:DI 149 [ _73 ])
> 420 │                     (const_int 2 [0x2]) repeated x2
> 421 │                     (const_int 0 [0])
> 422 │                     (reg:SI 66 vl)
> 423 │                     (reg:SI 67 vtype)
> 424 │                 ] UNSPEC_VPREDICATE)
> 425 │             (unspec:RVVM1SI [
> 426 │                     (reg/v/f:DI 151 [ b ])
> 427 │                     (mem:BLK (scratch) [0 A8])
> 428 │                     (reg:RVVM2DI 162)
> 429 │                 ] UNSPEC_UNORDERED)
> 430 │             (unspec:RVVM1SI [
> 431 │                     (reg:DI 0 zero)
> 432 │                 ] UNSPEC_VUNDEF))) "strided_ld-st.c":4:22 -1
> 433 │      (nil))
>
> Pan
>
>
> -----Original Message-----
> From: Li, Pan2 <pan2...@intel.com>
> Sent: Wednesday, June 5, 2024 3:50 PM
> To: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford <richard.sandif...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; tamar.christ...@arm.com
> Subject: RE: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> It is not easy to get the original context/history; I could only catch
> some shadow of the full picture from the patch below.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634683.html
>
> Using gather/scatter with a VEC_SERIES, for example as below, looks
> reasonable to me; I will have a try for this.
>
> operand_0 = mask_gather_loadmn (ptr, offset, 1/0(sign/unsign), multiply, mask)
> offset = (vec_series:m base step) => base + i * step
> op_0[i] = memory[ptr + offset[i] * multiply] && mask[i]
>
> operand_0 = mask_len_strided_load (ptr, stride, mask, len, bias).
> op_0[i] = memory[ptr + stride * i] && mask[i] && i < (len + bias)
>
> Pan
>
> -----Original Message-----
> From: Li, Pan2
> Sent: Wednesday, June 5, 2024 9:18 AM
> To: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford <richard.sandif...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; tamar.christ...@arm.com
> Subject: RE: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> > Sorry if we have discussed this last year already - is there anything wrong
> > with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?
>
> Thanks for the comments; it has been quite a while since the last
> discussion.  Let me recall a little about it and keep you posted.
>
> Pan
>
> -----Original Message-----
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Tuesday, June 4, 2024 9:22 PM
> To: Li, Pan2 <pan2...@intel.com>; Richard Sandiford <richard.sandif...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; tamar.christ...@arm.com
> Subject: Re: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> On Tue, May 28, 2024 at 5:15 AM <pan2...@intel.com> wrote:
> >
> > From: Pan Li <pan2...@intel.com>
> >
> > This patch would like to add new internal functions for the below 2 IFNs.
> >   * mask_len_strided_load
> >   * mask_len_strided_store
> >
> > The GIMPLE v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias) will
> > be expanded into v = mask_len_strided_load (ptr, stride, mask, len, bias).
> >
> > The GIMPLE MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias) will
> > be expanded into mask_len_strided_store (ptr, stride, v, mask, len, bias).
> >
> > The below test suites are passed for this patch:
> > * The x86 bootstrap test.
> > * The x86 full regression test.
> > * The riscv full regression test.
>
> Sorry if we have discussed this last year already - is there anything wrong
> with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?
>
> Richard.
>
> > gcc/ChangeLog:
> >
> >         * doc/md.texi: Add description for mask_len_strided_load/store.
> >         * internal-fn.cc (strided_load_direct): New internal_fn define
> >         for strided_load_direct.
> >         (strided_store_direct): Ditto but for store.
> >         (expand_strided_load_optab_fn): New expand func for
> >         mask_len_strided_load.
> >         (expand_strided_store_optab_fn): Ditto but for store.
> >         (direct_strided_load_optab_supported_p): New define for load
> >         direct optab supported.
> >         (direct_strided_store_optab_supported_p): Ditto but for store.
> >         (internal_fn_len_index): Add len index for both load and store.
> >         (internal_fn_mask_index): Ditto but for mask index.
> >         (internal_fn_stored_value_index): Add stored index.
> >         * internal-fn.def (MASK_LEN_STRIDED_LOAD): New direct fn define
> >         for strided_load.
> >         (MASK_LEN_STRIDED_STORE): Ditto but for stride_store.
> >         * optabs.def (OPTAB_D): New optab define for load and store.
> >
> > Signed-off-by: Pan Li <pan2...@intel.com>
> > Co-Authored-By: Juzhe-Zhong <juzhe.zh...@rivai.ai>
> > ---
> >  gcc/doc/md.texi     | 27 ++++++++++++++++
> >  gcc/internal-fn.cc  | 75 +++++++++++++++++++++++++++++++++++++++++++++
> >  gcc/internal-fn.def |  6 ++++
> >  gcc/optabs.def      |  2 ++
> >  4 files changed, 110 insertions(+)
> >
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 5730bda80dc..3d242675c63 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5138,6 +5138,20 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
> >  be loaded from memory and clear if element @var{i} of the result should be undefined.
> >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> >
> > +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> > +@item @samp{mask_len_strided_load@var{m}}
> > +Load several separate memory locations into a destination vector of mode @var{m}.
> > +Operand 0 is a destination vector of mode @var{m}.
> > +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> > +Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
> > +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > +For each element index @var{i} the load address is operand 1 + @var{i} * operand 2.
> > +Similar to mask_len_load, the instruction loads at most (operand 4 + operand 5) elements from memory.
> > +Element @var{i} of the mask (operand 3) is set if element @var{i} of the result should
> > +be loaded from memory and clear if element @var{i} of the result should be zero.
> > +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> > +
> >  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
> >  @item @samp{scatter_store@var{m}@var{n}}
> >  Store a vector of mode @var{m} into several distinct memory locations.
> > @@ -5175,6 +5189,19 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
> >  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> >
> > +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> > +@item @samp{mask_len_strided_store@var{m}}
> > +Store a vector of mode @var{m} into several distinct memory locations.
> > +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> > +Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
> > +Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
> > +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> > +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> > +For each element index @var{i} the store address is operand 0 + @var{i} * operand 1.
> > +Similar to mask_len_store, the instruction stores at most (operand 4 + operand 5) elements of (operand 2) to memory.
> > +Element @var{i} of the mask (operand 3) is set if element @var{i} of (operand 2) should be stored.
> > +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> > +
> >  @cindex @code{vec_set@var{m}} instruction pattern
> >  @item @samp{vec_set@var{m}}
> >  Set given field in the vector value.  Operand 0 is the vector to modify,
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 9c09026793f..f6e5329cd84 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -159,6 +159,7 @@ init_internal_fns ()
> >  #define load_lanes_direct { -1, -1, false }
> >  #define mask_load_lanes_direct { -1, -1, false }
> >  #define gather_load_direct { 3, 1, false }
> > +#define strided_load_direct { -1, -1, false }
> >  #define len_load_direct { -1, -1, false }
> >  #define mask_len_load_direct { -1, 4, false }
> >  #define mask_store_direct { 3, 2, false }
> > @@ -168,6 +169,7 @@ init_internal_fns ()
> >  #define vec_cond_mask_len_direct { 1, 1, false }
> >  #define vec_cond_direct { 2, 0, false }
> >  #define scatter_store_direct { 3, 1, false }
> > +#define strided_store_direct { 1, 1, false }
> >  #define len_store_direct { 3, 3, false }
> >  #define mask_len_store_direct { 4, 5, false }
> >  #define vec_set_direct { 3, 3, false }
> > @@ -3668,6 +3670,68 @@ expand_gather_load_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
> >      emit_move_insn (lhs_rtx, ops[0].value);
> >  }
> >
> > +/* Expand MASK_LEN_STRIDED_LOAD call CALL by optab OPTAB.  */
> > +
> > +static void
> > +expand_strided_load_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> > +                             direct_optab optab)
> > +{
> > +  tree lhs = gimple_call_lhs (stmt);
> > +  tree base = gimple_call_arg (stmt, 0);
> > +  tree stride = gimple_call_arg (stmt, 1);
> > +
> > +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> > +  rtx base_rtx = expand_normal (base);
> > +  rtx stride_rtx = expand_normal (stride);
> > +
> > +  unsigned i = 0;
> > +  class expand_operand ops[6];
> > +  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> > +
> > +  create_output_operand (&ops[i++], lhs_rtx, mode);
> > +  create_address_operand (&ops[i++], base_rtx);
> > +  create_address_operand (&ops[i++], stride_rtx);
> > +
> > +  insn_code icode = direct_optab_handler (optab, mode);
> > +
> > +  i = add_mask_and_len_args (ops, i, stmt);
> > +  expand_insn (icode, i, ops);
> > +
> > +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> > +    emit_move_insn (lhs_rtx, ops[0].value);
> > +}
> > +
> > +/* Expand MASK_LEN_STRIDED_STORE call CALL by optab OPTAB.  */
> > +
> > +static void
> > +expand_strided_store_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> > +                              direct_optab optab)
> > +{
> > +  internal_fn fn = gimple_call_internal_fn (stmt);
> > +  int rhs_index = internal_fn_stored_value_index (fn);
> > +
> > +  tree base = gimple_call_arg (stmt, 0);
> > +  tree stride = gimple_call_arg (stmt, 1);
> > +  tree rhs = gimple_call_arg (stmt, rhs_index);
> > +
> > +  rtx base_rtx = expand_normal (base);
> > +  rtx stride_rtx = expand_normal (stride);
> > +  rtx rhs_rtx = expand_normal (rhs);
> > +
> > +  unsigned i = 0;
> > +  class expand_operand ops[6];
> > +  machine_mode mode = TYPE_MODE (TREE_TYPE (rhs));
> > +
> > +  create_address_operand (&ops[i++], base_rtx);
> > +  create_address_operand (&ops[i++], stride_rtx);
> > +  create_input_operand (&ops[i++], rhs_rtx, mode);
> > +
> > +  insn_code icode = direct_optab_handler (optab, mode);
> > +  i = add_mask_and_len_args (ops, i, stmt);
> > +
> > +  expand_insn (icode, i, ops);
> > +}
> > +
> >  /* Helper for expand_DIVMOD.  Return true if the sequence starting with
> >     INSN contains any call insns or insns with {,U}{DIV,MOD} rtxes.  */
> >
> > @@ -4058,6 +4122,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
> >  #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
> >  #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
> >  #define direct_gather_load_optab_supported_p convert_optab_supported_p
> > +#define direct_strided_load_optab_supported_p direct_optab_supported_p
> >  #define direct_len_load_optab_supported_p direct_optab_supported_p
> >  #define direct_mask_len_load_optab_supported_p convert_optab_supported_p
> >  #define direct_mask_store_optab_supported_p convert_optab_supported_p
> > @@ -4066,6 +4131,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
> >  #define direct_vec_cond_mask_optab_supported_p convert_optab_supported_p
> >  #define direct_vec_cond_optab_supported_p convert_optab_supported_p
> >  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
> > +#define direct_strided_store_optab_supported_p direct_optab_supported_p
> >  #define direct_len_store_optab_supported_p direct_optab_supported_p
> >  #define direct_mask_len_store_optab_supported_p convert_optab_supported_p
> >  #define direct_while_optab_supported_p convert_optab_supported_p
> > @@ -4723,6 +4789,8 @@ internal_fn_len_index (internal_fn fn)
> >      case IFN_COND_LEN_XOR:
> >      case IFN_COND_LEN_SHL:
> >      case IFN_COND_LEN_SHR:
> > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > +    case IFN_MASK_LEN_STRIDED_STORE:
> >        return 4;
> >
> >      case IFN_COND_LEN_NEG:
> > @@ -4817,6 +4885,10 @@ internal_fn_mask_index (internal_fn fn)
> >      case IFN_MASK_LEN_STORE:
> >        return 2;
> >
> > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > +    case IFN_MASK_LEN_STRIDED_STORE:
> > +      return 3;
> > +
> >      case IFN_MASK_GATHER_LOAD:
> >      case IFN_MASK_SCATTER_STORE:
> >      case IFN_MASK_LEN_GATHER_LOAD:
> > @@ -4840,6 +4912,9 @@ internal_fn_stored_value_index (internal_fn fn)
> >  {
> >    switch (fn)
> >      {
> > +    case IFN_MASK_LEN_STRIDED_STORE:
> > +      return 2;
> > +
> >      case IFN_MASK_STORE:
> >      case IFN_MASK_STORE_LANES:
> >      case IFN_SCATTER_STORE:
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index 25badbb86e5..b30a7a5b009 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -56,6 +56,7 @@ along with GCC; see the file COPYING3.  If not see
> >     - mask_load_lanes: currently just vec_mask_load_lanes
> >     - mask_len_load_lanes: currently just vec_mask_len_load_lanes
> >     - gather_load: used for {mask_,mask_len_,}gather_load
> > +   - strided_load: currently just mask_len_strided_load
> >     - len_load: currently just len_load
> >     - mask_len_load: currently just mask_len_load
> >
> > @@ -64,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
> >     - mask_store_lanes: currently just vec_mask_store_lanes
> >     - mask_len_store_lanes: currently just vec_mask_len_store_lanes
> >     - scatter_store: used for {mask_,mask_len_,}scatter_store
> > +   - strided_store: currently just mask_len_strided_store
> >     - len_store: currently just len_store
> >     - mask_len_store: currently just mask_len_store
> >
> > @@ -212,6 +214,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> >                        mask_gather_load, gather_load)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
> >                        mask_len_gather_load, gather_load)
> > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> > +                      mask_len_strided_load, strided_load)
> >
> >  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> > @@ -221,6 +225,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> >                        mask_scatter_store, scatter_store)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
> >                        mask_len_scatter_store, scatter_store)
> > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> > +                      mask_len_strided_store, strided_store)
> >
> >  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
> >  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes,
> >                        store_lanes)
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 3f2cb46aff8..630b1de8f97 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -539,4 +539,6 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> >  OPTAB_D (len_load_optab, "len_load_$a")
> >  OPTAB_D (len_store_optab, "len_store_$a")
> > +OPTAB_D (mask_len_strided_load_optab, "mask_len_strided_load_$a")
> > +OPTAB_D (mask_len_strided_store_optab, "mask_len_strided_store_$a")
> >  OPTAB_D (select_vl_optab, "select_vl$a")
> > --
> > 2.34.1
> >
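[Editor's note: the element-wise semantics that the patch documents for the two IFNs can be captured in a small scalar reference model. The function and parameter names below are illustrative only, not GCC internals: an element is active when its mask bit is set and its index is below len + bias; inactive load elements are zeroed, and inactive store elements leave memory untouched.]

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference model of
   v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias):
   out[i] = ptr[i * stride] when mask[i] is set and i < len + bias,
   otherwise 0 (inactive result elements are zeroed, as for
   mask_len_load).  VF is the number of vector elements.  */
static void
ref_mask_len_strided_load (int *out, const int *ptr, ptrdiff_t stride,
                           const unsigned char *mask, int len, int bias,
                           int vf)
{
  for (int i = 0; i < vf; i++)
    out[i] = (i < len + bias && mask[i]) ? ptr[i * stride] : 0;
}

/* Scalar reference model of
   MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias):
   ptr[i * stride] = v[i] when mask[i] is set and i < len + bias;
   all other memory locations are left unchanged.  */
static void
ref_mask_len_strided_store (int *ptr, ptrdiff_t stride, const int *v,
                            const unsigned char *mask, int len, int bias,
                            int vf)
{
  for (int i = 0; i < vf; i++)
    if (i < len + bias && mask[i])
      ptr[i * stride] = v[i];
}
```

A load followed by a store through these models with an all-ones mask reproduces one vector iteration of the `a[i*stride] = b[i*stride] + 100` loop from the original example.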