On Thu, Feb 11, 2021 at 11:19 AM Alexandre Oliva <ol...@adacore.com> wrote:
>
> On Feb 4, 2021, Alexandre Oliva <ol...@adacore.com> wrote:
>
> > On Feb 4, 2021, Richard Biener <richard.guent...@gmail.com> wrote:
> >>> > b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>>
> >>> Why would that be better than keeping the constant-length memset call,
> >>> that would be turned into an unrolled loop during expand?
>
> >> Well, because of the possibly lost ctz and alignment info.
>
> > Funny you should mention that.  I got started with the expand-time
> > expansion yesterday, and found out that we're not using the alignment
> > information that is available.  Though the pointer is known to point to
> > an aligned object, we are going for 8-bit alignment for some reason.
>
> > The strategy I used there was to first check whether by_pieces would
> > expand inline a constant length near the max known length, then loop
> > over the bits in the variable length, expand in each iteration a
> > constant-length store-by-pieces for the fixed length corresponding to
> > that bit, and a test comparing the variable length with the fixed length
> > guarding the expansion of the store-by-pieces.  We may get larger code
> > this way (no loops), but only O(log(len)) compares.
>
> > I've also fixed some bugs in the ldist expander, so now it bootstraps,
> > but with a few regressions in the testsuite, that I'm yet to look into.
>
> A few more fixes later, this seems to work.
>
> It encompasses the patch to recover tree_ctz from a mult_expr def, it
> adds code to set up the alignment data for the ldist-generated memset
> dest, and then it expands memset as described above if length is not
> constant, setmem is not available, but the by-pieces machinery would
> still be used for nearby constants.
>
> How does this look?
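For illustration, a minimal C sketch of the code shape the quoted
strategy aims for -- assuming, hypothetically, a length known to be at
most 8 and a multiple of 4 (ctz_len == 2); the function name and
constants are made up for the example:

  #include <string.h>

  /* Sketch of the expansion of memset (p, 0, len), len <= 8,
     len % 4 == 0: one guarded constant-size store per length bit,
     most to least significant, so O(log(len)) compares and no loop.
     Each constant-size memset stands for an inline by-pieces store.  */
  void
  clear_up_to_8_by_4 (char *p, unsigned long len)
  {
    if (len >= 8)          /* test bit 3 of len */
      {
        memset (p, 0, 8);
        p += 8;
        len -= 8;
      }
    if (len >= 4)          /* test bit 2; bits 1 and 0 are known zero */
      memset (p, 0, 4);
  }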
Overall it looks good - I can't really review the RTL code generation
parts because I'm not too familiar with the usual pitfalls there.
Some comments below.
>
> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva <ol...@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also propagates the alignment of the memset dest, introduced by
> loop-distribution, to the SSA pointer used to hold it, so that it is
> not discarded right away, and may be recovered by the memset builtin
> expander.
>
> Finally, it computes the ctz of the length, and uses it to omit blocks
> for stores of small byte counts when we're storing whole words.
> tree_ctz is improved so as to look at the length DEF stmt, and zero
> out the least-significant bits if it's a multiply by a power of two.
>
>
> for  gcc/ChangeLog
>
> 	PR tree-optimization/94092
> 	* builtins.c (try_store_by_multiple_pieces): New.
> 	(expand_builtin_memset_args): Use it.  If target_char_cast
> 	fails, proceed as for non-constant val.  Pass len's ctz to...
> 	* expr.c (clear_storage_hints): ... this.  Try store by
> 	multiple pieces after setmem.
> 	(clear_storage): Adjust.
> 	* expr.h (clear_storage_hints): Likewise.
> 	(try_store_by_multiple_pieces): Declare.
> 	* tree-loop-distribution.c: Include builtins.h.
> 	(generate_memset_builtin): Propagate dst_base alignment to mem.
> 	* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
> 	integral types, when a MULT_EXPR INTEGER_CST operand ensures
> 	the result will be a multiple of a power of two.
> ---
>  gcc/builtins.c               |  113 +++++++++++++++++++++++++++++++++++++++---
>  gcc/expr.c                   |    9 +++
>  gcc/expr.h                   |   12 ++++
>  gcc/tree-loop-distribution.c |    9 +++
>  gcc/tree-ssanames.c          |   23 ++++++++-
>  5 files changed, 154 insertions(+), 12 deletions(-)
>
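For context, a made-up example of the kind of loop the description
refers to, assuming an LP64 target: loop distribution turns the loop
below into a memset whose length is n * 8, and the improved tree_ctz
can then prove the length's low three bits are zero.

  /* ldist rewrites this loop as memset (p, 0, n * sizeof *p); the
     length operand is a variable multiple of 8, which the patch now
     exploits instead of discarding.  */
  void
  zero_words (long *p, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i++)
      p[i] = 0;
  }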
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..44f3af92536a5 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>    return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
> +   1<<CTZ_LEN, and between MIN_LEN and MAX_LEN.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   from most to least significant, guarded by a test on whether there
> +   are at least that many bytes to copy.  */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +			      unsigned HOST_WIDE_INT min_len,
> +			      unsigned HOST_WIDE_INT max_len,
> +			      rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +
> +  if (val)
> +    valc = 1;
> +
> +  if (!can_store_by_pieces ((HOST_WIDE_INT_1U << max_bits) * 2
> +			    - (HOST_WIDE_INT_1U << ctz_len),
> +			    builtin_memset_read_str, &valc, align, true))
> +    return false;
> +
> +  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
> +  void *constfundata;
> +  if (val)
> +    {
> +      constfun = builtin_memset_gen_str;
> +      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
> +				      val);
> +    }
> +  else
> +    {
> +      constfun = builtin_memset_read_str;
> +      constfundata = &valc;
> +    }
> +
> +  /* Bits above SHR_BITS don't need to be tested.  */
> +  int shr_bits = (max_bits != min_bits
> +		  ? max_bits
> +		  : floor_log2 (max_len ^ min_len));
> +
> +  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
> +  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
> +  set_mem_align (to, align);
> +
> +  for (int i = max_bits; i >= ctz_len; i--)
> +    {
> +      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
> +      rtx_code_label *label = NULL;
> +
> +      if (i <= shr_bits)
> +	{
> +	  label = gen_label_rtx ();
> +	  emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
> +				   ptr_mode, 1, label,
> +				   profile_probability::even ());
> +	}
> +      else if ((max_len & blksize) == 0)
> +	continue;
> +
> +      if (align > blksize * BITS_PER_UNIT)
> +	{
> +	  align = blksize * BITS_PER_UNIT;
> +	  set_mem_align (to, align);
> +	}
> +
> +      to = store_by_pieces (to, blksize,
> +			    constfun, constfundata,
> +			    align, true, RETURN_END);
> +
> +      if (label)
> +	{
> +	  emit_move_insn (ptr, XEXP (to, 0));
> +	  XEXP (to, 0) = ptr;
> +
> +	  clear_mem_offset (to);
> +	  clear_mem_size (to);
> +
> +	  emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +
> +	  emit_label (label);
> +	}
> +    }
> +
> +  return true;
> +}
> +
>  /* Helper function to do the actual work for expand_builtin_memset.  The
>     arguments to the builtin_memset call DEST, VAL, and LEN are broken out
>     so that this can also be called without constructing an actual CALL_EXPR.
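As a reading aid, the loop above without the RTL plumbing -- an
illustrative C model only, not the patch itself.  Since the remaining
length is less than twice the current block size at each step, testing
rem >= blksize is equivalent to testing that bit of the length:

  #include <string.h>

  /* Walk the length bits from MAX_BITS down to CTZ_LEN; each set bit
     yields one power-of-two-sized store.  The expander emits this
     chain fully unrolled, and omits the compare where MIN_LEN and
     MAX_LEN already determine the bit.  */
  void
  store_pieces_model (char *to, unsigned long rem, int valc,
                      int max_bits, int ctz_len)
  {
    for (int i = max_bits; i >= ctz_len; i--)
      {
        unsigned long blksize = 1UL << i;
        if (rem >= blksize)
          {
            memset (to, valc, blksize);  /* inline store-by-pieces */
            to += blksize;
            rem -= blksize;
          }
      }
  }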
> @@ -6654,7 +6745,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>    dest_mem = get_memory_rtx (dest, len);
>    val_mode = TYPE_MODE (unsigned_char_type_node);
>
> -  if (TREE_CODE (val) != INTEGER_CST)
> +  if (TREE_CODE (val) != INTEGER_CST
> +      || target_char_cast (val, &c))
>      {
>        rtx val_rtx;
>
> @@ -6678,7 +6770,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
>  					dest_align, expected_align,
>  					expected_size, min_size, max_size,
> -					probable_max_size))
> +					probable_max_size)
> +	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +						 tree_ctz (len),
> +						 min_size, max_size,
> +						 val_rtx, 0,
> +						 dest_align))
>  	goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6686,9 +6783,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        return dest_mem;
>      }
>
> -  if (target_char_cast (val, &c))
> -    goto do_libcall;
> -
>    if (c)
>      {
>        if (tree_fits_uhwi_p (len)
> @@ -6702,7 +6796,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>  				     gen_int_mode (c, val_mode),
>  				     dest_align, expected_align,
>  				     expected_size, min_size, max_size,
> -				     probable_max_size))
> +				     probable_max_size)
> +	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +						 tree_ctz (len),
> +						 min_size, max_size,
> +						 NULL_RTX, c,
> +						 dest_align))
>  	goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6716,7 +6815,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>  			   ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
>  			   expected_align, expected_size,
>  			   min_size, max_size,
> -			   probable_max_size);
> +			   probable_max_size, tree_ctz (len));
>
>    if (dest_addr == 0)
>      {
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 04ef5ad114d06..b212e9ac575a7 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>  		     unsigned int expected_align, HOST_WIDE_INT expected_size,
>  		     unsigned HOST_WIDE_INT min_size,
>  		     unsigned HOST_WIDE_INT max_size,
> -		     unsigned HOST_WIDE_INT probable_max_size)
> +		     unsigned HOST_WIDE_INT probable_max_size,
> +		     unsigned ctz_size)
>  {
>    machine_mode mode = GET_MODE (object);
>    unsigned int align;
> @@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>  			       expected_align, expected_size,
>  			       min_size, max_size, probable_max_size))
>      ;
> +  else if (try_store_by_multiple_pieces (object, size, ctz_size,
> +					 min_size, max_size,
> +					 NULL_RTX, 0, align))
> +    ;
>    else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
>      return set_storage_via_libcall (object, size, const0_rtx,
>  				    method == BLOCK_OP_TAILCALL);
> @@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
>      min = max = UINTVAL (size);
>    else
>      max = GET_MODE_MASK (GET_MODE (size));
> -  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
> +  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
>  }
>
>
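On the new CTZ_SIZE parameter's contract: it asserts that the variable
size is a multiple of 1 << ctz_size, letting try_store_by_multiple_pieces
omit the small tail blocks; passing 0, as clear_storage does above,
promises nothing.  A hypothetical caller inside GCC (assuming the
patched expr.h) might look like:

  /* Clear a whole number of 64-bit words; ctz_size == 3 forwards the
     fact that SIZE's low three bits are zero.  */
  static rtx
  clear_word_multiple (rtx mem, rtx size, unsigned HOST_WIDE_INT max)
  {
    return clear_storage_hints (mem, size, BLOCK_OP_NORMAL,
                                /* expected_align */ 0,
                                /* expected_size */ -1,
                                /* min_size */ 0,
                                /* max_size */ max,
                                /* probable_max_size */ max,
                                /* ctz_size */ 3);
  }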
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 1f0177a4cfa5d..6ff2384df63f8 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
>  				unsigned int, HOST_WIDE_INT,
>  				unsigned HOST_WIDE_INT,
>  				unsigned HOST_WIDE_INT,
> -				unsigned HOST_WIDE_INT);
> +				unsigned HOST_WIDE_INT,
> +				unsigned);
>  /* The same, but always output an library call.  */
>  extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
>
> @@ -224,6 +225,15 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
>  extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
>  			    void *, unsigned int, bool, memop_ret);
>
> +/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
> +   store_by_pieces within conditionals so as to handle variable LEN
> +   efficiently, storing VAL, if non-NULL_RTX, or VALC instead.  */
> +extern bool try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +					  unsigned HOST_WIDE_INT min_len,
> +					  unsigned HOST_WIDE_INT max_len,
> +					  rtx val, char valc,
> +					  unsigned int align);
> +
>  /* Emit insns to set X from Y.  */
>  extern rtx_insn *emit_move_insn (rtx, rtx);
>  extern rtx_insn *gen_move_insn (rtx, rtx);
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..a64b89037a724 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1155,6 +1156,14 @@ generate_memset_builtin (class loop *loop, partition *partition)
>    mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
>  				  false, GSI_CONTINUE_LINKING);
>
> +  if (TREE_CODE (mem) == SSA_NAME)
> +    if (ptr_info_def *pi = get_ptr_info (mem))
> +      {
> +	unsigned al = get_pointer_alignment (builtin->dst_base);
> +	if (al > pi->align || pi->misalign)

We still might prefer pi->align == 64 and pi->misalign == 32 over
al == 16, so maybe factor that in, too.  Also what's still lost is the
alignment constraint the original access has (and thus the base pointer
must have as well).  Maybe store this info in loop distribution's
builtin_info and compute it from the data reference, for example via
get_object_alignment (DR_REF (dr)).

> +	  set_ptr_info_alignment (pi, al, 0);
> +      }
> +
>    /* This exactly matches the pattern recognition in classify_partition.  */
>    val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
>    /* Handle constants like 0x15151515 and similarly
> diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
> index 51a26d2fce1c2..c4b5bf2a4999a 100644
> --- a/gcc/tree-ssanames.c
> +++ b/gcc/tree-ssanames.c
> @@ -546,10 +546,29 @@ get_nonzero_bits (const_tree name)
>      }
>
>    range_info_def *ri = SSA_NAME_RANGE_INFO (name);
> +  wide_int ret;
>    if (!ri)
> -    return wi::shwi (-1, precision);
> +    ret = wi::shwi (-1, precision);
> +  else
> +    ret = ri->get_nonzero_bits ();
> +
> +  /* If NAME is defined as a multiple of a constant C, we know the ctz(C)
> +     low bits are zero.  ??? Should we handle LSHIFT_EXPR too?
> +     Non-constants, e.g. the minimum shift count, and ctz from both
> +     MULT_EXPR operands?  That could make for deep recursion.  */
> +  if (INTEGRAL_TYPE_P (TREE_TYPE (name))
> +      && SSA_NAME_DEF_STMT (name)
> +      && is_gimple_assign (SSA_NAME_DEF_STMT (name))
> +      && gimple_assign_rhs_code (SSA_NAME_DEF_STMT (name)) == MULT_EXPR
> +      && TREE_CODE (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name))) == INTEGER_CST)
> +    {
> +      unsigned HOST_WIDE_INT bits
> +	= tree_ctz (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name)));
> +      wide_int mask = wi::shwi (-1, precision) << bits;
> +      ret &= mask;
> +    }
>
> -  return ri->get_nonzero_bits ();
> +  return ret;
>  }
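The masking logic above, transcribed to plain integers as a stand-alone
sanity check -- hypothetical case: a length defined as n * 8, where
tree_ctz of the constant operand is 3:

  #include <stdio.h>

  int
  main (void)
  {
    unsigned bits = 3;                   /* tree_ctz (8) */
    unsigned long long nonzero = ~0ULL;  /* no range info: all ones */
    nonzero &= ~0ULL << bits;            /* ret &= mask; clears low bits */
    printf ("%#llx\n", nonzero);         /* 0xfffffffffffffff8 */
    return 0;
  }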
So I wonder whether we should instead re-run CCP after loop opts, which
computes nonzero bits as well, instead of the above "hack".

Would nonzero bits be ready to be computed in the above way at the
point loop distribution runs?  That is, you could do set_nonzero_bits
just like you did set_ptr_info_alignment ...

Since CCP also performs copy propagation, an obvious candidate would be
to replace the last pass_copy_prop with pass_ccp (with a comment noting
it is there to compute up-to-date nonzero bits info).

Thanks,
Richard.

> /* Return TRUE is OP, an SSA_NAME has a range of values [0..1], false
>
>
> --
> Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
>    Free Software Activist                       GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar