Re: [RFC] ldist: Recognize rawmemchr loop patterns

Richard Biener via Gcc-patches Mon, 31 Jan 2022 07:01:12 -0800

On Mon, Jan 31, 2022 at 2:16 PM Tom de Vries <tdevr...@suse.de> wrote:
>
> On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:
> > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> > <stefa...@linux.ibm.com> wrote:
> >>
> >> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> >>> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> >>> <stefa...@linux.ibm.com> wrote:
> >>>>
> >>>> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> >>>> [...]
> >>>>>>>
> >>>>>>> +  /* Handle strlen like loops.  */
> >>>>>>> +  if (store_dr == NULL
> >>>>>>> +      && integer_zerop (pattern)
> >>>>>>> +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> >>>>>>> +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> >>>>>>> +      && integer_onep (reduction_iv.step)
> >>>>>>> +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> >>>>>>> +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> >>>>>>> +    {
> >>>>>>>
> >>>>>>> I wonder what goes wrong with a larger or smaller wrapping IV type?
> >>>>>>> The iteration only stops when you load a NUL and the increments just
> >>>>>>> wrap along (you're using the pointer IVs to compute the strlen result).
> >>>>>>> Can't you simply truncate?
> >>>>>>
> >>>>>> I think truncation is enough as long as no overflow occurs in strlen or
> >>>>>> strlen_using_rawmemchr.
> >>>>>>
> >>>>>>> For larger than size_type_node (actually larger than ptr_type_node
> >>>>>>> would matter I guess), the argument is that since pointer wrapping
> >>>>>>> would be undefined anyway the IV cannot wrap either.  Now, the correct
> >>>>>>> check here would IMHO be
> >>>>>>>
> >>>>>>>        TYPE_PRECISION (TREE_TYPE (reduction_var))
> >>>>>>>          < TYPE_PRECISION (ptr_type_node)
> >>>>>>>         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> >>>>>>>
> >>>>>>> ?
> >>>>>>
> >>>>>> Regarding the implementation which makes use of rawmemchr:
> >>>>>>
> >>>>>> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> >>>>>> the maximal length we can determine of a string where each character
> >>>>>> has size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> >>>>>> ptrdiff type is undefined we have to make sure that if an overflow
> >>>>>> occurs, then an overflow occurs for reduction variable, too, and that
> >>>>>> this is undefined, too.  However, I'm not sure anymore whether we want
> >>>>>> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> >>>>>> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> >>>>>> this would mean that a single string consumes more than half of the
> >>>>>> virtual addressable memory.  At least for architectures where
> >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> >>>>>> to neglect the case where computing pointer difference may overflow.
> >>>>>> Otherwise we are talking about strings with lengths of multiple
> >>>>>> pebibytes.  For other architectures we might have to be more precise
> >>>>>> and make sure that reduction variable overflows first and that this is
> >>>>>> undefined.
> >>>>>>
> >>>>>> Thus a conservative condition would be the following (I assumed that the
> >>>>>> size of any integral type is a power of two, but I'm not sure this really
> >>>>>> holds; IIRC the C standard requires only that the alignment is a power of
> >>>>>> two, not necessarily the size, so I might need to change this):
> >>>>>>
> >>>>>> /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1
> >>>>>>    - log2 (sizeof (load_type))) or in other words return true if
> >>>>>>    reduction variable overflows first and false otherwise.  */
> >>>>>>
> >>>>>> static bool
> >>>>>> reduction_var_overflows_first (tree reduction_var, tree load_type)
> >>>>>> {
> >>>>>>    unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >>>>>>    unsigned precision_reduction_var
> >>>>>>      = TYPE_PRECISION (TREE_TYPE (reduction_var));
> >>>>>>    unsigned size_exponent
> >>>>>>      = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> >>>>>>    return wi::ltu_p (precision_reduction_var,
> >>>>>>                      precision_ptrdiff - 1 - size_exponent);
> >>>>>> }
> >>>>>>
> >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64
> >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>>>      && reduction_var_overflows_first (reduction_var, load_type))
> >>>>>>
> >>>>>> Regarding the implementation which makes use of strlen:
> >>>>>>
> >>>>>> I'm not sure what it means if strlen is called for a string with a
> >>>>>> length greater than SIZE_MAX.  Therefore, similar to the implementation
> >>>>>> using rawmemchr where we neglect the case of an overflow for 64bit
> >>>>>> architectures, a conservative condition would be:
> >>>>>>
> >>>>>> TYPE_PRECISION (size_type_node) == 64
> >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>>>      && TYPE_PRECISION (TREE_TYPE (reduction_var))
> >>>>>>         <= TYPE_PRECISION (size_type_node))
> >>>>>>
> >>>>>> I still included the overflow undefined check for reduction variable in
> >>>>>> order to rule out situations where the reduction variable is unsigned
> >>>>>> and wraps around so many times that strlen(,_using_rawmemchr) overflows,
> >>>>>> too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> >>>>>> architectures.  Anyhow, while writing this down it becomes clear that
> >>>>>> this deserves a comment which I will add once it becomes clear which
> >>>>>> way to go.
> >>>>>
> >>>>> I think all the arguments about objects bigger than half of the
> >>>>> address-space also are valid for 32bit targets and thus 32bit
> >>>>> size_type_node (or 32bit pointer size).
> >>>>> I'm not actually sure what's the canonical type to check against,
> >>>>> whether it's size_type_node (C's size_t), ptr_type_node (C's void *) or
> >>>>> sizetype (the middle-end "offset" type used for all address
> >>>>> computations).  For weird reasons I'd lean towards 'sizetype' (for
> >>>>> example some embedded targets have 24bit pointers but 16bit 'sizetype').
> >>>>
> >>>> Ok, for the strlen implementation I changed from size_type_node to
> >>>> sizetype and assume that no overflow occurs for string objects bigger
> >>>> than half of the address space for 32-bit targets and up:
> >>>>
> >>>>    (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> >>>>     && TYPE_PRECISION (ptr_type_node) >= 32)
> >>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>        && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (sizetype))
> >>>>
> >>>> and similarly for the rawmemchr implementation:
> >>>>
> >>>>    (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> >>>>     && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> >>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>        && reduction_var_overflows_first (reduction_var, load_type))
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>> +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> >>>>>>> +       {
> >>>>>>> +         const char *msg = G_("assuming signed overflow does not occur "
> >>>>>>> +                              "when optimizing strlen like loop");
> >>>>>>> +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> >>>>>>> +       }
> >>>>>>>
> >>>>>>> no, please don't add any new strict-overflow warnings ;)
> >>>>>>
> >>>>>> I just stumbled over code which produces such a warning and thought
> >>>>>> this is a hard requirement :D The new patch doesn't contain it anymore.
> >>>>>>
> >>>>>>>
> >>>>>>> The generate_*_builtin routines need some factoring - if you
> >>>>>>> code-generate into a gimple_seq you could use gimple_build () which
> >>>>>>> would do the fold_stmt (not sure why you do that - you should see to
> >>>>>>> fold the call, not necessarily the rest).  The replacement of
> >>>>>>> reduction_var and the dumping could be shared.
> >>>>>>> There's also GET_MODE_NAME for the printing.
> >>>>>>
> >>>>>> I wasn't really sure which way to go.  Use a gsi, as it is done by
> >>>>>> existing generate_* functions, or make use of gimple_seq.  Since the
> >>>>>> latter uses internally also gsi I thought it is better to stick to gsi
> >>>>>> in the first place.  Now, after changing to gimple_seq I see the beauty
> >>>>>> of it :)
> >>>>>>
> >>>>>> I created two helper functions generate_strlen_builtin_1 and
> >>>>>> generate_reduction_builtin_1 in order to reduce code duplication.
> >>>>>>
> >>>>>> In function generate_strlen_builtin I changed from using
> >>>>>> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> >>>>>> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> >>>>>> sure whether my intuition about the difference between implicit and
> >>>>>> explicit builtins is correct.  In builtins.def there is a small example
> >>>>>> given which I would paraphrase as "use builtin_decl_explicit if the
> >>>>>> semantics of the builtin is defined by the C standard; otherwise use
> >>>>>> builtin_decl_implicit" but probably my intuition is wrong?
> >>>>>>
> >>>>>> Beside that I'm not sure whether I really have to call
> >>>>>> build_fold_addr_expr which looks superfluous to me since
> >>>>>> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> >>>>>>
> >>>>>> tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> >>>>>> gimple *fn_call = gimple_build_call (fn, 1, mem);
> >>>>>>
> >>>>>> However, since it is also used that way in the context of
> >>>>>> generate_memset_builtin I didn't remove it so far.
> >>>>>>
> >>>>>>> I think overall the approach is sound now but the details still need 
> >>>>>>> work.
> >>>>>>
> >>>>>> Once again thank you very much for your review.  Really appreciated!
> >>>>>
> >>>>> The patch lacks a changelog entry / description.  It's nice if patches
> >>>>> sent out for review are basically the rev as git format-patch produces.
> >>>>>
> >>>>> The rawmemchr optab needs documenting in md.texi
> >>>>
> >>>> While writing the documentation in md.texi I realised that other
> >>>> instructions expect an address to be a memory operand which is not the
> >>>> case for rawmemchr currently. At the moment the address is either an
> >>>> SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> >>>> consequence in the backend define_expand rawmemchr<mode> expects a
> >>>> register operand and not a memory operand. Would it make sense to build
> >>>> a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> >>>> MEM_REF is supposed to be the canonical form here.
> >>>
> >>> I suppose the expander could use code similar to what
> >>> expand_builtin_memset_args does, using get_memory_rtx.  I suppose that
> >>> we're using MEM operands because those can convey things like alias info
> >>> or alignment info, something which REG operands cannot (easily).  I
> >>> wouldn't build a MEM_REF and try to expand that.
> >>
> >> The new patch contains the following changes:
> >>
> >> - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
> >>    change linkage of get_memory_rtx to extern.
> >>
> >> - In function generate_strlen_builtin_using_rawmemchr I'm not
> >>    reconstructing the load type anymore from the base pointer but rather
> >>    pass it as a parameter from function transform_reduction_loop where we
> >>    also ensured that it is of integral type.  Reconstructing the load
> >>    type was error prone since e.g. I didn't distinguish between
> >>    pointer_plus_expr or addr_expr.  Thus passing the load type should be
> >>    more solid.
> >>
> >> Regtested on IBM Z and x86.  Ok for mainline?
> >
> > OK, and sorry for all the repeated delays.
> >
>
> I'm running into PR56888 (
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 ) on nvptx due to
> this, f.i. in gcc/testsuite/gcc.c-torture/execute/builtins/strlen.c,
> where gcc/testsuite/gcc.c-torture/execute/builtins/lib/strlen.c contains
> a strlen function, with a strlen loop, which is transformed by
> pass_loop_distribution into a __builtin_strlen, which is then expanded
> into a strlen call, creating a self-recursive function. [ And on nvptx,
> that happens to result in a compilation failure, which is how I found
> this. ]
>
> According to this (
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888#c21 ) comment:
> ...
> -fno-tree-loop-distribute-patterns is the reliable way to not
> transform loops into library calls.
> ...
>
> Then should we have something along the lines of:
> ...
> $ git diff
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 6fe59cd56855..9a211d30cd7e 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -3683,7 +3683,11 @@ loop_distribution::transform_reduction_loop
>                 && TYPE_PRECISION (ptr_type_node) >= 32)
>                || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)
>                    && TYPE_PRECISION (reduction_var_type) <= TYPE_PRECISION (sizetype)))
> -         && builtin_decl_implicit (BUILT_IN_STRLEN))
> +         && builtin_decl_implicit (BUILT_IN_STRLEN)
> +         && flag_tree_loop_distribute_patterns)
>          generate_strlen_builtin (loop, reduction_var, load_iv.base,
>                                   reduction_iv.base, loc);
>         else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
> ...
> ?
>
> Or is the comment no longer valid?


It is still valid - and yes, I think we need to guard it with this flag
but please do it in the caller to transform_reduction_loop.
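
For code that defines its own strlen-like function and hits the same
self-recursion, a per-function workaround can be sketched as below.  This is
only an illustration, not part of the patch: it assumes GCC's optimize
function attribute is given the same option named in the PR56888 comment,
and the names NO_LOOP_TO_LIBCALL and my_strlen are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper macro (not from GCC or the patch): ask GCC not to
   turn loops in the annotated function into library calls.  Other
   compilers get an empty definition.  */
#if defined (__GNUC__) && !defined (__clang__)
# define NO_LOOP_TO_LIBCALL \
  __attribute__ ((__optimize__ ("-fno-tree-loop-distribute-patterns")))
#else
# define NO_LOOP_TO_LIBCALL
#endif

/* A strlen-like loop of the shape discussed in this thread: no store, a
   comparison against zero, and a reduction variable with step one.
   Without the attribute, loop distribution may replace the loop with a
   call to strlen.  */
NO_LOOP_TO_LIBCALL
size_t
my_strlen (const char *s)
{
  size_t n = 0;
  while (s[n] != '\0')
    ++n;
  return n;
}
```

Passing -fno-tree-loop-distribute-patterns on the command line has the same
effect for the whole translation unit.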

>
> Thanks,
> - Tom
