On Mon, Jan 31, 2022 at 2:16 PM Tom de Vries <tdevr...@suse.de> wrote: > > On 9/17/21 10:08, Richard Biener via Gcc-patches wrote: > > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus > > <stefa...@linux.ibm.com> wrote: > >> > >> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote: > >>> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus > >>> <stefa...@linux.ibm.com> wrote: > >>>> > >>>> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: > >>>> [...] > >>>>>>> > >>>>>>> + /* Handle strlen like loops. */ > >>>>>>> + if (store_dr == NULL > >>>>>>> + && integer_zerop (pattern) > >>>>>>> + && TREE_CODE (reduction_iv.base) == INTEGER_CST > >>>>>>> + && TREE_CODE (reduction_iv.step) == INTEGER_CST > >>>>>>> + && integer_onep (reduction_iv.step) > >>>>>>> + && (types_compatible_p (TREE_TYPE (reduction_var), > >>>>>>> size_type_node) > >>>>>>> + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))) > >>>>>>> + { > >>>>>>> > >>>>>>> I wonder what goes wrong with a larger or smaller wrapping IV type? > >>>>>>> The iteration > >>>>>>> only stops when you load a NUL and the increments just wrap along > >>>>>>> (you're > >>>>>>> using the pointer IVs to compute the strlen result). Can't you > >>>>>>> simply truncate? > >>>>>> > >>>>>> I think truncation is enough as long as no overflow occurs in strlen or > >>>>>> strlen_using_rawmemchr. > >>>>>> > >>>>>>> For larger than size_type_node (actually larger than ptr_type_node > >>>>>>> would matter > >>>>>>> I guess), the argument is that since pointer wrapping would be > >>>>>>> undefined anyway > >>>>>>> the IV cannot wrap either. Now, the correct check here would IMHO be > >>>>>>> > >>>>>>> TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > >>>>>>> (ptr_type_node) > >>>>>>> || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > >>>>>>> > >>>>>>> ? > >>>>>> > >>>>>> Regarding the implementation which makes use of rawmemchr: > >>>>>> > >>>>>> We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > >>>>>> the maximal length we can determine of a string where each character > >>>>>> has > >>>>>> size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > >>>>>> ptrdiff type is undefined we have to make sure that if an overflow > >>>>>> occurs, then an overflow occurs for reduction variable, too, and that > >>>>>> this is undefined, too. However, I'm not sure anymore whether we want > >>>>>> to respect overflows in all cases. If TYPE_PRECISION (ptr_type_node) > >>>>>> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > >>>>>> this would mean that a single string consumes more than half of the > >>>>>> virtual addressable memory. At least for architectures where > >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is > >>>>>> reasonable > >>>>>> to neglect the case where computing pointer difference may overflow. > >>>>>> Otherwise we are talking about strings with lenghts of multiple > >>>>>> pebibytes. For other architectures we might have to be more precise > >>>>>> and make sure that reduction variable overflows first and that this is > >>>>>> undefined. > >>>>>> > >>>>>> Thus a conservative condition would be (I assumed that the size of any > >>>>>> integral type is a power of two which I'm not sure if this really > >>>>>> holds; > >>>>>> IIRC the C standard requires only that the alignment is a power of two > >>>>>> but not necessarily the size so I might need to change this): > >>>>>> > >>>>>> /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - > >>>>>> log2 (sizeof (load_type)) > >>>>>> or in other words return true if reduction variable overflows first > >>>>>> and false otherwise. */ > >>>>>> > >>>>>> static bool > >>>>>> reduction_var_overflows_first (tree reduction_var, tree load_type) > >>>>>> { > >>>>>> unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > >>>>>> unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > >>>>>> (reduction_var)); > >>>>>> unsigned size_exponent = wi::exact_log2 (wi::to_wide > >>>>>> (TYPE_SIZE_UNIT (load_type))); > >>>>>> return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - > >>>>>> size_exponent); > >>>>>> } > >>>>>> > >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64 > >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > >>>>>> && reduction_var_overflows_first (reduction_var, load_type) > >>>>>> > >>>>>> Regarding the implementation which makes use of strlen: > >>>>>> > >>>>>> I'm not sure what it means if strlen is called for a string with a > >>>>>> length greater than SIZE_MAX. Therefore, similar to the implementation > >>>>>> using rawmemchr where we neglect the case of an overflow for 64bit > >>>>>> architectures, a conservative condition would be: > >>>>>> > >>>>>> TYPE_PRECISION (size_type_node) == 64 > >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > >>>>>> && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION > >>>>>> (size_type_node)) > >>>>>> > >>>>>> I still included the overflow undefined check for reduction variable in > >>>>>> order to rule out situations where the reduction variable is unsigned > >>>>>> and overflows as many times until strlen(,_using_rawmemchr) overflows, > >>>>>> too. Maybe this is all theoretical nonsense but I'm afraid of uncommon > >>>>>> architectures. Anyhow, while writing this down it becomes clear that > >>>>>> this deserves a comment which I will add once it becomes clear which > >>>>>> way > >>>>>> to go. > >>>>> > >>>>> I think all the arguments about objects bigger than half of the > >>>>> address-space > >>>>> also are valid for 32bit targets and thus 32bit size_type_node (or > >>>>> 32bit pointer size). > >>>>> I'm not actually sure what's the canonical type to check against, > >>>>> whether > >>>>> it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype > >>>>> (the > >>>>> middle-end "offset" type used for all address computations). For weird > >>>>> reasons > >>>>> I'd lean towards 'sizetype' (for example some embedded targets have > >>>>> 24bit > >>>>> pointers but 16bit 'sizetype'). > >>>> > >>>> Ok, for the strlen implementation I changed from size_type_node to > >>>> sizetype and assume that no overflow occurs for string objects bigger > >>>> than half of the address space for 32-bit targets and up: > >>>> > >>>> (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1 > >>>> && TYPE_PRECISION (ptr_type_node) >= 32) > >>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > >>>> && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype)) > >>>> > >>>> and similarly for the rawmemchr implementation: > >>>> > >>>> (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node) > >>>> && TYPE_PRECISION (ptrdiff_type_node) >= 32) > >>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > >>>> && reduction_var_overflows_first (reduction_var, load_type)) > >>>> > >>>>> > >>>>>>> > >>>>>>> + if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))) > >>>>>>> + { > >>>>>>> + const char *msg = G_("assuming signed overflow does not > >>>>>>> occur " > >>>>>>> + "when optimizing strlen like loop"); > >>>>>>> + fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC); > >>>>>>> + } > >>>>>>> > >>>>>>> no, please don't add any new strict-overflow warnings ;) > >>>>>> > >>>>>> I just stumbled over code which produces such a warning and thought > >>>>>> this > >>>>>> is a hard requirement :D The new patch doesn't contain it anymore. > >>>>>> > >>>>>>> > >>>>>>> The generate_*_builtin routines need some factoring - if you > >>>>>>> code-generate > >>>>>>> into a gimple_seq you could use gimple_build () which would do the > >>>>>>> fold_stmt > >>>>>>> (not sure why you do that - you should see to fold the call, not > >>>>>>> necessarily > >>>>>>> the rest). The replacement of reduction_var and the dumping could be > >>>>>>> shared. > >>>>>>> There's also GET_MODE_NAME for the printing. > >>>>>> > >>>>>> I wasn't really sure which way to go. Use a gsi, as it is done by > >>>>>> existing generate_* functions, or make use of gimple_seq. Since the > >>>>>> latter uses internally also gsi I thought it is better to stick to gsi > >>>>>> in the first place. Now, after changing to gimple_seq I see the beauty > >>>>>> of it :) > >>>>>> > >>>>>> I created two helper functions generate_strlen_builtin_1 and > >>>>>> generate_reduction_builtin_1 in order to reduce code duplication. > >>>>>> > >>>>>> In function generate_strlen_builtin I changed from using > >>>>>> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit > >>>>>> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not > >>>>>> sure whether my intuition about the difference between implicit and > >>>>>> explicit builtins is correct. In builtins.def there is a small example > >>>>>> given which I would paraphrase as "use builtin_decl_explicit if the > >>>>>> semantics of the builtin is defined by the C standard; otherwise use > >>>>>> builtin_decl_implicit" but probably my intuition is wrong? > >>>>>> > >>>>>> Beside that I'm not sure whether I really have to call > >>>>>> build_fold_addr_expr which looks superfluous to me since > >>>>>> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL: > >>>>>> > >>>>>> tree fn = build_fold_addr_expr (builtin_decl_explicit > >>>>>> (BUILT_IN_STRLEN)); > >>>>>> gimple *fn_call = gimple_build_call (fn, 1, mem); > >>>>>> > >>>>>> However, since it is also used that way in the context of > >>>>>> generate_memset_builtin I didn't remove it so far. > >>>>>> > >>>>>>> I think overall the approach is sound now but the details still need > >>>>>>> work. > >>>>>> > >>>>>> Once again thank you very much for your review. Really appreciated! > >>>>> > >>>>> The patch lacks a changelog entry / description. It's nice if patches > >>>>> sent > >>>>> out for review are basically the rev as git format-patch produces. > >>>>> > >>>>> The rawmemchr optab needs documenting in md.texi > >>>> > >>>> While writing the documentation in md.texi I realised that other > >>>> instructions expect an address to be a memory operand which is not the > >>>> case for rawmemchr currently. At the moment the address is either an > >>>> SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a > >>>> consequence in the backend define_expand rawmemchr<mode> expects a > >>>> register operand and not a memory operand. Would it make sense to build > >>>> a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if > >>>> MEM_REF is supposed to be the canonical form here. > >>> > >>> I suppose the expander could use code similar to what > >>> expand_builtin_memset_args does, > >>> using get_memory_rtx. I suppose that we're using MEM operands because > >>> those > >>> can convey things like alias info or alignment info, something which > >>> REG operands cannot > >>> (easily). I wouldn't build a MEM_REF and try to expand that. > >> > >> The new patch contains the following changes: > >> > >> - In expand_RAWMEMCHR I'm using get_memory_rtx now. This means I had to > >> change linkage of get_memory_rtx to extern. > >> > >> - In function generate_strlen_builtin_using_rawmemchr I'm not > >> reconstructing the load type anymore from the base pointer but rather > >> pass it as a parameter from function transform_reduction_loop where we > >> also ensured that it is of integral type. Reconstructing the load > >> type was error prone since e.g. I didn't distinct between > >> pointer_plus_expr or addr_expr. Thus passing the load type should be > >> more solid. > >> > >> Regtested on IBM Z and x86. Ok for mainline? > > > > OK, and sorry for all the repeated delays. > > > > I'm running into PR56888 ( > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 ) on nvptx due to > this, f.i. in gcc/testsuite/gcc.c-torture/execute/builtins/strlen.c, > where gcc/testsuite/gcc.c-torture/execute/builtins/lib/strlen.c contains > a strlen function, with a strlen loop, which is transformed by > pass_loop_distribution into a __builtin_strlen, which is then expanded > into a strlen call, creating a self-recursive function. [ And on nvptx, > that happens to result in a compilation failure, which is how I found > this. ] > > According to this ( > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888#c21 ) comment: > ... > -fno-tree-loop-distribute-patterns is the reliable way to not > transform loops into library calls. > ... > > Then should we have something along the lines of: > ... > $ git diff > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > index 6fe59cd56855..9a211d30cd7e 100644 > --- a/gcc/tree-loop-distribution.c > +++ b/gcc/tree-loop-distribution.c > @@ -3683,7 +3683,11 @@ loop_distribution::transform_reduction_loop > && TYPE_PRECISION (ptr_type_node) >= 32) > || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type) > && TYPE_PRECISION (reduction_var_type) <= > TYPE_PRECISION (sizetype))) > - && builtin_decl_implicit (BUILT_IN_STRLEN)) > + && builtin_decl_implicit (BUILT_IN_STRLEN) > + && flag_tree_loop_distribute_patterns) > generate_strlen_builtin (loop, reduction_var, load_iv.base, > reduction_iv.base, loc); > else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE > (load_type)) > ... > ? > > Or is the comment no longer valid?
It is still valid - and yes, I think we need to guard it with this flag but please do it in the caller to transform_reduction_loop. > > Thanks, > - Tom