On 2021-07-12 18:02, Richard Biener wrote:
On Mon, 12 Jul 2021, guojiufu wrote:

On 2021-07-12 16:57, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
>
>> On 2021-07-12 14:20, Richard Biener wrote:
>> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >
>> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> > I wonder if there's a way to query the target what modes the doloop
>> >> > pattern can handle (not being too familiar with the doloop code).
>> >>
>> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
>> >> hard to introspect.
>> >>
>> >> > Why do you need to do any checks besides the new type being able to
>> >> > represent all IV values?  The original doloop IV will never wrap
>> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> > but it also still will with a IV with larger type).
>>
>> The issue comes from U*_MAX (original short MAX), as you said: on which
>> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> larger type 'zero - 1' will be a very large number on larger type
>> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> value
>> (e.g. "0xff").
>
> But for the larger type the small type MAX + 1 fits and does not yield
> zero so it should still work exactly as before, no?  Of course you
> have to compute the + 1 in the larger type.
>
You are right, if compute the "+ 1" in the larger type it is ok, as below
code:
```
   /* Use type in word size may fast.  */
    if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
      {
        ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
        niter = fold_convert (ntype, niter);
      }

    tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
                             build_int_cst (ntype, 1));


add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
true);
```
The issue of this is, this code generates more stmt for doloop.xxx:
  _12 = (unsigned int) xx(D);
  _10 = _12 + 4294967295;
  _24 = (long unsigned int) _10;
  doloop.6_8 = _24 + 1;

if use previous patch, "+ 1" on original type, then the stmts will looks like:
  _12 = (unsigned int) xx(D);
  doloop.6_8 = (long unsigned int) _12;

This is the reason for checking
   wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))

But this then only works when there's an upper bound on the number
of iterations.  Note you should not use TYPE_MAX_VALUE here but
you can instead use

     wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
(TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));

Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)), TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).

It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK (NODE)->type_non_common.maxval" which do a numerical-check and return the field of maxval. And then call to
wi::to_widest

The other code "widest_int::from (wi::max_value (..,..),..)" calls wi::max_value
and widest_int::from.

I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?


I think the -1 above comes from number of latch iterations vs. header
entries - it's a common source for this kind of issues.  range analysis
might be able to prove that we can still merge the two adds even with
the intermediate extension.
Yes, as you mentioned here, it relates to number of latch iterations
For loop looks like : while (l < n) or for (i = 0; i < n; i++)
This kind of loop, the niter is used to be 'n - 1' after transformed
into 'do-while' form.
I would see how to merge these two adds safely at this point
when generating doloop iv. (maybe range info, thanks!


Is this pre-loop extra add really offsetting the in-loop doloop
improvements?
I'm not catching this question too much, sorry.  I guess your concern
is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
is decreasing niter until 0 (all number >0).
If misunderstand,  thanks for point out.


>> >>
>> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >>
>> >> > I'd have expected sth like
>> >> >
>> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> > (ntype));
>> >> >
>> >> > thus the decision made using a mode - which is also why I wonder
>> >> > if there's a way to query the target for this.  As you say,
>> >> > it _may_ be fast, so better check (somehow).
>>
>>
>> I was also thinking of using hooks like type_for_size/type_for_mode.
>>     /* Use type in word size may fast.  */
>>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> (ntype))))
>>       {
>>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>         base = fold_convert (ntype, base);
>>       }
>>
>> As you pointed out, this does not query the mode from targets.
>> As Segher pointed out "doloop_end" checks unsupported mode, while it seems
>> not easy to use it in tree-ssa-loop-ivopts.c.
>> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> requires
>> Pmode/DImode; while there are other targets that work on other 'mode' (e.g.
>> SI).
>>
>>
>> In doloop_optimize, there is code:
>>
>> ```
>>     mode = desc->mode;
>> .....
>>     doloop_reg = gen_reg_rtx (mode);
>>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> start_label);
>>
>>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>> 1;
>>     if (! doloop_seq
>>         && mode != word_mode
>>         /* Before trying mode different from the one in that # of
>> iterations is
>>            computed, we must be sure that the number of iterations fits
>> into
>>            the new mode.  */
>>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>>             || wi::leu_p (iterations_max, word_mode_max)))
>>       {
>>         if (word_mode_size > GET_MODE_PRECISION (mode))
>>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
>>         else
>>           count = lowpart_subreg (word_mode, count, mode);
>>         PUT_MODE (doloop_reg, word_mode);
>>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>       }
>>     if (! doloop_seq)
>>       {
>>         if (dump_file)
>>           fprintf (dump_file,
>>                    "Doloop: Target unwilling to use doloop pattern!\n");
>>         return false;
>>       }
>> ```
>> The above code first tries the mode of niter_desc by call
>> targetm.gen_doloop_end
>> to see if the target can generate doloop insns, if fail, then try to use
>> 'word_mode' against gen_doloop_end.
>>
>>
>> >>
>> >> Almost all targets just use Pmode, but there is no such guarantee I
>> >> think, and esp. some targets that do not have machine insns for this
>> >> (but want to generate different code for this anyway) can do pretty much
>> >> anything.
>> >>
>> >> Maybe using just Pmode here is good enough though?
>> >
>> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>> > if we go for any hardcoded mode.  s390x for example seems to handle
>> > both SImode and DImode (but names the helper gen_doloop_si64
>> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>> > is going to be difficult since the expander doesn't have any mode,
>> > so we'd have to actually try emit RTL here.
>>
>> Instead of using hardcode mode, maybe we could add a hook for targets to
>> return
>> the preferred mode.
>
> That's a possiblity of course.  Like the following which just shows the
> default implementation then (pass in current mode, return a more preferred
> mode or the mode itself)
>
> enum machine_mode
> prefered_doloop_mode (enum machine_mode mode)
> {
>   return mode;
> }
>
Yes, thanks!

Checking current do_loop_end in targets, in general, when do_loop_end requires
SI mode, the target is defining Pmode as SImode and word_mode (from
BITS_PER_WORD
which defaults from UNITS_PER_WORD) is also defined to align with SI mode. When do_loop_end requires DI mode, the target is defining Pmode as DImode
and word_mode/UNITS_PER_WORD is also defined to align with DI mode.

So, if aggressively, then by default we may just return word_mode.

Note we still have to check whether the prefered mode is valid
(passing in TImode but then returning DImode would be wrong).
Yes, some code like in doloop_optimize (rtl code)
```
mode != word_mode
&& (word_mode_size >= GET_MODE_PRECISION (mode)
    || wi::leu_p (iterations_max, word_mode_max))
```

Thanks again for your comments!

BR,
Jiufu Guo

Richard.

BR,

Jiufu Guo.


>>
>> Thanks for those valuable comments!
>>
>> Jiufu Guo
>>
>>
>>
>> >
>> > Richard.
>>
>>

Reply via email to