On 2021-07-12 22:46, Richard Biener wrote:
On Mon, 12 Jul 2021, guojiufu wrote:
On 2021-07-12 18:02, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
>
>> On 2021-07-12 16:57, Richard Biener wrote:
>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >
>> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >> >
>> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> >> > I wonder if there's a way to query the target what modes the doloop
>> >> >> > pattern can handle (not being too familiar with the doloop code).
>> >> >>
>> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> >> perhaps? Although that is a define_expand, not a define_insn, so it
>> >> >> is
>> >> >> hard to introspect.
>> >> >>
>> >> >> > Why do you need to do any checks besides the new type being able to
>> >> >> > represent all IV values? The original doloop IV will never wrap
>> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >> >> > become
>> >> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> >> > but it also still will with a IV with larger type).
>> >>
>> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>> >> niter + 1 becomes zero. And because the step for doloop is -1; then, on
>> >> larger type 'zero - 1' will be a very large number on larger type
>> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> >> value
>> >> (e.g. "0xff").
>> >
>> > But for the larger type the small type MAX + 1 fits and does not yield
>> > zero so it should still work exactly as before, no? Of course you
>> > have to compute the + 1 in the larger type.
>> >
>> You are right, if compute the "+ 1" in the larger type it is ok, as below
>> code:
>> ```
>> /* Use type in word size may fast. */
>> if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>> {
>> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> niter = fold_convert (ntype, niter);
>> }
>>
>> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>> build_int_cst (ntype, 1));
>>
>>
>> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>> true);
>> ```
>> The issue of this is, this code generates more stmt for doloop.xxx:
>> _12 = (unsigned int) xx(D);
>> _10 = _12 + 4294967295;
>> _24 = (long unsigned int) _10;
>> doloop.6_8 = _24 + 1;
>>
>> if use previous patch, "+ 1" on original type, then the stmts will looks
>> like:
>> _12 = (unsigned int) xx(D);
>> doloop.6_8 = (long unsigned int) _12;
>>
>> This is the reason for checking
>> wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>
> But this then only works when there's an upper bound on the number
> of iterations. Note you should not use TYPE_MAX_VALUE here but
> you can instead use
>
> wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
(ntype)),
TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).
It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
(NODE)->type_non_common.maxval"
which do a numerical-check and return the field of maxval. And then
call to
wi::to_widest
The other code "widest_int::from (wi::max_value (..,..),..)" calls
wi::max_value
and widest_int::from.
I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
underlying modes precision. At some point we've tried to eliminate
most of its uses, not sure what the situation/position is right now.
Ok, get it, thanks.
I will use "widest_int::from (wi::max_value (..,..),..)".
> I think the -1 above comes from number of latch iterations vs. header
> entries - it's a common source for this kind of issues. range analysis
> might be able to prove that we can still merge the two adds even with
> the intermediate extension.
Yes, as you mentioned here, it relates to number of latch iterations
For loop looks like : while (l < n) or for (i = 0; i < n; i++)
This kind of loop, the niter is used to be 'n - 1' after transformed
into 'do-while' form.
For this kind of loop, the max value for the number of iteration "n - 1"
would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
This kind of loop is already common, and we could use wi::ltu (max,
max_value_type)
to check.
For loop looks like:
do ;
while (n-- > 0); /* while (n-- > low); */
The niter_desc->max will wi::eq to max_value_type, and niter would be
"n",
and then doloop.xx is 'n+1'.
I would see how to merge these two adds safely at this point
when generating doloop iv. (maybe range info, thanks!
>
> Is this pre-loop extra add really offsetting the in-loop doloop
> improvements?
I'm not catching this question too much, sorry. I guess your concern
is if the "+1" is an offset: it may not, "+1" may be just that
doloop.xx
is decreasing niter until 0 (all number >0).
If misunderstand, thanks for point out.
I'm questioning the argument that not being able to eliminate the +1-1
pair effects the overall cost improvement for the doloop IV type change
which hopefully is _not_ just loop IV setup cost but improving
performance in the loop body itself?
Yes, I think so. It would affect doloop IV setup cost, it may also
affect
loop itself on some targets, if the target prefers a changed type:
using one jump-counter instruction for wider type, but cmp/jump
instructions
for other types.
If the loop is nested in outer-loop, eliminating +1-1 would improving
outer-loop performance.
BR,
Jiufu Guo.
Richard.
>
>> >> >>
>> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >> >>
>> >> >> > I'd have expected sth like
>> >> >> >
>> >> >> > ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> >> > (ntype));
>> >> >> >
>> >> >> > thus the decision made using a mode - which is also why I wonder
>> >> >> > if there's a way to query the target for this. As you say,
>> >> >> > it _may_ be fast, so better check (somehow).
>> >>
>> >>
>> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>> >> /* Use type in word size may fast. */
>> >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> >> && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> >> (ntype))))
>> >> {
>> >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >> base = fold_convert (ntype, base);
>> >> }
>> >>
>> >> As you pointed out, this does not query the mode from targets.
>> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>> >> seems
>> >> not easy to use it in tree-ssa-loop-ivopts.c.
>> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> >> requires
>> >> Pmode/DImode; while there are other targets that work on other 'mode'
>> >> (e.g.
>> >> SI).
>> >>
>> >>
>> >> In doloop_optimize, there is code:
>> >>
>> >> ```
>> >> mode = desc->mode;
>> >> .....
>> >> doloop_reg = gen_reg_rtx (mode);
>> >> rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >> start_label);
>> >>
>> >> word_mode_size = GET_MODE_PRECISION (word_mode);
>> >> word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>> >> 1;
>> >> if (! doloop_seq
>> >> && mode != word_mode
>> >> /* Before trying mode different from the one in that # of
>> >> iterations is
>> >> computed, we must be sure that the number of iterations fits
>> >> into
>> >> the new mode. */
>> >> && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >> || wi::leu_p (iterations_max, word_mode_max)))
>> >> {
>> >> if (word_mode_size > GET_MODE_PRECISION (mode))
>> >> count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>> >> mode);
>> >> else
>> >> count = lowpart_subreg (word_mode, count, mode);
>> >> PUT_MODE (doloop_reg, word_mode);
>> >> doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>> >> }
>> >> if (! doloop_seq)
>> >> {
>> >> if (dump_file)
>> >> fprintf (dump_file,
>> >> "Doloop: Target unwilling to use doloop pattern!\n");
>> >> return false;
>> >> }
>> >> ```
>> >> The above code first tries the mode of niter_desc by call
>> >> targetm.gen_doloop_end
>> >> to see if the target can generate doloop insns, if fail, then try to use
>> >> 'word_mode' against gen_doloop_end.
>> >>
>> >>
>> >> >>
>> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>> >> >> think, and esp. some targets that do not have machine insns for this
>> >> >> (but want to generate different code for this anyway) can do pretty
>> >> >> much
>> >> >> anything.
>> >> >>
>> >> >> Maybe using just Pmode here is good enough though?
>> >> >
>> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>> >> > if we go for any hardcoded mode. s390x for example seems to handle
>> >> > both SImode and DImode (but names the helper gen_doloop_si64
>> >> > for SImode?!). But indeed it looks like somehow querying doloop_end
>> >> > is going to be difficult since the expander doesn't have any mode,
>> >> > so we'd have to actually try emit RTL here.
>> >>
>> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>> >> return
>> >> the preferred mode.
>> >
>> > That's a possiblity of course. Like the following which just shows the
>> > default implementation then (pass in current mode, return a more
>> > preferred
>> > mode or the mode itself)
>> >
>> > enum machine_mode
>> > prefered_doloop_mode (enum machine_mode mode)
>> > {
>> > return mode;
>> > }
>> >
>> Yes, thanks!
>>
>> Checking current do_loop_end in targets, in general, when do_loop_end
>> requires
>> SI mode, the target is defining Pmode as SImode and word_mode (from
>> BITS_PER_WORD
>> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>>
>> So, if aggressively, then by default we may just return word_mode.
>
> Note we still have to check whether the prefered mode is valid
> (passing in TImode but then returning DImode would be wrong).
Yes, some code like in doloop_optimize (rtl code)
```
mode != word_mode
&& (word_mode_size >= GET_MODE_PRECISION (mode)
|| wi::leu_p (iterations_max, word_mode_max))
```
Thanks again for your comments!
BR,
Jiufu Guo
>
> Richard.
>
>> BR,
>>
>> Jiufu Guo.
>>
>>
>> >>
>> >> Thanks for those valuable comments!
>> >>
>> >> Jiufu Guo
>> >>
>> >>
>> >>
>> >> >
>> >> > Richard.
>> >>
>> >>
>>