Re: More aggressive threading causing loop-interchange-9.c regression

Richard Biener via Gcc Thu, 09 Sep 2021 00:46:42 -0700

On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <al...@redhat.com> wrote:
>
>
>
> On 9/9/21 8:57 AM, Richard Biener wrote:
> > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <m...@suse.de> wrote:
> >>
> >> Hello,
> >>
> >> [lame answer to self]
> >>
> >> On Wed, 8 Sep 2021, Michael Matz wrote:
> >>
> >>>>> The forward threader guards against this by simply disallowing
> >>>>> threadings that involve different loops.  As I see
> >>>>
> >>>> The thread in question (5->9->3) is all within the same outer loop,
> >>>> though. BTW, the backward threader also disallows threading across
> >>>> different loops (see path_crosses_loops variable).
> >> ...
> >>> Maybe it's possible to not disable threading over latches alltogether in
> >>> the backward threader (like it's tried now), but I haven't looked at the
> >>> specific situation here in depth, so take my view only as opinion from a
> >>> large distance :-)
> >>
> >> I've now looked at the concrete situation.  So yeah, the whole path is in
> >> the same loop, crosses the latch, _and there's code following the latch
> >> on that path_.  (I.e. the latch isn't the last block in the path).  In
> >> particular, after loop_optimizer_init() (before any threading) we have:
> >>
> >>    <bb 3> [local count: 118111600]:
> >>    # j_19 = PHI <j_13(9), 0(7)>
> >>    sum_11 = c[j_19];
> >>    if (n_10(D) > 0)
> >>      goto <bb 8>; [89.00%]
> >>    else
> >>      goto <bb 5>; [11.00%]
> >>
> >>       <bb 8> [local count: 105119324]:
> >> ...
> >>
> >>    <bb 5> [local count: 118111600]:
> >>    # sum_21 = PHI <sum_14(4), sum_11(3)>
> >>    c[j_19] = sum_21;
> >>    j_13 = j_19 + 1;
> >>    if (n_10(D) > j_13)
> >>      goto <bb 9>; [89.00%]
> >>    else
> >>      goto <bb 6>; [11.00%]
> >>
> >>    <bb 9> [local count: 105119324]:
> >>    goto <bb 3>; [100.00%]
> >>
> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> >> pre-header of inner loop, but more importantly something that's not at the
> >> start of the outer loop.
> >>
> >> Now, any thread that includes the backedge 9->3 _including_ its
> >> destination (i.e. where the backedge isn't the last to-be-redirected edge)
> >> necessarily duplicates all code from that destination onto the back edge.
> >> Here it's the load from c[j] into sum_11.
> >>
> >> The important part is the code is emitted onto the back edge,
> >> conceptually; in reality it's simply included into the (new) latch block
> >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
> >> cfg_cleanup).
> >>
> >> That's what we can't have for some of our structural loop optimizers:
> >> there must be no code executed after the exit test (e.g. in the latch
> >> block).  (This requirement makes reasoning about which code is or isn't
> >> executed completely for an iteration trivial; simply everything in the
> >> body is always executed; e.g. loop interchange uses this to check that
> >> there are no memory references after the exit test, because those would
> >> then be only conditional and hence make loop interchange very awkward).
> >>
> >> Note that this situation can't be later rectified anymore: the duplicated
> >> instructions (because they are memory refs) must remain after the exit
> >> test.  Only by rerolling/unrotating the loop (i.e. noticing that the
> >> memory refs on the loop-entry path and on the back edge are equivalent)
> >> would that be possible, but that's something we aren't capable of.  Even
> >> if we were that would simply just revert the whole work that the threader
> >> did, so it's better to not even do that to start with.
> >>
> >> I believe something like below would be appropriate, it disables threading
> >> if the path contains a latch at the non-last position (due to being
> >> backwards on the non-first position in the array).  I.e. it disables
> >> rotating the loop if there's danger of polluting the back edge.  It might
> >> be improved if the blocks following (preceding!) the latch are themself
> >> empty because then no code is duplicated.  It might also be improved if
> >> the latch is already non-empty.  That code should probably only be active
> >> before the loop optimizers, but currently the backward threader isn't
> >> differentiating between before/after loop-optims.
> >>
> >> I haven't tested this patch at all, except that it fixes the testcase :)
> >
> > Lame comment at the current end of the thread - it's not threading through 
> > the
>
> I don't know why y'all keep using the word "lame".  On the contrary,
> these are incredibly useful explanations.  Thanks.
>
> > latch but threading through the loop header that's problematic, at least if 
> > the
> > end of the threading path ends within the loop (threading through the header
> > to the loop exit is fine).  Because in that situation you effectively 
> > created an
> > alternate loop entry.  Threading through the latch into the loop header is
> > fine but with simple latches that likely will never happen (if there are no
> > simple latches then the latch can contain the loop exit test).
> >
> > See tree-ssa-threadupdate.c:thread_block_1
> >
> >        e2 = path->last ()->e;
> >        if (!e2 || noloop_only)
> >          {
> >            /* If NOLOOP_ONLY is true, we only allow threading through the
> >               header of a loop to exit edges.  */
> >
> >            /* One case occurs when there was loop header buried in a jump
> >               threading path that crosses loop boundaries.  We do not try
> >               and thread this elsewhere, so just cancel the jump threading
> >               request by clearing the AUX field now.  */
> >            if (bb->loop_father != e2->src->loop_father
> >                && (!loop_exit_edge_p (e2->src->loop_father, e2)
> >                    || flow_loop_nested_p (bb->loop_father,
> >                                           e2->dest->loop_father)))
> >              {
> >                /* Since this case is not handled by our special code
> >                   to thread through a loop header, we must explicitly
> >                   cancel the threading request here.  */
> >                delete_jump_thread_path (path);
> >                e->aux = NULL;
> >                continue;
> >              }
>
> But this is for a threading path that crosses loop boundaries, which is
> not the case.  Perhaps we should restrict this further to threads within
> a loop?
>
> >
> > there are a lot of "useful" checks in this function and the backwards 
> > threader
> > should adopt those.  Note the backwards threader originally only did
> > FSM style threadings which are exactly those possibly "harmful" ones, 
> > forming
> > irreducible regions at worst or sub-loops at best.  That might explain the
> > lack of those checks.
>
> Also, the aforementioned checks are in jump_thread_path_registry, which
> is also shared by the backward threader.  These are thread discards
> _after_ a thread has been registered.


Yeah, that's indeed unfortunate.

>  The backward threader should also
> be using these restrictions.  Unless, I'm missing some interaction with
> the FSM/etc threading types as per the preamble to the snippet you provided:
>
>       if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
>           || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
>         continue;

Indeed.  But I understand the backwards threader does not (only) do FSM
threading now.

> You are right though, there are a lot of checks throughout the entire
> forward threader that should be audited and somehow shared.  It's on my
> back burner, but I'm running out of cycles here :-/.

yeah, it's quite a mess indeed and merging the path
validity/costing/code-transform
paths of both threaders would be incredibly useful.

Richard.

>
> Thanks.
> Aldy
>

Re: More aggressive threading causing loop-interchange-9.c regression

Reply via email to