On 9/9/2021 2:58 AM, Richard Biener wrote:
On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <al...@redhat.com> wrote:
On 9/9/21 9:45 AM, Richard Biener wrote:
On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <al...@redhat.com> wrote:
On 9/9/21 8:57 AM, Richard Biener wrote:
On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <m...@suse.de> wrote:
Hello,
[lame answer to self]
On Wed, 8 Sep 2021, Michael Matz wrote:
The forward threader guards against this by simply disallowing
threadings that involve different loops. As I see
The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).
...
Maybe it's possible to not disable threading over latches altogether in
the backward threader (as is tried now), but I haven't looked at the
specific situation here in depth, so take my view only as an opinion from a
large distance :-)
I've now looked at the concrete situation. So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_. (I.e. the latch isn't the last block in the path). In
particular, after loop_optimizer_init() (before any threading) we have:
  <bb 3> [local count: 118111600]:
  # j_19 = PHI <j_13(9), 0(7)>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:
  ...

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]
With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.
Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.
The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).
That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block). This requirement makes it trivial to reason about which code is
executed in a complete iteration: everything in the body, always. Loop
interchange, for example, relies on this to check that there are no memory
references after the exit test, because those would be only conditionally
executed and hence make loop interchange very awkward.
Note that this situation can't be rectified later: the duplicated
instructions (because they are memory refs) must remain after the exit
test. Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of. Even
if we were, that would simply revert the whole work the threader did,
so it's better not to do it in the first place.
I believe something like the patch below would be appropriate: it disables
threading if the path contains a latch at a non-last position (since the
path array is stored backwards, that means a non-first position in the
array). I.e. it disables rotating the loop if there's danger of polluting
the back edge. It could be refined: if the blocks following (preceding!)
the latch are themselves empty, then no code is duplicated; likewise if
the latch is already non-empty. That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop optimizations.
I haven't tested this patch at all, except that it fixes the testcase :)
Lame comment at the current end of the thread - it's not threading through the
I don't know why y'all keep using the word "lame". On the contrary,
these are incredibly useful explanations. Thanks.
latch but threading through the loop header that's problematic, at least
if the threading path ends within the loop (threading through the header
to the loop exit is fine), because in that situation you effectively
create an alternate loop entry. Threading through the latch into the loop
header is fine, but with simple latches that will likely never happen (if
there are no simple latches, then the latch can contain the loop exit test).
See tree-ssa-threadupdate.c:thread_block_1
      e2 = path->last ()->e;
      if (!e2 || noloop_only)
	{
	  /* If NOLOOP_ONLY is true, we only allow threading through the
	     header of a loop to exit edges.  */

	  /* One case occurs when there was loop header buried in a jump
	     threading path that crosses loop boundaries.  We do not try
	     and thread this elsewhere, so just cancel the jump threading
	     request by clearing the AUX field now.  */
	  if (bb->loop_father != e2->src->loop_father
	      && (!loop_exit_edge_p (e2->src->loop_father, e2)
		  || flow_loop_nested_p (bb->loop_father,
					 e2->dest->loop_father)))
	    {
	      /* Since this case is not handled by our special code
		 to thread through a loop header, we must explicitly
		 cancel the threading request here.  */
	      delete_jump_thread_path (path);
	      e->aux = NULL;
	      continue;
	    }
But this is for a threading path that crosses loop boundaries, which is
not the case here. Perhaps we should restrict this further to threads
within a single loop?
there are a lot of "useful" checks in this function and the backwards threader
should adopt those. Note the backwards threader originally only did
FSM style threadings which are exactly those possibly "harmful" ones, forming
irreducible regions at worst or sub-loops at best. That might explain the
lack of those checks.
Also, the aforementioned checks are in jump_thread_path_registry, which
is also shared by the backward threader. These are thread discards
_after_ a thread has been registered.
Yeah, that's indeed unfortunate.
The backward threader should also
be using these restrictions. Unless, I'm missing some interaction with
the FSM/etc threading types as per the preamble to the snippet you provided:
      if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
	  || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
	continue;
Indeed. But I understand the backwards threader does not (only) do FSM
threading now.
If it does, it was not part of my rewrite. I was careful to not touch
anything dealing with either path profitability or low-level path
registering.
The path registering is in back_threader_registry::register_path(). We
only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that
those are only EDGE_FSM_THREADs??
Well, if the backwards threader classifies everything as FSM, that's
probably inaccurate, since only threads through the loop latch are "FSM".
There is the comment
FSM is used all over the place, but it really just applies to threading
through the latch and through an indirect jump/switch. Many of the
places that say FSM should probably just say "backwards" or something
similar.
  /* If this path does not thread through the loop latch, then we are
     using the FSM threader to find old style jump threads.  This
     is good, except the FSM threader does not re-use an existing
     threading path to reduce code duplication.

     So for that case, drastically reduce the number of statements
     we are allowed to copy.  */
so these cases should use the "old style" validity/costing metrics and thus
classify threading opportunities in a different way?
It's not that simple. The backwards threader uses a different block
copier and CFG update mechanism that does not support commonizing
threads from multiple paths to the same target. So each jump thread
ends up with its own copy of the path, which can lead to excessive code
bloat. The forward threader knows how to commonize those paths to cut
down on the duplicates and thus can accept larger blocks without having
such a negative impact on code size.
I think today "backwards" vs. "forwards" only refers to the way we find
threading opportunities.
Correct.
jeff