Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc
First of all, thanks so much for this incredibly useful explanation.
It helps a lot.


I'm still trying to wrap my head around this, so please bear with me.

On 9/7/21 4:45 PM, Michael Matz wrote:

> Hello,
>
> On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:
>
>> The regression comes from the simple_reduc_1() function in
>> tree-ssa/loop-interchange-9.c, and it involves the following path:
>>
>> === BB 5 
>> Imports: n_10(D)  j_19
>> Exports: n_10(D)  j_13  j_19
>>   j_13 : j_19(I)
>> n_10(D) int [1, 257]
>> j_19 int [0, 256]
>> Relational : (j_13 > j_19)
>>   [local count: 118111600]:
>>  # sum_21 = PHI 
>>  c[j_19] = sum_21;
>>  j_13 = j_19 + 1;
>>  if (n_10(D) > j_13)
>>    goto ; [89.00%]
>>  else
>>    goto ; [11.00%]


> So, this is the outer loop's exit block ...
>
>> === BB 9 
>> n_10(D) int [2, 257]
>> j_13 int [1, 256]
>> Relational : (n_10(D) > j_19)
>> Relational : (n_10(D) > j_13)
>>   [local count: 105119324]:
>>  goto ; [100.00%]


> ... this is the outer loop's latch block ...
>
>> === BB 3 
>> Imports: n_10(D)
>> Exports: n_10(D)
>> n_10(D) int [1, +INF]
>>   [local count: 118111600]:
>>  # j_19 = PHI 
>>  sum_11 = c[j_19];
>>  if (n_10(D) > 0)
>>    goto ; [89.00%]
>>  else
>>    goto ; [11.00%]
>
> ... and this is the outer loop's header, as well as the inner loop's pre-header.


Actually, the inner loop's pre-header seems to be BB8, which is empty,
and leads into BB4, which is the header of the inner loop.

> The attempted thread hence involves a back-edge (of the outer loop) and a
> loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> destroy some properties that our loop optimizers rely on, e.g. that the
> loop header of a natural loop is only reached by two edges: entry edge and
> back edge, and that the latch blocks are empty and that there's only a
> single latch.  (What exactly we require depends on several flags in
> loops_state_satisfies_p).
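
For reference, those properties can be sketched as a predicate over GCC's
loop representation; this is a minimal illustration only, not the actual
checks, which are spread across the loop optimizers and keyed off the
loops_state_satisfies_p flags:

  /* Sketch: the "simple" shape a natural loop is expected to have.  */
  static bool
  loop_shape_is_simple_p (class loop *loop)
  {
    /* Header reached by exactly two edges: entry edge and back edge.  */
    if (EDGE_COUNT (loop->header->preds) != 2)
      return false;
    /* A single latch block...  */
    if (!loop->latch || !single_succ_p (loop->latch))
      return false;
    /* ...that contains no code.  */
    return empty_block_p (loop->latch);
  }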


But threading 3->8 doesn't involve a loop-entry edge.  It involves a new 
edge into the *pre-header* of the inner loop.


I am attaching the IL right after threading and before we clean up the 
CFG (il-after-threading.txt).


After threading we have rewritten the 5->9 edge into 5->11:

   [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 105119324]:
  # j_7 = PHI 
  sum_6 = c[j_19];
  goto ; [100.00%]

   [local count: 0]:  ;; This becomes unreachable.
  goto ; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.

Question: could BB12 not be considered the loop latch now and BB8 be the 
outer loop's entry?  This would also mean that BB3 is now outside of 
the outer loop, which means the threader peeled off the first iteration 
of the loop.  Or is it a requirement that the latch be empty?

>> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
>> the loop cannot legally iterate outside the size of c[256].  So, j_13
>> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
>> This allows us to thread BB3 to BB8.


> So, IIUC doing this threading would create a new entry to BB8: it would
> then be entered by its natural entry (bb3), by its natural back edge
> (whatever bb that is now) and the new entry from the threaded outer back
> edge (bb9 or bb5?).


As I mentioned, BB8 looks like the pre-header to the inner loop.  But 
yes, it now has multiple entries: the edge from BB3, and the back edge 
from BB12 (which seems like it could be a new latch, since it's the only 
back edge).

> The inner loop wouldn't then be recognized as natural anymore and the
> whole nest not as perfect, and hence loop interchange can't easily happen
> anymore.  Most of our other structural loop optimizers would have problems
> with that situation as well.
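
For illustration, a perfect nest is one whose outer body is exactly the
inner loop, as in this minimal sketch:

  for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
      a[i][j] = a[i][j] + b[i][j];

The reduction in simple_reduc_1 (sum = c[j] before the inner loop, c[j] =
sum after it) is only tolerated because interchange has dedicated handling
for that store-motion shape; code left in the outer latch has no such
handling.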


If I clean up the CFG and call loop_optimizer_init() again at this 
point, what is destroyed is the outer loop, not the inner loop.  If you 
look at the IL at this point 
(il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 
going back up to BB4, though the conditional is now in BB6 (eeech):


  
...
...
...


   [local count: 118111600]:
  # sum_21 = PHI 
  # j_18 = PHI 
  c[j_18] = sum_21;
  j_13 = j_18 + 1;
  if (n_10(D) > j_13)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  # j_7 = PHI 
  sum_6 = c[j_7];
  goto ; [100.00%]

Perhaps this looks sufficiently different that the loop optimizer can't 
recognize it as a loop?

>> All the blocks lie within the same loop, and the path passes all the
>> tests in profitable_path_p().
>>
>> Is there something about the shape of this path that should make it
>> invalid in the backward threader, or should the test be marked with
>> -fdisable-tree-threadN (or something else entirely)?
>
> This is a functionality test checking that the very necessary interchange
> in this test does happen with default plus -floop-interchange.

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Richard Biener via Gcc
On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez  wrote:
>
> First of all, thanks so much for this incredibly useful explanation.
> It helps a lot.
>
> I'm still trying to wrap my head around this, so please bear with me.
>
> On 9/7/21 4:45 PM, Michael Matz wrote:
> > Hello,
> >
> > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:
> >
> >> The regression comes from the simple_reduc_1() function in
> >> tree-ssa/loop-interchange-9.c, and it involves the following path:
> >>
> >> === BB 5 
> >> Imports: n_10(D)  j_19
> >> Exports: n_10(D)  j_13  j_19
> >>   j_13 : j_19(I)
> >> n_10(D)  int [1, 257]
> >> j_19 int [0, 256]
> >> Relational : (j_13 > j_19)
> >>   [local count: 118111600]:
> >>  # sum_21 = PHI 
> >>  c[j_19] = sum_21;
> >>  j_13 = j_19 + 1;
> >>  if (n_10(D) > j_13)
> >>goto ; [89.00%]
> >>  else
> >>goto ; [11.00%]
> >
> > So, this is the outer loop's exit block ...
> >
> >> === BB 9 
> >> n_10(D)  int [2, 257]
> >> j_13 int [1, 256]
> >> Relational : (n_10(D) > j_19)
> >> Relational : (n_10(D) > j_13)
> >>   [local count: 105119324]:
> >>  goto ; [100.00%]
> >
> > ... this is the outer loop's latch block ...
> >
> >> === BB 3 
> >> Imports: n_10(D)
> >> Exports: n_10(D)
> >> n_10(D)  int [1, +INF]
> >>   [local count: 118111600]:
> >>  # j_19 = PHI 
> >>  sum_11 = c[j_19];
> >>  if (n_10(D) > 0)
> >>goto ; [89.00%]
> >>  else
> >>goto ; [11.00%]
> >
> > ... and this is the outer loop's header, as well as the inner loop's pre-header.
>
> Actually, the inner loop's pre-header seems to be BB8, which is empty,
> and leads into BB4, which is the header of the inner loop.
>
> > The attempted thread hence involves a back-edge (of the outer loop) and a
> > loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> > destroy some properties that our loop optimizers rely on, e.g. that the
> > loop header of a natural loop is only reached by two edges: entry edge and
> > back edge, and that the latch blocks are empty and that there's only a
> > single latch.  (What exactly we require depends on several flags in
> > loops_state_satisfies_p).
>
> But threading 3->8 doesn't involve a loop-entry edge.  It involves a new
> edge into the *pre-header* of the inner loop.
>
> I am attaching the IL right after threading and before we clean up the
> CFG (il-after-threading.txt).
>
> After threading we have rewritten the 5->9 edge into 5->11:
>
> [local count: 118111600]:
># sum_21 = PHI 
>c[j_19] = sum_21;
>j_13 = j_19 + 1;
>if (n_10(D) > j_13)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> [local count: 105119324]:
>
> [local count: 105119324]:
># j_7 = PHI 
>sum_6 = c[j_19];
>goto ; [100.00%]
>
> [local count: 0]: ;; This becomes unreachable.
>goto ; [100.00%]
>
> Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.
>
> Question: could BB12 not be considered the loop latch now and BB8 be the
> outer loop's entry?  This would also mean that BB3 is now outside of
> the outer loop, which means the threader peeled off the first iteration
> of the loop.  Or is it a requirement that the latch be empty?
>
> >
> >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
> >> the loop cannot legally iterate outside the size of c[256].  So, j_13
> >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
> >> This allows us to thread BB3 to BB8.
> >
> > So, IIUC doing this threading would create a new entry to BB8: it would
> > then be entered by its natural entry (bb3), by its natural back edge
> > (whatever bb that is now) and the new entry from the threaded outer back
> > edge (bb9 or bb5?).
>
> As I mentioned, BB8 looks like the pre-header to the inner loop.  But
> yes, it now has multiple entries: the edge from BB3, and the back edge
> from BB12 (which seems like it could be a new latch, since it's the only
> back edge).

It would be helpful to have the patch causing the issue to look at the IL.
But as Micha said, there needs to be a perfect loop nest for interchange
to work.

Richard.

> >
> > The inner loop wouldn't then be recognized as natural anymore and the
> > whole nest not as perfect, and hence loop interchange can't easily happen
> > anymore.  Most of our other structural loop optimizers would have problems
> > with that situation as well.
>
> If I clean up the CFG and call loop_optimizer_init() again at this
> point, what is destroyed is the outer loop, not the inner loop.  If you
> look at the IL at this point
> (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7
> going back up to BB4, though the conditional is now in BB6 (eeech):
>
>
> ...
> ...
> ...
>
>
> [local count: 118111600]:
># sum_21 = PHI 
># j_18 = PHI 
>c[j_18] = sum_21;
>j_13 = j_18 + 1;
>if (n_10(D) > j_13)
>

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc

> It would be helpful to have the patch causing the issue to look at the IL.
> But as Micha said, there needs to be a perfect loop nest for interchange
> to work.
>
> Richard.


Absolutely!  I'm attaching the reduced testcase, as well as the patch.

The problematic thread shows up in the thread2 dump:

Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0 
insns) bb:5

  Control statement insns: 2
  Overall: 2 insns
  Registering FSM jump thread: (5, 9) incoming edge;  (9, 3)  (3, 8) 
nocopy; (3, 8)


Thanks.
Aldy
commit 1bf3f76a5ff075396b5b9f5f88d6b18649dac2ce
Author: Aldy Hernandez 
Date:   Sun Sep 5 16:54:00 2021 +0200

Take into account global ranges when folding statements in path solver.

The path solver used by the backwards threader can refine a folded
range by taking into account any global range known.  This patch
intersects the folded range with whatever global information we have.

gcc/ChangeLog:

* gimple-range-path.cc (path_range_query::internal_range_of_expr):
Intersect with global ranges.

> FAIL: gcc.dg/tree-ssa/loop-interchange-9.c scan-tree-dump-times linterchange "Loop_pair is interchanged" 1
> FAIL: gcc.dg/tree-ssa/cunroll-15.c scan-tree-dump optimized "return 1;"

diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
index a4fa3b296ff..c616b65756f 100644
--- a/gcc/gimple-range-path.cc
+++ b/gcc/gimple-range-path.cc
@@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt)
   basic_block bb = stmt ? gimple_bb (stmt) : exit_bb ();
   if (stmt && range_defined_in_block (r, name, bb))
     {
+      if (TREE_CODE (name) == SSA_NAME)
+	r.intersect (gimple_range_global (name));
+
       set_cache (r, name);
       return true;
     }
/* { dg-do run } */
/* { dg-options "-O2 -floop-interchange -fdump-tree-linterchange-details" } */
/* { dg-require-effective-target size20plus } */
/* { dg-skip-if "too big data segment" { visium-*-* } } */

#define M 256
int a[M][M], b[M][M], c[M], d[M];
void __attribute__((noinline))
simple_reduc_1 (int n)
{
  for (int j = 0; j < n; j++)
{
  int sum = c[j];
  for (int i = 0; i < n; i++)
	sum = sum + a[i][j]*b[i][j];

  c[j] = sum;
}
}
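
For reference, a successful interchange turns the above into roughly the
following (an illustration, not the pass's literal output), with the scalar
reduction rewritten into loads/stores of c[j] so the loops can be swapped:

  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      c[j] = c[j] + a[i][j]*b[i][j];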


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Richard Biener via Gcc
On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez  wrote:
>
> > It would be helpful to have the patch causing the issue to look at the IL.
> > But as Micha said, there needs to be a perfect loop nest for interchange
> > to work.
> >
> > Richard.
>
> Absolutely!  I'm attaching the reduced testcase, as well as the patch.
>
> The problematic thread shows up in the thread2 dump:
>
> Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0
> insns) bb:5
>Control statement insns: 2
>Overall: 2 insns
>Registering FSM jump thread: (5, 9) incoming edge;  (9, 3)  (3, 8)
> nocopy; (3, 8)

Well, so as Micha said, the threading destroys the outer loop by
making it have two entries - we actually _do_ recover from this
but it results in a non-empty latch of the outer loop and thus
the loop is no longer a perfect nest.  The interchange pass has
special-sauce to recover from the loop store motion applied
but not to handle the kind of loop rotation the jump threading
caused.

The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see it,
the threading done here should be 7->3 (outer loop entry) to bb 8
rather than one involving the backedge.  Also note the condition
is invariant in the loop and in fact subsumed by the condition
outside of the loop and it should have been simplified by
VRP after pass_ch but I see there's a jump threading pass
inbetween pass_ch and the next VRP which is likely the
problem.

Why does jump threading not try to simplify a condition before
attempting to thread through it?  After all, ranger should be around?

Richard.

> Thanks.
> Aldy


[RISCV] RISC-V GNU Toolchain Biweekly Sync-up call (Sept 9, 2021)

2021-09-08 Thread 吴伟
Hi all,

There is an agenda for tomorrow's meeting. If you have topics to
discuss or share, please let me know and I can add them to the agenda.


Agenda:

1. Progress of RISC-V subextensions toolchain developing
2. Open Discussion

Topic: RISC-V GNU Toolchain Biweekly Sync-up

Join Zoom Meeting
https://zoom.com.cn/j/89393600951?pwd=ZFpWMkZ6Tm1TbUFXT1hZZjZZMHhRQT09

Meeting ID: 893 9360 0951
Passcode: 899662


BEIJING, China
11:00p Thu, Sep 9 2021
12:00a Fri, Sep 10 2021

PDT/PST, Pacific Daylight Time (US)
8:00a Thu, Sep 9 2021
9:00a Thu, Sep 9 2021

PHILADELPHIA, United States, Pennsylvania
11:00a Thu, Sep 9 2021
12:00p Thu, Sep 9 2021

PARIS, France
5:00p Thu, Sep 9 2021
6:00p Thu, Sep 9 2021

-- 
Best wishes,
Wei Wu (吴伟)


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc

On 9/8/21 3:49 PM, Richard Biener wrote:


> On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez  wrote:
>>
>>> It would be helpful to have the patch causing the issue to look at the IL.
>>> But as Micha said, there needs to be a perfect loop nest for interchange
>>> to work.
>>>
>>> Richard.
>>
>> Absolutely!  I'm attaching the reduced testcase, as well as the patch.
>>
>> The problematic thread shows up in the thread2 dump:
>>
>> Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0
>> insns) bb:5
>> Control statement insns: 2
>> Overall: 2 insns
>> Registering FSM jump thread: (5, 9) incoming edge;  (9, 3)  (3, 8)
>> nocopy; (3, 8)


> Well, so as Micha said, the threading destroys the outer loop by
> making it have two entries - we actually _do_ recover from this
> but it results in a non-empty latch of the outer loop and thus
> the loop is no longer a perfect nest.  The interchange pass has
> special-sauce to recover from the loop store motion applied
> but not to handle the kind of loop rotation the jump threading
> caused.


Understood.  The backedge into BB8 and the non-empty latch seem to cause 
problems.

> The forward threader guards against this by simply disallowing
> threadings that involve different loops.  As I see it,


The thread in question (5->9->3) is all within the same outer loop, 
though.  BTW, the backward threader also disallows threading across 
different loops (see path_crosses_loops variable).

> the threading done here should be 7->3 (outer loop entry) to bb 8
> rather than one involving the backedge.  Also note the condition
> is invariant in the loop and in fact subsumed by the condition
> outside of the loop and it should have been simplified by
> VRP after pass_ch but I see there's a jump threading pass
> in between pass_ch and the next VRP which is likely the
> problem.


A 7->3->8 thread would cross loops though, because 7 is outside the 
outer loop.

> Why does jump threading not try to simplify a condition before
> attempting to thread through it?  After all, ranger should be around?


That's a very good question ;-).  The current code does not use the 
ranger to solve unknowns.  Anything further back than the first block in 
a path will return varying.  The plan was to add this ranger solving 
functionality as a follow-up.


I have a whole slew of pending patches adding precisely this 
functionality.  We should be able to solve ranges outside the path with 
ranger, as well as relationals occurring in a path.
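
As a rough sketch of that idea (with hypothetical helper names; the pending
patches may well structure this differently), a name defined outside the
path would be resolved by the IL-wide ranger instead of returning VARYING:

  int_range_max r;
  if (!defined_in_path_p (name))        /* hypothetical helper */
    m_ranger->range_on_entry (r, path_entry_bb, name);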


However, even if there are alternate ways of threading this IL, 
something like 5->9->3 could still happen.  We need a way to disallow 
this.  I'm having a hard time determining the hammer for this.  I would 
vote for disabling threading through latches, but it seems the backward 
threader is aware of this scenario and allows it anyhow (see 
threaded_through_latch variable).  Ughh.


Thanks for looking into this.
Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Michael Matz via Gcc
Hello,

On Wed, 8 Sep 2021, Aldy Hernandez wrote:

> > The forward threader guards against this by simply disallowing 
> > threadings that involve different loops.  As I see it,
> 
> The thread in question (5->9->3) is all within the same outer loop, 
> though. BTW, the backward threader also disallows threading across 
> different loops (see path_crosses_loops variable).
> 
> > the threading done here should be 7->3 (outer loop entry) to bb 8 
> > rather than one involving the backedge.  Also note the condition is 
> > invariant in the loop and in fact subsumed by the condition outside of 
> > the loop and it should have been simplified by VRP after pass_ch but I 
> > see there's a jump threading pass in between pass_ch and the next VRP 
> > which is likely the problem.
> 
> A 7->3->8 thread would cross loops though, because 7 is outside the 
> outer loop.

...
 
> However, even if there are alternate ways of threading this IL, 
> something like 5->9->3 could still happen.  We need a way to disallow 
> this.  I'm having a hard time determining the hammer for this.  I would 
> vote for disabling threading through latches, but it seems the backward 
> threader is aware of this scenario and allows it anyhow (see 
> threaded_through_latch variable).  Ughh.

The backward threader seems to want to be careful with latches, but still 
allow it in some situations, in particular when doing so doesn't create a 
loop with non-simple latches (which is basically a single and empty latch 
block).  If this improvement under discussion leads to creating a 
non-empty latch then those checks aren't restrictive enough (anymore).

I think threading through a latch is always dubious regarding the loop 
structure, it's basically either loop rotation or iteration peeling, even 
if it doesn't cause non-simple latches.  Those transformations should 
probably be left to a loop optimizer, or be only done when destroying loop 
structure is fine (i.e. late).

Maybe it's possible to not disable threading over latches altogether in 
the backward threader (like it's tried now), but I haven't looked at the 
specific situation here in depth, so take my view only as opinion from a 
large distance :-)

Does anything break if you brutally disable any backward threading when 
any of the involved blocks is a latch when current_loops is set?  (I guess 
for that latter test to be effective you want to disable the 
loop_optimizer_init() for the "late" jump thread passes)
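
For concreteness, a minimal sketch of that brutal check, using the names
from profitable_path_p in this thread (untested, illustrative only):

  /* Reject any candidate path touching a latch while loop structures
     are live.  */
  if (current_loops)
    for (unsigned j = 0; j < m_path.length (); j++)
      if (m_path[j] == m_path[j]->loop_father->latch)
        return false;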


Ciao,
Michael.


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Michael Matz via Gcc
Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:

> > > The forward threader guards against this by simply disallowing 
> > > threadings that involve different loops.  As I see it,
> > 
> > The thread in question (5->9->3) is all within the same outer loop, 
> > though. BTW, the backward threader also disallows threading across 
> > different loops (see path_crosses_loops variable).
...
> Maybe it's possible to not disable threading over latches altogether in 
> the backward threader (like it's tried now), but I haven't looked at the 
> specific situation here in depth, so take my view only as opinion from a 
> large distance :-)

I've now looked at the concrete situation.  So yeah, the whole path is in 
the same loop, crosses the latch, _and there's code following the latch 
on that path_.  (I.e. the latch isn't the last block in the path).  In 
particular, after loop_optimizer_init() (before any threading) we have:

   [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
goto ; [89.00%]
  else
goto ; [11.00%]

  [local count: 105119324]:
...

   [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  goto ; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the 
pre-header of inner loop, but more importantly something that's not at the 
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its 
destination (i.e. where the backedge isn't the last to-be-redirected edge) 
necessarily duplicates all code from that destination onto the back edge.  
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge, 
conceptually; in reality it's simply included into the (new) latch block 
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after 
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers: 
there must be no code executed after the exit test (e.g. in the latch 
block).  (This requirement makes reasoning about which code is or isn't 
executed completely for an iteration trivial; simply everything in the 
body is always executed; e.g. loop interchange uses this to check that 
there are no memory references after the exit test, because those would 
then be only conditional and hence make loop interchange very awkward).
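
In C terms, the effect on the outer loop is roughly this (an illustration,
not compiler output):

  j = 0;
  sum_11 = c[j];          /* copy of the load on the loop-entry path */
  for (;;)
    {
      /* ... inner loop, then c[j] = sum_21; j = j + 1; ... */
      if (!(n > j))
        break;            /* exit test */
      sum_11 = c[j];      /* duplicated load, now after the exit test,
                             i.e. in the no-longer-empty latch */
    }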

Note that this situation can't be later rectified anymore: the duplicated 
instructions (because they are memory refs) must remain after the exit 
test.  Only by rerolling/unrotating the loop (i.e. noticing that the 
memory refs on the loop-entry path and on the back edge are equivalent) 
would that be possible, but that's something we aren't capable of.  Even 
if we were that would simply just revert the whole work that the threader 
did, so it's better to not even do that to start with.

I believe something like below would be appropriate; it disables threading 
if the path contains a latch at the non-last position (due to being 
backwards on the non-first position in the array).  I.e. it disables 
rotating the loop if there's danger of polluting the back edge.  It might 
be improved if the blocks following (preceding!) the latch are themselves 
empty because then no code is duplicated.  It might also be improved if 
the latch is already non-empty.  That code should probably only be active 
before the loop optimizers, but currently the backward threader isn't 
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Ciao,
Michael.

diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const 
vec<basic_block> &m_path,
   loop_p loop = m_path[0]->loop_father;
   bool path_crosses_loops = false;
   bool threaded_through_latch = false;
+  bool latch_within_path = false;
   bool multiway_branch_in_path = false;
   bool threaded_multiway_branch = false;
   bool contains_hot_bb = false;
@@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const 
vec<basic_block> &m_path,
 	 the last entry in the array when determining if we thread
 	 through the loop latch.  */
       if (loop->latch == bb)
-	threaded_through_latch = true;
+	{
+	  threaded_through_latch = true;
+	  if (j != 0)
+	    latch_within_path = true;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, " (latch)");
+	}
     }
 
   gimple *stmt = get_gimple_control_stmt (m_path[0]);
@@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const 
vec<basic_block> &m_path,
 "a multiway branch.\n");
   return false;
 }
+
+  if (latch_within_path)
+{
+  if 

Some questions on hash_set/hash_map traits

2021-09-08 Thread Erick Ochoa via Gcc
Hello,

I am refactoring some old code that runs as an IPA_PASS. This code is
intended to run at link-time using the LTO framework and so I am
always testing it that way. At the moment, I am trying to use
hash-maps but having some difficulties. I am adding some patches that
will help illustrate along the way.

The first patch simply adds a link-time optimization pass that will be
used to demo the problem that I am having. One can run with -flto
-fipa-hash. During the first patch, the pass does not do any
meaningful work but it just helps to set up the stage.

For the second patch I create a wrapper around a symtab_node. This
wrapper is intended to add some other functionality to elements that I
might want to insert in hash-sets or hash-maps.

class symtab_wrapper
{
private:
  symtab_node *_symtab_node;
public:
  friend symtab_wrapper_traits;
  symtab_wrapper ()
: _symtab_node (NULL)
  {}
  symtab_wrapper (symtab_node *s)
: _symtab_node (s)
  { gcc_assert (s); }

  void print (void)
  {
if (!dump_file) return;
gcc_assert (_symtab_node);
fprintf (dump_file, "%s\n", _symtab_node->name ());
  }
};

I also have a wrapper around a hash-set to add some things that might
be of interest to me in the future:

template <typename KeyId, typename Traits = default_hash_traits <KeyId> >
class my_set_t
{
private:
  hash_set <KeyId, false, Traits> _impl;
public:
  my_set_t () {}

  void insert (const KeyId &k)
  {
_impl.add (k);
  }

  bool contains (const KeyId &k)
  {
return _impl.contains (k);
  }

  void print (void)
  {
if (!dump_file) return;
for (auto i = _impl.begin (), e = _impl.end (); i != e; ++i)
{
  (*i).print ();
}
  }
};

I also created a symtab_wrapper_traits because I want to store the
object in the hash-map itself, and so I need to specify how to hash
it.

class symtab_wrapper_traits
{
public:
  typedef symtab_wrapper value_type;
  typedef symtab_wrapper compare_type;
  static const bool empty_zero_p = true;

  static inline hashval_t hash (const value_type &v)
  {
return pointer_hash <symtab_node>::hash (v._symtab_node);
  }

  static bool equal (const value_type &l, const value_type &r)
  {
return l._symtab_node == r._symtab_node;
  }

  static void mark_deleted (value_type &v)
  {
v._symtab_node = NULL;
  }

  static void mark_empty (value_type &v)
  {
v._symtab_node = NULL;
  }

  static void remove (value_type &v)
  {
v._symtab_node = NULL;
  }

  static bool is_deleted (const value_type &v)
  {
 return !v._symtab_node;
  }

  static bool is_empty (const value_type &v)
  {
 return !v._symtab_node;
  }
};

I then create a global set with my traits and populate it with some
information and later print. It seems to be working:

my_set_t <symtab_wrapper, symtab_wrapper_traits> my_set;

 static void
 ipa_hash_generate_summary (void)
 {
  cgraph_node *cnode = NULL;
  FOR_EACH_FUNCTION (cnode)
  {
symtab_wrapper w (cnode);
my_set.insert (w);
  }

  my_set.print ();

  return;
 }

Here, I already have some questions, but this is not the biggest issue
that I am having: I believe that the hash_set implementation manages
its own memory, but I am unclear about the semantics of "removed".

* Is "removed" supposed to, for example, call the destructor? Since,
in this particular case, it is only a wrapper and cgraph_node is not
intended to be destroyed, I just assign the pointer to NULL. Not sure
if that's the intention.
* I don't understand the semantics of empty_zero_p; I think it is used
to zero out deleted entries, or entries that I mark as empty?
* I am assuming that its memory will be re-used, is this the case?
* I see that there is a typed_delete_remove function that is sometimes
used in traits, but I think that it doesn't apply to this case because
I am storing the whole object. Is that the case?

I later did some refactoring (patch 3), which did not seem to impact
the code now, but will impact hash_map later. The refactoring is as
follows, I create an abstract "printable" class

class printable
{
public:
  virtual char* str () = 0;
  void print (void);
};

void
printable::print (void)
{
  if (!dump_file) return;
  char* _str = str ();
  fprintf (dump_file, "%s\n", _str);
  free (_str);
}

Make symtab_wrapper a publically derived class

-class symtab_wrapper
+class symtab_wrapper : public printable

and implemented the str function in symtab_wrapper:

  virtual char* str (void)
  {
    if (!dump_file) return NULL;
    gcc_assert (_symtab_node);
    char *retval = NULL;
    asprintf (&retval, "%s", _symtab_node->name ());
    return retval;
  }

Again, this seems to be working correctly.

What is more interesting is moving from a hash-set to a hash-map. On
the fourth patch, I implemented the hash_map equivalent to the
hash_set that I am describing above:

template <typename KeyId, typename Value, typename Traits>
class my_map_t
{
private:
  hash_map <KeyId, Value, Traits> _impl;
public:
  my_map_t () {}

  void insert (const KeyId &k, const Value &v)
  {
_impl.put (k, v);
  }

  Value *select (const KeyId &k)
  {
return _impl.get (k);
  }

  void print (void)
  {
if (!dump_file) return;
for (auto i = _impl.begin (), e = 

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Richard Biener via Gcc
On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:
>
> Hello,
>
> [lame answer to self]
>
> On Wed, 8 Sep 2021, Michael Matz wrote:
>
> > > > The forward threader guards against this by simply disallowing
> > > > threadings that involve different loops.  As I see it,
> > >
> > > The thread in question (5->9->3) is all within the same outer loop,
> > > though. BTW, the backward threader also disallows threading across
> > > different loops (see path_crosses_loops variable).
> ...
> > Maybe it's possible to not disable threading over latches altogether in
> > the backward threader (like it's tried now), but I haven't looked at the
> > specific situation here in depth, so take my view only as opinion from a
> > large distance :-)
>
> I've now looked at the concrete situation.  So yeah, the whole path is in
> the same loop, crosses the latch, _and there's code following the latch
> on that path_.  (I.e. the latch isn't the last block in the path).  In
> particular, after loop_optimizer_init() (before any threading) we have:
>
>[local count: 118111600]:
>   # j_19 = PHI 
>   sum_11 = c[j_19];
>   if (n_10(D) > 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>   [local count: 105119324]:
> ...
>
>[local count: 118111600]:
>   # sum_21 = PHI 
>   c[j_19] = sum_21;
>   j_13 = j_19 + 1;
>   if (n_10(D) > j_13)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>   goto ; [100.00%]
>
> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> pre-header of inner loop, but more importantly something that's not at the
> start of the outer loop.
>
> Now, any thread that includes the backedge 9->3 _including_ its
> destination (i.e. where the backedge isn't the last to-be-redirected edge)
> necessarily duplicates all code from that destination onto the back edge.
> Here it's the load from c[j] into sum_11.
>
> The important part is the code is emitted onto the back edge,
> conceptually; in reality it's simply included into the (new) latch block
> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
> cfg_cleanup).
>
> That's what we can't have for some of our structural loop optimizers:
> there must be no code executed after the exit test (e.g. in the latch
> block).  (This requirement makes reasoning about which code is or isn't
> executed completely for an iteration trivial; simply everything in the
> body is always executed; e.g. loop interchange uses this to check that
> there are no memory references after the exit test, because those would
> then be only conditional and hence make loop interchange very awkward).
>
> Note that this situation can't be later rectified anymore: the duplicated
> instructions (because they are memory refs) must remain after the exit
> test.  Only by rerolling/unrotating the loop (i.e. noticing that the
> memory refs on the loop-entry path and on the back edge are equivalent)
> would that be possible, but that's something we aren't capable of.  Even
> if we were that would simply just revert the whole work that the threader
> did, so it's better to not even do that to start with.
>
> I believe something like below would be appropriate; it disables threading
> if the path contains a latch at the non-last position (due to being
> backwards on the non-first position in the array).  I.e. it disables
> rotating the loop if there's danger of polluting the back edge.  It might
> be improved if the blocks following (preceding!) the latch are themselves
> empty because then no code is duplicated.  It might also be improved if
> the latch is already non-empty.  That code should probably only be active
> before the loop optimizers, but currently the backward threader isn't
> differentiating between before/after loop-optims.
>
> I haven't tested this patch at all, except that it fixes the testcase :)

Lame comment at the current end of the thread - it's not threading through the
latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

  e2 = path->last ()->e;
  if (!e2 || noloop_only)
{
  /* If NOLOOP_ONLY is true, we only allow threading through the
 header of a loop to exit edges.  */

  /* One case occurs when there was loop header buried in a jump
 threading path that crosses loop boundaries.  We do not try
 and thread this elsewhere, so just cancel the jump threading
 request by clearing the AUX field now.  */
  if (bb->loop_father != e2->src->loop_father
  && (!loop_exit_edge_p (