Re: s390 port

2021-09-07 Thread Joe Monk via Gcc
It is unclear how this would even work.

For instance, the LA instruction clears the top bit.

Also, instructions like LPR, LNR, BXLE, BXH all treat the value in the
register as signed, so the top bit is not available.

Joe


does aarch64-w64-mingw32 triplet exist??

2021-09-07 Thread sotrdg sotrdg via Gcc
I am curious whether binutils-gdb/gcc support arm architectures of mingw32.

Sent from Mail for Windows



More aggressive threading causing loop-interchange-9.c regression

2021-09-07 Thread Aldy Hernandez via Gcc

Hi folks.

I have a pending patch for the path solver that pulls in global ranges 
when available (the stuff in SSA_NAME_RANGE_INFO).  In doing so, I have 
run into a regression I was hoping the loop experts could pontificate on.


The regression comes from the simple_reduc_1() function in 
tree-ssa/loop-interchange-9.c, and it involves the following path:


=== BB 5 
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
 j_13 : j_19(I)
n_10(D) int [1, 257]
j_19int [0, 256]
Relational : (j_13 > j_19)
 [local count: 118111600]:
# sum_21 = PHI 
c[j_19] = sum_21;
j_13 = j_19 + 1;
if (n_10(D) > j_13)
  goto ; [89.00%]
else
  goto ; [11.00%]

j_13 : int [1, 257]
5->9  (T) n_10(D) :  int [2, 257]
5->9  (T) j_13 : int [1, 256]
5->9  (T) j_19 : int [0, 255]
5->6  (F) n_10(D) :  int [1, 257]
5->6  (F) j_13 : int [1, 257]
5->6  (F) j_19 : int [0, 256]

=== BB 9 
n_10(D) int [2, 257]
j_13int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
 [local count: 105119324]:
goto ; [100.00%]


=== BB 3 
Imports: n_10(D)
Exports: n_10(D)
n_10(D) int [1, +INF]
 [local count: 118111600]:
# j_19 = PHI 
sum_11 = c[j_19];
if (n_10(D) > 0)
  goto ; [89.00%]
else
  goto ; [11.00%]

With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that 
the loop cannot legally iterate outside the size of c[256].  So, j_13 
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.  This 
allows us to thread BB3 to BB8.  Doing so, causes the test to fail here:


/* { dg-final { scan-tree-dump-times "Loop_pair is 
interchanged" 1 "linterchange" } } */


All the blocks lie within the same loop, and the path passes all the 
tests in path_profitable_p().


Is there something about the shape of this path that should make it 
invalid in the backward threader, or should the test be marked with 
-fdisable-tree-threadN (or something else entirely)?  Note that the 
backward threader is allowed to peel through loop headers.


I am attaching the gimple dump as well as the graphviz output for easier 
analysis.


Thanks.
Aldy
__attribute__((noinline))
void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

   [local count: 14598063]:
  if (n_10(D) > 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 12992276]:

   [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 955630225]:
  # sum_20 = PHI 
  # i_22 = PHI 
  _1 = a[i_22][j_19];
  _2 = b[i_22][j_19];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 850510901]:
  goto ; [100.00%]

   [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  goto ; [100.00%]

   [local count: 14598063]:
  return;

}


graph.ps
Description: PostScript document


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-07 Thread Michael Matz via Gcc
Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:

> The regression comes from the simple_reduc_1() function in
> tree-ssa/loop-interchange-9.c, and it involves the following path:
> 
> === BB 5 
> Imports: n_10(D)  j_19
> Exports: n_10(D)  j_13  j_19
>  j_13 : j_19(I)
> n_10(D)   int [1, 257]
> j_19  int [0, 256]
> Relational : (j_13 > j_19)
>  [local count: 118111600]:
> # sum_21 = PHI 
> c[j_19] = sum_21;
> j_13 = j_19 + 1;
> if (n_10(D) > j_13)
>   goto ; [89.00%]
> else
>   goto ; [11.00%]

So, this is the outer loops exit block ...

> === BB 9 
> n_10(D)   int [2, 257]
> j_13  int [1, 256]
> Relational : (n_10(D) > j_19)
> Relational : (n_10(D) > j_13)
>  [local count: 105119324]:
> goto ; [100.00%]

... this the outer loops latch block ...

> === BB 3 
> Imports: n_10(D)
> Exports: n_10(D)
> n_10(D)   int [1, +INF]
>  [local count: 118111600]:
> # j_19 = PHI 
> sum_11 = c[j_19];
> if (n_10(D) > 0)
>   goto ; [89.00%]
> else
>   goto ; [11.00%]

... and this the outer loops header, as well as inner loops pre-header.
The attempted thread hence involves a back-edge (of the outer loop) and a 
loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would 
destroy some properties that our loop optimizers rely on, e.g. that the 
loop header of a natural loop is only reached by two edges: entry edge and 
back edge, and that the latch blocks are empty and that there's only a 
single latch.  (What exactly we require depends on several flags in 
loops_state_satisfies_p).

> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that 
> the loop cannot legally iterate outside the size of c[256].  So, j_13 
> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.  
> This allows us to thread BB3 to BB8.

So, IIUC doing this threading would create a new entry to BB8: it would 
then be entered by its natural entry (bb3), by its natural back edge 
(whatever bb that is now) and the new entry from the threaded outer back 
edge (bb9 or bb5?).

The inner loop wouldn't then be recognized as natural anymore and the 
whole nest not as perfect, and hence loop interchange can't easily happen 
anymore.  Most other structural loop optimizers of us would have problems 
with that situation as well.

> All the blocks lie within the same loop, and the path passes all the 
> tests in path_profitable_p().
> 
> Is there something about the shape of this path that should make it 
> invalid in the backward threader, or should the test be marked with 
> -fdisable-tree-threadN (or something else entirely)?

This is a functionality test checking that the very necessary interchange 
in this test does happen with default plus -floop-interchange (with the 
intention of it being enabled by O3 or with profile info).  So no 
additional flags can be added without changing the meaning of this test.

> Note that the 
> backward threader is allowed to peel through loop headers.

Something needs to give way in the path threaders before loop 
optimizations: either threading through back edges, through loop latches 
or through loop headers needs to be disabled.  I think traditionally at 
least threading through latches should be disabled, because doing so 
usually destroys simple loop structure.  I see that profitable_path_p() of 
the backward threader wants to be careful in some situations involving 
loops and latches; possibly it's not careful enough yet for the additional 
power brought by ranger.

See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is 
doing when loops are active.

After the structural loop optimizers the threader can go wild and thread 
everything it finds.


Ciao,
Michael.


Re: s390 port

2021-09-07 Thread Paul Edwards via Gcc
Hi Joe.

Thanks for your comments.

> It is unclear how this would even work. 

> For instance, the LA instruction clears the top bit.

In AM64, LA does not clear any bits.

> Also, instructions like LPR, LNR,

These operate on data registers, not addresses,
and will continue to work unchanged.

> BXLE, BXH all treat the value in the register as signed,
> so the top bit is not available.

These are already a problem if you are putting
addresses in them and it is approaching the 2 GiB
mark. The POP has a special mention of that.

Fun fact: The z/Arch POP has the same problem with
the G version of those instructions, when it hits the
63-bit mark, but the POP incorrectly states that the
problem occurs near the 64-bit mark. I reported the
problem with the POP but nothing seems to have
been done.

The solution is to drop these instructions from the
repertoire. C-generated assembler for both i370 and
s390 targets does not use these.

BFN. Paul.