[Bug rtl-optimization/93974] [10 Regression] ICE in decompose_normal_address, at rtlanal.c:6403 on powerpc64le-linux-gnu since r10-6762

2020-03-06 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93974

--- Comment #15 from rsandifo at gcc dot gnu.org  
---
(In reply to Peter Bergner from comment #14)
> (In reply to Vladimir Makarov from comment #13)
> > Sorry, I have no good knowledge of decompose_address.  The original author
> > is Richard Sandiford and it is even not in RA sources (it is in rtlanal.c).
> > 
> > I'd recommend to ask Richard may be he has some insights or just modify to
> > make it work in this case.  I can only say it is a sensitive code which can
> > affect a lot of targets if something is not done right. We had such problems
> > with this code in the past.
> 
> Ah, ok, CCing Richard.
> 
> Richard, can you comment on the findings in Comment #4?  Do we just need to
> add support for seeing a (plus (and: ) (const int 8)) type of address or
> should we have never created anything like that in the first place or ???

Sorry for the slow reply, been a bit of a hectic week.

I think both fixes would be valid.  Like you say, the address
parsing code isn't yet ready to handle addresses that apply
an offset *before* the address "mutations".  That's because
no target has yet wanted to support such an address.  So as
things stand, the current address is not valid and shouldn't
have been created.

In some ways this is similar to creating an invalid highpart
subreg for the upper word of a doubleword vector hard register,
or a subreg that falls foul of some simplify_subreg_regno rule.
The RA has code to avoid doing that, see init_subregs_of_mode
and its users.  We could do something similar here for
REG_EQUIV MEMs.  One option would be to key off whether
strip_address_mutations is a no-op on the address.
Another would be to check whether each required sub-MEM
is valid for its mode and offset.  The latter would be
more elaborate but might produce better code in general,
not just for cases like this.

Like you say in comment 4, even the zero-offset half isn't
actually a valid address for PowerPC, so either of the two
options should give better code as well as fixing the bug.

On the other hand, the idea was always that address_info
and its users could be extended if new targets have new
requirements.  So if we want to make this operation valid
then that would be OK too.

[Bug target/94072] [10 Regression] ICE: SIGSEGV due to infinite recursion in expand_expr/expand_expr_real_1 with -msve-vector-bits=512 since r10-4197

2020-03-17 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94072

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
Mine.

[Bug target/94201] aarch64:ICE in tiny code model for ilp32

2020-03-19 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94201

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Resolution|--- |FIXED
  Component|c   |target
 Status|NEW |RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on master.

[Bug target/94072] [10 Regression] ICE: SIGSEGV due to infinite recursion in expand_expr/expand_expr_real_1 with -msve-vector-bits=512 since r10-4197

2020-03-20 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94072

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
Fixed on master.

[Bug tree-optimization/94261] New: [10 Regression] ICE in vect_get_vec_def_for_operand_1 for 3-element condition reduction

2020-03-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94261

Bug ID: 94261
   Summary: [10 Regression] ICE in vect_get_vec_def_for_operand_1
for 3-element condition reduction
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

The following test ICEs when compiled with
-O3 -ffinite-math-only -march=armv8.2-a+sve
on aarch64:

void
f (float *x, float *y, float z)
{
  float res0 = 0, res1 = 1, res2 = 2;
  for (int i = 0; i < 100; ++i)
{
  res0 = 100 <= y[i * 3] ? res0 : z;
  res1 = 101 <= y[i * 3 + 1] ? res1 : z;
  res2 = y[i * 3 + 2] < 102 ? z : res2;
}
  x[0] = res0;
  x[1] = res1;
  x[2] = res2;
}

The ICE is:

b.c:2:1: internal compiler error: in vect_get_vec_def_for_operand_1, at
tree-vect-stmts.c:1555
2 | f (float *x, float *y, float z)
  | ^
0x10a86c3 vect_get_vec_def_for_operand_1(_stmt_vec_info*, vect_def_type)
gcc/tree-vect-stmts.c:1555
0x10af8e1 vect_get_vec_def_for_operand(tree_node*, _stmt_vec_info*, tree_node*)
gcc/tree-vect-stmts.c:1617
0x10b258f vectorizable_condition
gcc/tree-vect-stmts.c:10213
0x10d01ea vect_transform_stmt(_stmt_vec_info*, gimple_stmt_iterator*,
_slp_tree*, _slp_instance*)
gcc/tree-vect-stmts.c:11012
0x10d4226 vect_transform_loop_stmt
gcc/tree-vect-loop.c:8307
0x10ec31a vect_transform_loop(_loop_vec_info*, gimple*)
gcc/tree-vect-loop.c:8707
0x110f797 try_vectorize_loop_1
gcc/tree-vectorizer.c:990
0x1110319 vectorize_loops()
gcc/tree-vectorizer.c:1126

The problem comes from trying both SVE and Advanced SIMD and
eventually picking SVE.  The SVE version can't treat the loop
as a single 3-element SLP reduction because we don't yet
support loading { 100, 101, 102 } repeating for variable-length
vectors.  We therefore fail the SLP build before doing anything
about the awkward comparison arrangement.  With this version,
each ?: is a separate reduction and each one has its own,
correct, STMT_VINFO_REDUC_IDX.

But then we try the Advanced SIMD version.  This doesn't have
a problem with loading the constants, so we get as far as
dealing with mismatched comparisons:

  /* Swap.  */
  if (*swap == 1)
{
  swap_ssa_operands (stmt, &TREE_OPERAND (cond, 0),
 &TREE_OPERAND (cond, 1));
  TREE_SET_CODE (cond, swap_tree_comparison (code));
}
  /* Invert.  */
  else
{
  swap_ssa_operands (stmt, gimple_assign_rhs2_ptr (stmt),
 gimple_assign_rhs3_ptr (stmt));
  if (STMT_VINFO_REDUC_IDX (stmt_info) == 1)
STMT_VINFO_REDUC_IDX (stmt_info) = 2;
  else if (STMT_VINFO_REDUC_IDX (stmt_info) == 2)
STMT_VINFO_REDUC_IDX (stmt_info) = 1;
  bool honor_nans = HONOR_NANS (TREE_OPERAND (cond, 0));
  code = invert_tree_comparison (TREE_CODE (cond), honor_nans);
  gcc_assert (code != ERROR_MARK);
  TREE_SET_CODE (cond, code);
}

But these changes to the gimple stmt persist even if the SLP build
fails later, or if the SLP build succeeds and we decide not to
vectorise that way.  That doesn't matter too much for the current
loop_vinfo, because the STMT_VINFO_REDUC_IDXs are still consistent
with the gimple stmt.  But it means that the STMT_VINFO_REDUC_IDXs
for the older SVE loop_vinfo are now no longer correct.

The ICE triggers when we go back to the SVE loop_vinfo as the
cheapest implementation and try to code-generate the reduction.

The testcase is reduced from 481.wrf compiled with LTO.

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
 CC||rsandifo at gcc dot gnu.org
   Last reconfirmed||2020-03-22

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Mine.

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
The cycling comes from reloading:

(insn 7 6 8 2 (set (reg:SD 122 [ a32 ])
(mem/c:SD (reg/f:DI 120) [1 a32+0 S4 A32]))
"gcc/testsuite/gcc.target/powerpc/pr39902-2.c":15:13 516 {movsd_hardfloat}
 (expr_list:REG_DEAD (reg/f:DI 120)
(nil)))

r122 is assigned an FPR, and the power6 pattern doesn't provide
any alternatives that load from memory into FPRs, so we convert
this into a secondary memory reload.

Doing that for SDmode memory would cycle, but the rs6000 port has:

/* Implement TARGET_SECONDARY_RELOAD_NEEDED_MODE.  For SDmode values we 
   need to use DDmode, in all other cases we can use the same mode.  */
static machine_mode
rs6000_secondary_memory_needed_mode (machine_mode mode)
{
  if (lra_in_progress && mode == SDmode)
return DDmode;
  return mode;
}

which says that the move should happen in DDmode instead.
This means that the eventual FPR reload will happen in DDmode
rather than SDmode.

The problem is that rs6000_can_change_mode_class doesn't allow
FPRs to change from SDmode to DDmode:

  if (from_size < 8 || to_size < 8)
return false;

So there seems to be a contradiction here: secondary memory
reloads for FPRs have to happen in DDmode rather than SDmode,
but FPRs aren't allowed to change to DDmode from SDmode.

Previously this worked because LRA ignored
rs6000_can_change_mode_class and changed the mode of the
FPR regardless.  I guess that must have been the right
thing to do in context, but it would be good to pin down
exactly why the SD->DD mode change is OK for rs6000 in the
specific context of secondary memory reloads but not otherwise.

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

--- Comment #5 from rsandifo at gcc dot gnu.org  
---
(In reply to Zdenek Sojka from comment #1)
> I observe the same issue, and it breaks libgcc build for me:

What configure arguments do you use?

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
Created attachment 48080
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48080&action=edit
Proof-of-concept hack to back up the point in comment 4

This hack shows what I mean in comment 4.  It "fixes" the three
testcases but almost certainly isn't correct.  The point is that
before r10-7312 we did the mode change regardless of what
rs6000_can_change_mode_class said.

I don't know enough about powerpc to know what cases:

  if (from_size < 8 || to_size < 8)
return false;

is handling.

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

--- Comment #9 from rsandifo at gcc dot gnu.org  
---
(In reply to Segher Boessenkool from comment #8)
> SFmode values are stored as DP IEEE float normally.  There may be other
> cases as well, but this is the obvious one.

OK, thanks.  Does the same problem affect decimal floats (i.e. SD an DD)?

If not, maybe the answer really is to allow these cases through,
but probably without the lra_in_progress check in the hack/patch.

[Bug tree-optimization/94261] [10 Regression] ICE in vect_get_vec_def_for_operand_1 for 3-element condition reduction

2020-03-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94261

--- Comment #5 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #3)
> (In reply to Richard Biener from comment #2)
> > when placing gcc_unreachable () at the swapping place and most testcases
> > still pass when removing the IL operand swapping, only vect-cselim-1.c
> > runfails (investigating).
> 
> Ah, SLP ultimately fails here so the scalar IL is broken by the operation
> code adjustment.  In the end we should be able to avoid doing tree code
> adjustments as well since we're swapping to make the code the same
> as the first stmts code and we only ever look at the first scalar stmt
> during SLP code-gen/analysis. 

Yeah, started wondering about that later too...

> So, it appears to work to comment _all_ of the code in
> vect_get_and_check_slp_defs (including the reduc-idx adjustment).
> 
> But again, test coverage is probably bad.

Would be great if it works.  Will give it a spin later on SVE,
although I guess in some ways that probably exercises SLP even
less (at least in length-agnostic mode).

[Bug tree-optimization/94261] [10 Regression] ICE in vect_get_vec_def_for_operand_1 for 3-element condition reduction

2020-03-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94261

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #6)
> (In reply to rsand...@gcc.gnu.org from comment #5)
> > (In reply to Richard Biener from comment #3)
> > > (In reply to Richard Biener from comment #2)
> > > > when placing gcc_unreachable () at the swapping place and most testcases
> > > > still pass when removing the IL operand swapping, only vect-cselim-1.c
> > > > runfails (investigating).
> > > 
> > > Ah, SLP ultimately fails here so the scalar IL is broken by the operation
> > > code adjustment.  In the end we should be able to avoid doing tree code
> > > adjustments as well since we're swapping to make the code the same
> > > as the first stmts code and we only ever look at the first scalar stmt
> > > during SLP code-gen/analysis. 
> > 
> > Yeah, started wondering about that later too...
> 
> So there's probably a single path where this is not true which is when we
> go through vect_attempt_slp_rearrange_stmts thus when we have a SLPed
> set of cond-reductions.  I'm trying to come up with a testcase ...
> 
> But I think the cure here should be to punt on vect_slp_rearrange_stmts,
> do you agree?

You know SLP much better than me :-)  But yeah, that sounds good.
Would we need to punt even for swap==1, or just swap==2?

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

  Attachment #48080|0   |1
is obsolete||

--- Comment #11 from rsandifo at gcc dot gnu.org  
---
Created attachment 48088
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48088&action=edit
Candidate patch

OK, thanks. This version drops the lra_in_progress hack and
makes the test symmetric.  As mentioned in the commit
message, I'd wondered about generalising it further and
allowing any 64-bit mode, but that seems unnecessarily
dangerous for stage 4.

The previous hack passed bootstrap & regtest on gcc112.
I'll try with this version too.

[Bug target/94254] [10 regression] r10-7312 causes compiler hangs

2020-03-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94254

--- Comment #13 from rsandifo at gcc dot gnu.org  
---
(In reply to Segher Boessenkool from comment #12)
> That patch looks fine, thank you!
> 
> Is there some way you can test if this works?  Ideally in the testsuite
> of course, but that can be hard :-/

I've now tried doing a bootstrap & regtest with --with-cpu-power6 as well,
comparing without the lra patch with after both patches.  (With master the
build hangs in the same way as it did for Zdenek.)  There didn't seem
to be any differences in test results.

Hopefully that will have given some execution test coverage when
running things like gcc.dg/dfp and c-c++-common/dfp.

[Bug tree-optimization/94398] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #2)
> But the ICE happens because the result from the function at transform time
> does not match that at analysis time.
> 
> Richard?

Looks like we're trying to compute:

  alignment_support_scheme
= vect_supportable_dr_alignment (first_dr_info, false);
  gcc_assert (alignment_support_scheme);

even for VMAT_GATHER_SCATTER, which always accesses individual
elements.  Guess we should set alignment_support_scheme to
dr_unaligned_supported instead of calling
vect_supportable_dr_alignment.

The target hook is probably incorrect for SVE + -mstrict-align.

[Bug tree-optimization/94398] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-03-30
 Status|UNCONFIRMED |NEW

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
(In reply to rsand...@gcc.gnu.org from comment #3)
> (In reply to Richard Biener from comment #2)
> > But the ICE happens because the result from the function at transform time
> > does not match that at analysis time.
> > 
> > Richard?
> 
> Looks like we're trying to compute:
> 
>   alignment_support_scheme
> = vect_supportable_dr_alignment (first_dr_info, false);
>   gcc_assert (alignment_support_scheme);
> 
> even for VMAT_GATHER_SCATTER, which always accesses individual
> elements.  Guess we should set alignment_support_scheme to
> dr_unaligned_supported instead of calling
> vect_supportable_dr_alignment.

...in the gather/scatter case only, of course :-)

[Bug rtl-optimization/92989] [10 Regression] The mips-mti-linux-gnu fails to build after r276327

2020-04-03 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92989

--- Comment #10 from rsandifo at gcc dot gnu.org  
---
(In reply to Jakub Jelinek from comment #9)
> Richard, could you please have a look again?

OK, I'll try to get to this early next week.

[Bug rtl-optimization/92989] [10 Regression] The mips-mti-linux-gnu fails to build after r276327

2020-04-03 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92989

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|REOPENED|ASSIGNED

[Bug rtl-optimization/92989] [10 Regression] The mips-mti-linux-gnu fails to build after r276327

2020-04-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92989

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/piperma
   ||il/gcc-patches/2020-April/5
   ||43302.html

--- Comment #11 from rsandifo at gcc dot gnu.org  
---
Patch posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543302.html

[Bug rtl-optimization/92989] [10 Regression] The mips-mti-linux-gnu fails to build after r276327

2020-04-06 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92989

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from rsandifo at gcc dot gnu.org  
---
I think it really has been fixed this time.  I was able to build a
mips-mti-linux-gnu toolchain to completion (with libsanitizer disabled).

Sorry for the runaround.

[Bug rtl-optimization/94605] New: [8/9/10 Regression] ICE in early-remat.c:process_block with multi-output asms

2020-04-15 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94605

Bug ID: 94605
   Summary: [8/9/10 Regression] ICE in early-remat.c:process_block
with multi-output asms
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling the following testcase with -O2 -march=armv8.2-a+sve:

typedef int v8si __attribute__((vector_size(32)));
int g (v8si, v8si);

void
f (void)
{
  v8si x = {}, y = {};
  while (g (x, y))
asm ("" : "+w" (x), "+w" (y));
}

triggers:

internal compiler error: in process_block, at early-remat.c:2051

This is because process_block doesn't handle insns that set
multiple candidate registers.

[Bug rtl-optimization/94605] [8/9/10 Regression] ICE in early-remat.c:process_block with multi-output asms

2020-04-15 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94605

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Target Milestone|--- |8.4
 Ever confirmed|0   |1
   Last reconfirmed||2020-04-15

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Have a patch.

[Bug target/94606] [10 Regression] ICE creating fixed-length SVE predicate

2020-04-15 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94606

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2020-04-15

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Mine.

[Bug target/94606] New: [10 Regression] ICE creating fixed-length SVE predicate

2020-04-15 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94606

Bug ID: 94606
   Summary: [10 Regression] ICE creating fixed-length SVE
predicate
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling the following testcase with -O3 -march=armv8.2-a+sve
-msve-vector-bits=256:

const short mask[] = { 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 1, 1, 1, 1, 1 };

int
foo (short *restrict x, short *restrict y)
{
  for (int i = 0; i < 16; ++i)
if (mask[i])
  x[i] += y[i];
}

gives:

error: unrecognizable insn:
   10 | }
  | ^
(insn 10 9 11 2 (set (reg:VNx16BI 105)
(and:VNx16BI (xor:VNx16BI (reg:VNx8BI 103)
(reg:VNx16BI 104))
(reg:VNx16BI 104))) "/tmp/bar.c":9:12 -1
 (nil))
during RTL pass: vregs

because the first operand to the xor has the wrong mode.

[Bug target/94606] [10 Regression] ICE creating fixed-length SVE predicate

2020-04-16 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94606

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on master.

[Bug target/94668] New: [10 Regression] ICE generating float vec_inits since r10-808

2020-04-20 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94668

Bug ID: 94668
   Summary: [10 Regression] ICE generating float vec_inits since
r10-808
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling the following with -O -march=armv8.2-a+sve -msve-vector-bits=512:

typedef float v16sf __attribute__ ((vector_size(64)));
v16sf
foo (float a)
{
  return (v16sf) { 0, 0, 0, a, 0, 0, 0, 0, 0, a, 0, 0, 0, 0, 0, 0 };
}

gives:

foo.c:5:10: internal compiler error: in decompose, at rtl.h:2296
5 |   return (v16sf) { 0, 0, 0, a, 0, 0, 0, 0, 0, a, 0, 0, 0, 0, 0, 0 };
  |  ^
0xd8db39 wi::int_traits >::decompose(long*,
unsigned int, std::pair const&)
src/gcc/gcc/rtl.h:2296
0xd8db39 wide_int_ref_storage::wide_int_ref_storage
>(std::pair const&, unsigned int)
src/gcc/gcc/wide-int.h:1034
0xd8db39 generic_wide_int
>::generic_wide_int >(std::pair const&, unsigned int)
src/gcc/gcc/wide-int.h:790
0xd8db39 wi::binary_traits,
std::pair, wi::int_traits >::precision_type, wi::int_traits >::precision_type>::result_type wi::sub, std::pair >(std::pair const&, std::pair const&)
src/gcc/gcc/wide-int.h:2510
0xd8db39 rtx_vector_builder::step(rtx_def*, rtx_def*) const
src/gcc/gcc/rtx-vector-builder.h:122
0xd8db39 vector_builder::elt(unsigned int) const
src/gcc/gcc/vector-builder.h:254
0x1216012 aarch64_sve_expand_vector_init_handle_trailing_constants
src/gcc/gcc/config/aarch64/aarch64.c:18393
0x1218fdc aarch64_sve_expand_vector_init
src/gcc/gcc/config/aarch64/aarch64.c:18517
0x1219248 aarch64_sve_expand_vector_init
src/gcc/gcc/config/aarch64/aarch64.c:18558
0x121926d aarch64_sve_expand_vector_init
src/gcc/gcc/config/aarch64/aarch64.c:18562
0x1219710 aarch64_sve_expand_vector_init(rtx_def*, rtx_def*)
src/gcc/gcc/config/aarch64/aarch64.c:18601
0x163d4eb gen_vec_initvnx16qiqi(rtx_def*, rtx_def*)
src/gcc/gcc/config/aarch64/aarch64-sve.md:2535
0x9c9cde insn_gen_fn::operator()(rtx_def*, rtx_def*) const
src/gcc/gcc/recog.h:317
0x9c9cde store_constructor
src/gcc/gcc/expr.c:6965
0x9cc8fc expand_constructor
src/gcc/gcc/expr.c:8279
0x9b35b9 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
src/gcc/gcc/expr.c:10388
0x9b5038 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
src/gcc/gcc/expr.c:10058
0x9c404a store_expr(tree_node*, rtx_def*, int, bool, bool)
src/gcc/gcc/expr.c:5752
0x9c5d67 expand_assignment(tree_node*, tree_node*, bool)
src/gcc/gcc/expr.c:5514
0x852437 expand_gimple_stmt_1
src/gcc/gcc/cfgexpand.c:3749
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug target/94668] [10 Regression] ICE generating float vec_inits since r10-808

2020-04-20 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94668

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
   Last reconfirmed||2020-04-20

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
The problem is that the rtx_vector_builder arguments are the
wrong way around.  Testing a patch.

[Bug target/94668] [10 Regression] ICE generating float vec_inits since r10-808

2020-04-20 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94668

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Target Milestone|--- |10.0

[Bug target/94668] [10 Regression] ICE generating float vec_inits since r10-808

2020-04-20 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94668

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on trunk.

[Bug tree-optimization/94683] New: ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-21 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94683

Bug ID: 94683
   Summary: ICE in forwprop when folding to a VEC_PERM_EXPR
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

Compiling this with -march=armv8.2-a+sve -msve-vector-bits=256 -O3:

#include 

typedef float v8sf __attribute__((vector_size(32)));

svfloat32_t
f (svbool_t pg, svfloat32_t x, svfloat32_t y)
{
  v8sf a = svadd_x (pg, x, 1);
  v8sf b = { a[0], a[0], a[2], a[2], a[4], a[4], a[6], a[6] };
  return svdiv_x (pg, b, y);
}

gives:

foo.c: In function ‘f’:
foo.c:16:1: error: non-trivial conversion in ‘ssa_name’
   16 | }
  | ^
v8sf
__SVFloat32_t
b_17 = _16;
during GIMPLE pass: forwprop
dump file: foo.031t.forwprop1

[Bug tree-optimization/94683] ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-21 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94683

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-04-21
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Testing a patch.

[Bug tree-optimization/94683] ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-21 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94683

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on master.

[Bug tree-optimization/94700] New: ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-21 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94700

Bug ID: 94700
   Summary: ICE in forwprop when folding to a VEC_PERM_EXPR
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

Compiling this with -march=armv8.2-a+sve -msve-vector-bits=256 -O3:

#include 

typedef float v8sf __attribute__((vector_size(32)));

svfloat32_t
f (svbool_t pg, svfloat32_t x, svfloat32_t y)
{
  v8sf a = svadd_x (pg, x, 1);
  v8sf b = { a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7] };
  return svdiv_x (pg, b, y);
}

gives:

foo.c: In function ‘svfloat32_t f(svbool_t, svfloat32_t, svfloat32_t)’: 
foo.c:11:1: error: non-trivial conversion in ‘ssa_name’ 
   11 | }   
  | ^   
v8sf
__SVFloat32_t   
b_18 = _1;  
during GIMPLE pass: forwprop
dump file: foo.031t.forwprop1   

This is the maybe_ident equivalent of PR94683.  I should have
realised at the time that both arms would need fixing.

[Bug tree-optimization/94700] ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-21 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94700

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Keywords||ice-on-valid-code
 Ever confirmed|0   |1
   Last reconfirmed||2020-04-21
 Target||aarch64*-*-*
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Testing a patch.

[Bug tree-optimization/94700] ICE in forwprop when folding to a VEC_PERM_EXPR

2020-04-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94700

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on master.

[Bug target/94678] aarch64: unexpected result with -mgeneral-regs-only and sve

2020-04-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94678

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on master.  Thanks for the patch.

[Bug c/94712] aarch64:Adjust some testcases for ilp32 option conflict

2020-04-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94712

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
 CC||rsandifo at gcc dot gnu.org

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
Fixed on master.  Thanks for the patch.

[Bug tree-optimization/94727] [10 Regression] GCC produces incorrect code with -O3 since r10-5071-g02d895504cc59be0

2020-04-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94727

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
OK, I'll take a look.

[Bug tree-optimization/94727] [10 Regression] GCC produces incorrect code with -O3 since r10-5071-g02d895504cc59be0

2020-04-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94727

--- Comment #5 from rsandifo at gcc dot gnu.org  
---
Well, this is a bit of mess (surprise).  We have a "<" comparison
between two booleans that are leaves of the SLP tree, so
vectorizable_comparison falls back on:

  /* Invariant comparison.  */
  if (!vectype)
{
  vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1),
 slp_node);
  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype), nunits))
return false;
}

rhs1 and rhs2 are *unsigned* boolean types, so we get back a vector
of unsigned integers.  All is well, and "<" works as expected without
the need for:

  /* Boolean values may have another representation in vectors
 and therefore we prefer bit operations over comparison for
 them (which also works for scalar masks).  We store opcodes
 to use in bitop1 and bitop2.  Statement is vectorized as
   BITOP2 (rhs1 BITOP1 rhs2) or
   rhs1 BITOP2 (BITOP1 rhs2)
 depending on bitop1 and bitop2 arity.  */
  bool swap_p = false;
  if (VECTOR_BOOLEAN_TYPE_P (vectype))
{

However, we then defer to vect_get_slp_defs to get the actual operands.
The expected vector type is not part of this interface.  The request
goes to vect_get_constant_vectors, which does:

  if (VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (op))
  && vect_mask_constant_operand_p (stmt_vinfo))
vector_type = truth_type_for (stmt_vectype);
  else
vector_type = get_vectype_for_scalar_type (vinfo, TREE_TYPE (op), op_node);

So the function gives back a vector of mask types, which here are
vectors of *signed* booleans.  This means that "<" gives:

  true (-1) < false (0)

and so the boolean fixup above was needed after all.

I'm going to try:

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 7f3a9fb5fb3..88a1e2c51d2 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -10566,8 +10566,11 @@ vectorizable_comparison (stmt_vec_info stmt_info,
gimple_stmt_iterator *gsi,
   /* Invariant comparison.  */
   if (!vectype)
 {
-  vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1),
-slp_node);
+  if (VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (rhs1)))
+   vectype = mask_type;
+  else
+   vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1),
+  slp_node);
   if (!vectype || maybe_ne (TYPE_VECTOR_SUBPARTS (vectype), nunits))
return false;
 }

which does at least fix the testcase.

[Bug tree-optimization/94727] [10 Regression] GCC produces incorrect code with -O3 since r10-5071-g02d895504cc59be0

2020-04-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94727

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #6)
> (In reply to rsand...@gcc.gnu.org from comment #5)
> > 
> > However, we then defer to vect_get_slp_defs to get the actual operands.
> > The expected vector type is not part of this interface.
> 
> Ah yeah - sth on my list to fix (not making the type part of that API
> but assigning vector types to SLP nodes).  I even have partly completed
> "hacks" to do that.  When we have (and use!) vector types on all SLP
> nodes we can also get rid of the mismatch code.

Sounds great!  The fewer decisions we make on the fly the better...

> > I'm going to try:
> > 
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 7f3a9fb5fb3..88a1e2c51d2 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -10566,8 +10566,11 @@ vectorizable_comparison (stmt_vec_info stmt_info,
> > gimple_stmt_iterator *gsi,
> >/* Invariant comparison.  */
> >if (!vectype)
> >  {
> > -  vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1),
> > -slp_node);
> > +  if (VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (rhs1)))
> > +   vectype = mask_type;
> > +  else
> > +   vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1),
> > +  slp_node);
> >if (!vectype || maybe_ne (TYPE_VECTOR_SUBPARTS (vectype), nunits))
> > return false;
> >  }
> > 
> > which does at least fix the testcase.
> 
> LGTM.

Thanks.  aarch64-linux-gnu and x86_64-linux-gnu bootstrapped passed,
now trying an SVE test run.  Will commit if that passes too.

[Bug tree-optimization/94727] [10 Regression] GCC produces incorrect code with -O3 since r10-5071-g02d895504cc59be0

2020-04-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94727

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from rsandifo at gcc dot gnu.org  
---
Fixed on master.  Thanks for the report.

[Bug tree-optimization/94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482

2020-04-27 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94784

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
Fixed on trunk.

[Bug target/94852] -ffloat-store on x64 target

2020-04-30 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94852

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|RESOLVED|NEW
 CC||rsandifo at gcc dot gnu.org
   Severity|normal  |enhancement
 Ever confirmed|0   |1
 Resolution|DUPLICATE   |---
   Last reconfirmed||2020-04-30

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
Reopening.  I don't think this is a dup of PR323.  The problem
is instead that the workaround for PR323 is silently accepted
and applied even on targets for which the workaround isn't needed.
That seems like a valid point to me, and if we don't have an
earlier PR making the same point, I think we should keep this
one open.

I agree that at the very least the documentation could be improved
(but I'm not volunteering :-))

[Bug tree-optimization/94899] Failure to optimize out add before compare with INT_MIN

2020-05-01 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94899

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Keywords||easyhack
 Status|RESOLVED|NEW
   Last reconfirmed||2020-05-01
 CC||rsandifo at gcc dot gnu.org
 Resolution|INVALID |---
 Ever confirmed|0   |1

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
Reopening because...

(In reply to Andrew Pinski from comment #2)
> This is invalid as 0x8000 is unsigned (C90/C++03) or long (C99/C++11) in
> type.
> Which means then overflow is not undefined but rather wrapping.

It's unsigned int for C99/C++11 too (see the different handling of
decimal-literals and other integer-literals in [lex.icon.type]).

This means that the result of the addition is also unsigned,
For the specific value of 0x8000, the transformation is monotonic,
so the optimisation is valid and well-defined.

[Bug rtl-optimization/94873] [8/9/10/11 Regression] wrong code with -O -fno-merge-constants -fno-split-wide-types -fno-tree-fre

2020-05-01 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94873

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #6 from rsandifo at gcc dot gnu.org  
---
(In reply to Jeffrey A. Law from comment #5)
> I can't see how a REG_EQUAL note on an insn with multiple outputs can
> possibly work -- we wouldn't know what output the REG_EQUAL note refers to. 
> And we have to consider an embedded side effect as having an output.
> 
> Or to think of it another way, any embedded side effect can be implemented
> with a parallel at which point it's painfully obvious the insn has multiple
> outputs and a REG_EQUAL note would be inappropriate.

Yeah, I can see that argument, but to play devil's advocate:

I think the requirement for having a single REG SET_DEST makes
sense because the REG_EQUAL note would be genuinely ambiguous if
there were multiple REG SET_DESTs.  But in the case of a REG_INC
insn with a single REG SET_DEST, there's no ambiguity about which
register is meant.

I guess there's also the problem that stack pushes don't need a
REG_INC note, so anything we do can't just be keyed off REG_INC.
The only sure way to check whether a register is set as a side-effect
is to look at the complete pattern (like dse.c:check_for_inc_dec).

So I think there's the argument that optimisers have to be wary
of this in the same way that they need to be wary of folding:

   (set (reg X)
(and (mem (pre_inc (reg Y)))
 (reg Z)))

into

   (set (reg X) (const_int 0))

when Z can be proven to be zero.

[Bug rtl-optimization/94873] [8/9/10/11 Regression] wrong code with -O -fno-merge-constants -fno-split-wide-types -fno-tree-fre

2020-05-01 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94873

--- Comment #8 from rsandifo at gcc dot gnu.org  
---
(In reply to Segher Boessenkool from comment #7)
> REG_EQ* is documented as only being allowed on insns that set only one
> register.  If you want to change that, you'll have to check *all* code
> that consumes this, see if they rely on that fact or not, and if so,
> change that.

But the point is that the word "set" is ambiguous.  Does it mean
set by a SET or set by any means?  I think it can be read both ways.
After all, a CLOBBER is a form of set too, but that's clearly meant
to be excluded.  

Either way will need auditing.  If we say that this kind of REG_EQUAL
is wrong, we'd in theory need to audit everything that adds REG_EQUAL
notes to make sure it has an appropriate check, or doesn't need one.

I'm also not sure if we really are concerned about multiple registers
in this particular case, or whether it's more the case that we don't
want REG_EQUAL notes on insns with side effects.

[Bug middle-end/94941] New: Expansion of some internal fns can drop the lhs on the floor

2020-05-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94941

Bug ID: 94941
   Summary: Expansion of some internal fns can drop the lhs on the
floor
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

internal-fn.c:expand_mask_load_optab_fn uses expand_insn to emit the
load instruction, but doesn't then test whether the coerced output
operand is the same as the target of the gcall.  It might not be,
for example, in unoptimised code, where the target of the gcall
expands to a MEM rtx and the load insn requires a REG destination.
We need the equivalent of:

  if (!rtx_equal_p (lhs_rtx, ops[0].value))
emit_move_insn (lhs_rtx, ops[0].value);

in expand_while_optab_fn.

This can be seen for AArch64 with the following test,
compiled with -O0 -march=armv8.2-a+sve:

--
#include 

svfloat32_t
foo (float *ptr)
{
  svbool_t pg = svptrue_pat_b32 (SV_VL1);
  svfloat32_t res = svld1 (pg, ptr);
  return res;
}

int
main (void)
{
  svbool_t pg = svptrue_pat_b32 (SV_VL1);
  float x[1] = { 1 };
  if (svptest_any (pg, svcmpne (pg, foo (x), 1.0)))
__builtin_abort ();
  return 0;
}
--

We emit:
;; res_5 = .MASK_LOAD (ptr_4(D), 4B, _2);

(insn 9 8 10 (set (reg/f:DI 96)
(mem/f/c:DI (plus:DI (reg/f:DI 87 virtual-stack-vars)
(const_poly_int:DI [-40, -32])) [3 ptr+0 S8 A64]))
"/tmp/foo.c":7:21 -1
 (nil))

(insn 10 9 0 (set (reg:VNx4SF 97)
(unspec:VNx4SF [
(reg:VNx4BI 92 [ _2 ])
(mem:VNx4SF (reg/f:DI 96) [0 MEM  [(float
*)ptr_4(D)]+0 S[16, 16] A8])
] UNSPEC_LD1_SVE)) "/tmp/foo.c":7:21 -1
 (nil))

but don't store reg 97 to the stack slot for "res".  Then the return
statement loads from "res":

(insn 12 11 0 (set (reg:VNx4SF 93 [ _6 ])
(unspec:VNx4SF [
(subreg:VNx4BI (reg:VNx16BI 98) 0)
(mem/c:VNx4SF (plus:DI (reg/f:DI 87 virtual-stack-vars)
(const_poly_int:DI [-32, -32])) [2 res+0 S[16, 16]
A128])
] UNSPEC_PRED_X)) "/tmp/foo.c":8:10 -1
 (nil))

meaning we return uninitialised stack contents.

The same problem affects expand_load_lanes_optab_fn and
expand_gather_load_optab_fn.

I think this problem has existed since the mask load/store
functions were introduced, but it was probably latent until
GCC 10 because nothing would use them in unoptimised code.

[Bug middle-end/94941] Expansion of some internal fns can drop the lhs on the floor

2020-05-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94941

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Last reconfirmed||2020-05-04
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Testing a patch.

[Bug rtl-optimization/94873] [8/9/10/11 Regression] wrong code with -O -fno-merge-constants -fno-split-wide-types -fno-tree-fre

2020-05-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94873

--- Comment #11 from rsandifo at gcc dot gnu.org  
---
(In reply to Segher Boessenkool from comment #9)
> "clobber" is a red herring; it is impossible to make a REG_EQ* note for
> a clobber, a clobber does not set a new value (that is the whole point
> of a clobber).

It's not possible to attach a REG_EQ* note to an auto-inc-dec
either though (that was my point).  So a clobber doesn't seem
any less of a red herring than the auto-inc-dec itself.  Both
set registers in the sense of changing them, and neither can
be described by a REG_EQ* note.

> I think we could allow auto-modify, sure, just as long as it stays clear
> what lhs the REG_EQ* note is talking about; and, as you say, everything
> needs to be audited for it :-/

Yeah.  But to be clear: I don't think this is more obviously
a change from the status quo than going in the other direction
would be.  The point is that the status quo is ambiguous:
the documentation can be read either way, and the
implementation isn't consistent (hence the bug).

So it's a question of how we resolve the ambiguity.
If we want passes to be able to assume without checking
that insns with REG_EQ* notes don't also include auto
inc-dec, we'll need to audit places that create the notes,
or that update insns with existing notes.

I think it comes down to what the REG_EQ* notes are supposed
to achieve (conceptually, ignoring documentation and the
current implementation for now).  The "weak" guarantee is
that the SET_DEST has the specified value after the
instruction.  The "strong" guarantee extends the weak guarantee
by saying that the SET_SRC of the definition can be replaced
by the REG_EQ* note without changing behaviour.

Having auto-inc-dec in the SET_SRC of the definition is
OK for the weak guarantee but not the strong guarantee.
But the same would be true of any SET_SRC with side effects.
So to frame the question in a different way: let's assume
there's a target-specific intrinsic that has side effects
that can be described using unspec_volatile, and that the
intrinsic also sets a register.  Normally this would be
described as:

  (set (reg X) (unspec_volatile ...))

But if, in a particular context, the target could predict
what the value of X was, could it attach a REG_EQ* note
to say that?  It would then be valid to simplify later
uses of X, even though the definition of X can't change.

IMO this is easier to answer for REG_EQUIV.  That mostly
exists to allow the RA to rematerialise a value instead
of reloading it.  So it's all about replacing the uses
of the register rather than about replacing the definition.
(I'm not saying that the RA would handle a REG_EQUIV note
on the above unspec_volatile correctly -- haven't checked
either way -- but in principle it could.)

The weak guarantee makes life harder for consumers of the
notes that want the strong guarantee, since they then have
to check for side effects themselves.  The weak guarantee
is easy for producers of the notes and for consumers that
only care about users of the register.

The strong guarantee makes life harder for producers of the
notes, or for optimisations that modify insns with existing
notes.  The strong guarantee is easy for consumers of the notes
because it's more conservative.

The weak guarantee potentially allows more optimisation
than the strong guarantee.

I don't think there's much in it.  But I guess I personally
prefer the weak guarantee for the "more optimisation" reason.
There's also a very tenuous analogy with REG_RETURNED,
which is explicitly for saying what the return value is,
rather than saying how the definition can be rewritten.

But whichever we go for, I think it should be a decision
about side effects vs. no side effects, with auto inc-dec
being just one of several potential side effects.  I don't
see any reason why the auto-inc-dec case would be different
from the unspec_volatile case.

[Bug middle-end/94941] Expansion of some internal fns can drop the lhs on the floor

2020-05-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94941

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
Fixed in master and on the GCC 10 branch.

[Bug rtl-optimization/94873] [8/9/10/11 Regression] wrong code with -O -fno-merge-constants -fno-split-wide-types -fno-tree-fre

2020-05-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94873

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||ebotcazou at gcc dot gnu.org

--- Comment #12 from rsandifo at gcc dot gnu.org  
---
Just noticed that Eric wasn't on cc: (although he might have
preferred that it stay that way).

[Bug target/92469] ICE: output_operand: invalid use of register 'frame' in 7/8/9/10

2020-05-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92469

--- Comment #10 from rsandifo at gcc dot gnu.org  
---
"jakub at gcc dot gnu.org"  writes:
> --- Comment #9 from Jakub Jelinek  ---
> Seems neither accessible_reg_set nor operand_reg_set can exclude frame, 
> because
> the latter results in general_operand predicate failing for it and the former
> results into the latter not being enabled either.
> So, if we wanted to reject this right away, we'd have to add another hard reg
> set (e.g. containing gcc internal artifical regs that shouldn't appear in asm
> register specification), or a target hook that would reject such registers 
> that
> varasm.c would call.

Agree a hard reg set would be good.  We should be able to fill in
the easy cases automatically, e.g. FRAME_POINTER_REGNUM when it's
!= HARD_FRAME_POINTER_REGNUM.  But some cases would still need
target help.

[Bug target/94980] [8/9/10/11 Regression] ICE: verify_gimple failed: position plus size exceeds size of referenced object in 'bit_field_ref' with -mavx512vl

2020-05-07 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94980

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
Mine.  Looks like the problem was latent before.  The testcase
ends up using vector boolean types with 2-bit element types.
Before that patch we computed the TYPE_SIZE of the vector type as:

TYPE_SIZE (type) = int_const_binop (MULT_EXPR,
TYPE_SIZE (innertype),
bitsize_int (nunits));

where the TYPE_SIZE of the boolean is 8 bits.  So the TYPE_SIZE
of the vector was 4*8 == 32 bits, even though the TYPE_SIZE_UNIT
was only 1 byte.  This meant that the BIT_FIELD_REFs passed the
tree-cfg.c checks but led to incorrect rtl:

;; _13 = { -1, -1, -1, -1 };

(insn 18 17 0 (set (reg:QI 92 [ _13 ])
(const_int 15 [0xf])) "/tmp/pr94980.c":6:17 -1
 (nil))

;; _14 = BIT_FIELD_REF <_13, 8, 8>;

(insn 19 18 20 (set (mem/c:QI (plus:DI (reg/f:DI 77 virtual-stack-vars)
(const_int -4 [0xfffc])) [2  S1 A8])
(reg:QI 92 [ _13 ])) "/tmp/pr94980.c":6:17 -1
 (nil))

(insn 20 19 0 (set (reg:QI 93 [ _14 ])
(mem/c:QI (plus:DI (reg/f:DI 77 virtual-stack-vars)
(const_int -3 [0xfffd])) [0  S1 A8]))
"/tmp/pr94980.c":6:17 -1
 (nil))

where the last insn reads back uninitialised memory.

[Bug target/94980] [8/9/10/11 Regression] ICE: verify_gimple failed: position plus size exceeds size of referenced object in 'bit_field_ref' with -mavx512vl

2020-05-12 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94980

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from rsandifo at gcc dot gnu.org  
---
Fixed on master.  As mentioned on-list, GCC 7 generated wrong code
instead of ICEing, and there's a danger GCC 8+ could regress to that
in other cases if we backported the patches.  So instead it seems
more appropriate to leave the branches alone for now and perhaps
revisit if a real-world test case comes along.  We might by that
time have found more places that need similar fixes.

[Bug target/95105] New: Bogus reference-compatibility error for arm_sve_vector_bits

2020-05-13 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95105

Bug ID: 95105
   Summary: Bogus reference-compatibility error for
arm_sve_vector_bits
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling this testcase with -march=armv8.2-a+sve
-msve-vector-bits=512:

--
typedef __SVFloat32_t foo;
typedef foo bar __attribute__((arm_sve_vector_bits(512)));
template struct s { T x; };
extern s a;
bar &b = a.x;
--

gives the bogus error:

error: cannot bind non-const lvalue reference of type ‘bar&’ {aka
‘__SVFloat32_t&’} to an rvalue of type ‘bar’ {aka ‘__SVFloat32_t’}
5 | bar &b = a.x;
  |  ~~^

The testcase works if the attribute is applied directly
to __SVFloat32_t.

The bug is in the target-specific handling of the attribute.

[Bug target/95105] Bogus reference-compatibility error for arm_sve_vector_bits

2020-05-13 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95105

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Target Milestone|--- |10.2
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
   Last reconfirmed||2020-05-13

[Bug middle-end/95114] New: [9/10/11 Regression] ICE in obj_type_ref_class for structural-equality types

2020-05-13 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95114

Bug ID: 95114
   Summary: [9/10/11 Regression] ICE in obj_type_ref_class for
structural-equality types
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Since r9-5035-g4611c03d2b0edefc8d8e17872ef143428f56380b,
the following testcase has ICEd for aarch64, compiled at -O0:

---
template struct foo { virtual void f() = 0; };
extern foo<__Int8x8_t> &x;
void f() { x.f(); }
---

The exact ICE has changed over time, but the current form is:

---
during GIMPLE pass: *rebuild_cgraph_edges
foo.c: In function ‘void f()’:
foo.c:3:19: internal compiler error: Segmentation fault
3 | void f() { x.f(); }
  |   ^
0x16cfec1 crash_signal
src/gcc/gcc/toplev.c:328
0x129a4ca obj_type_ref_class(tree_node const*)
src/gcc/gcc/ipa-devirt.c:1915
0x1afe3d7 virtual_method_call_p(tree_node const*)
src/gcc/gcc/tree.c:12867
0xf1102b cgraph_node::create_indirect_edge(gcall*, int, profile_count, bool)
src/gcc/gcc/cgraph.c:990
0xf2113d cgraph_edge::rebuild_edges()
src/gcc/gcc/cgraphbuild.c:421
0xf214aa execute
src/gcc/gcc/cgraphbuild.c:490
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
---

This is because for structural-equality types we call
get_odr_type:

---
  if (!in_lto_p && !TYPE_STRUCTURAL_EQUALITY_P (ret))
ret = TYPE_CANONICAL (ret);
  else
ret = get_odr_type (ret)->type;
---

and the hash table is empty at this point.  (We did previously
have a hash entry for the type, but that was freed during
pass_ipa_free_lang_data.)

The test does pass at -O2, but fails with
-fdump-tree-all-details -O2:

---
foo.c: In function ‘void f()’:
foo.c:3:19: internal compiler error: Segmentation fault
3 | void f() { x.f(); }
  |   ^
0x16cfec1 crash_signal
src/gcc/gcc/toplev.c:328
0x12a20f8 hash_table::find_slot_with_hash(tree_node* const&, unsigned int,
insert_option)
src/gcc/gcc/hash-table.h:967
0x129a622 get_odr_type(tree_node*, bool)
src/gcc/gcc/ipa-devirt.c:1939
0x129a4c9 obj_type_ref_class(tree_node const*)
src/gcc/gcc/ipa-devirt.c:1915
0x1afe3d7 virtual_method_call_p(tree_node const*)
src/gcc/gcc/tree.c:12867
0x180f086 dump_generic_node(pretty_printer*, tree_node*, int, dump_flag, bool)
src/gcc/gcc/tree-pretty-print.c:3125
...
---

I guess the fact that this routine is called by the dump routines
(which shouldn't affect codegen) suggests that we should cope
with null returns from get_odr_type, rather than passing "true"
as the insert parameter.  Is that right?

[Bug middle-end/95114] [9/10/11 Regression] ICE in obj_type_ref_class for structural-equality types

2020-05-13 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95114

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Target Milestone|--- |9.4
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
   Last reconfirmed||2020-05-13
 Ever confirmed|0   |1

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Testing a patch that tests for null returns.

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-14 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #12 from rsandifo at gcc dot gnu.org  
---
(In reply to rguent...@suse.de from comment #11)
> On Thu, 14 May 2020, ubizjak at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658
> > 
> > --- Comment #10 from Uroš Bizjak  ---
> > The patch is ready to be pushed, it is waiting for a decision what to do 
> > with
> > failed cases.
> > 
> > Richi, should this patch move forward (eventually XFAILing failed cases), 
> > or do
> > you plan to look at the fails from the generic vectorizer POV?
> 
> I think we should go forward with the patch, either XFAILing the testcases
> or splitting out the testcase (and backend patterns that do not get
> exercised due to the issue).
> 
> I've already said in comment#8 that the issue here is optabs working
> with modes and not vector types, so it's a bit hard to use the
> same mechanism to deal with the currently failing cases.  One
> possible route would be to add V4QImode similar to how we now
> do V2SFmode with SSE but of course where do we stop ...
> 
> As said we can try to tackle this incrementally.  Maybe Richard also
> has input here?
Nothing to add really, but: yeah, the idea was that the target
would provide smaller vector modes if it can efficiently load,
store and (at least) add them.  I think it would be good to do
that for aarch64 Advanced SIMD at some point.

There are three points behind using vector modes for this:

1. It gives the target the flexibility to decide how best to
   implement the operation (All at one end, and if so, which end?
   Or spread out evenly across the vector?)  On aarch64, the
   choice would differ between SVE and Advanced SIMD.

2. It means that __attribute__((vector_size)) vectors benefit.

3. The smaller modes can be used in conjunction with
   autovectorize_vector_modes to give the loop vectoriser
   a choice between operating on packed or unpacked vectors.
   This is mostly useful for VECT_COMPARE_COSTS but can help
   with low-trip-count loops even without.

So yeah, defining V4QI sounds like the way to go if the
architecture can process it efficiently.  I get the "where
do we stop" concern, but .md mode iterators should help.

[Bug target/94991] aarch64: ICE: Segmentation fault with option -mgeneral-regs-only

2020-05-18 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94991

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed on trunk.

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

--- Comment #8 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #7)
> (In reply to Uroš Bizjak from comment #6)
> > (In reply to Hongtao.liu from comment #5)
> > > (In reply to Uroš Bizjak from comment #3)
> > > > It turns out that a bunch of patterns have to be renamed (and testcases
> > > > added).
> > > > 
> > > > Easyhack, waiting for someone to show some love to conversion patterns 
> > > > in
> > > > sse.md.
> > > 
> > > expander for floatv4siv4df2, fix_truncv4dfv4si2 already exists.
> > > 
> > > if change **float_double fix_double** to
> > > ---
> > > void
> > > float_double (void)
> > > {
> > > d[0] = i[0];
> > > d[1] = i[1];
> > > d[2] = i[2];
> > > d[3] = i[3];
> > > }
> > 
> > Hm, the above is vectorized, but the equivalent:
> > 
> > void
> > float_double (void)
> > {
> >   for (int n = 0; n < 4; n++)
> > d[n] = i[n];
> > }
> > 
> > is not?
> 
> Yes, we're committing to a too high VF here, likely because we pick the
> "wrong" vector mode too early.  We could eventually fix this up in
> the early vectype analysis.
It might be worth investigating VECT_COMPARE_COSTS, which weighs
the cost of different VFs against each other and is how SVE copes
with this.  I guess the danger is that it might interfere with
-mprefer-* options (although the first VF listed by
autovectorize_vector_modes wins in a tie).

[Bug target/95341] New: Poor vector_size decomposition when SVE is enabled

2020-05-26 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95341

Bug ID: 95341
   Summary: Poor vector_size decomposition when SVE is enabled
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling the following with -O2 -march=armv8-a:

  typedef unsigned int uint_vec __attribute__((vector_size(32)));
  uint_vec f1(uint_vec x, uint_vec y) { return x + y; }

generates nice Advanced SIMD code:

ld1 {v4.16b - v5.16b}, [x0]
ld1 {v2.16b - v3.16b}, [x1]
add v0.4s, v4.4s, v2.4s
add v1.4s, v5.4s, v3.4s
st1 {v0.16b - v1.16b}, [x8]
ret

But compiling with -march=armv8.2-a+sve generates extremely bad
scalar code, so bad that I'll spare people's eyes by not quoting
it here.

I haven't yet analysed this or checked whether it's a regression.

[Bug target/95341] Poor vector_size decomposition when SVE is enabled

2020-05-26 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95341

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
(In reply to ktkachov from comment #2)
> Interestingly, -msve-vector-bits gives good codegen for 128, 256, 512 but
> bad for 1024, 2048 and VLA code
That's because addition is the one operation that we currently support
on unpacked vectors.  For subtraction it's a different story (just 128-bit
and 256-bit is good, as expected).

But the surprising thing for me was that we didn't fall back
to Advanced SIMD in VLA mode...

[Bug target/95361] New: Segfault when generating an epilogue for a partly-shrinked-wrapped SVE frame

2020-05-27 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95361

Bug ID: 95361
   Summary: Segfault when generating an epilogue for a
partly-shrinked-wrapped SVE frame
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

Compiling the following with -O2 -march=armv8.2+sve causes
the compiler to segfault during aarch64_expand_epilogue:


__SVInt8_t
f (__SVInt8_t x, int y)
{
  if (y == 1)
asm volatile ("" ::: "z8");
  if (y == 2)
asm volatile ("" ::: "z9");
  return x;
}


The problem is that we individually shrink-wrap the saves
and restores of z8 and z9, but need to keep the stack
allocation common to both arms.  Before emitting the
deallocation instruction, we try to add a REG_CFA_DEF_CFA
note to the final restore, which doesn't exist:

  if (callee_adjust != 0 || maybe_gt (initial_adjust, 65536))
{
  /* Emit delayed restores and set the CFA to be SP + initial_adjust.  */
  insn = get_last_insn ();
  rtx new_cfa = plus_constant (Pmode, stack_pointer_rtx, initial_adjust);
  REG_NOTES (insn) = alloc_reg_note (REG_CFA_DEF_CFA, new_cfa, cfi_ops);
  RTX_FRAME_RELATED_P (insn) = 1;
  cfi_ops = NULL;
}

This in practice only happens for SVE because:

(a) We don't try to shrink-wrap wb_candidate* registers even when
we've decided to treat them as normal saves and restores.
I have a fix for that.

(b) Even with (a) fixed, we're (almost?) guaranteed to emit a stack
tie for frames that are 64k or larger, so we end up hanging the
REG_CFA_DEF_CFA note on that instead.

I haven't yet checked how far back this goes.

[Bug target/95361] Segfault when generating an epilogue for a partly-shrinked-wrapped SVE frame

2020-05-27 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95361

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Last reconfirmed||2020-05-27
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Mine.

[Bug target/95459] aarch64: ICE in in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803

2020-06-01 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
(In reply to Fei Yang from comment #0)
> -   gcc_assert (aarch64_sve_mode_p (mode));
> +   gcc_assert (TARGET_SVE ? aarch64_sve_mode_p (mode) : true);

Probably clearer as:

  !TARGET_SVE || aarch64_sve_mode_p (mode)

Might be worth adding a comment like:

  /* Leave later code to report an error if SVE is disabled.  */

[Bug target/95459] aarch64: ICE in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803

2020-06-02 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug target/95523] aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2020-06-04
 Ever confirmed|0   |1

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
The reason for the assert is that the alignment is part of the
ABI of the types and is relied on when using LDR and STR for
some moves.  Even though -fpack-struct=N changes the ABI in general,
I don't think it should change it in this particular case.

I have to wonder why GCC even has -fpack-struct= though.  Do you have
a specific need for it, or was this caught by option coverage testing?

If there is no specific need, I'd be tempted to make -fpack-struct
unsupported for AArch64.  I think there are so many other things
that could go wrong if it is used.

[Bug target/95523] aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-04 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
(In reply to z.zhanghaij...@huawei.com from comment #2)
> (In reply to rsand...@gcc.gnu.org from comment #1)
> > The reason for the assert is that the alignment is part of the
> > ABI of the types and is relied on when using LDR and STR for
> > some moves.  Even though -fpack-struct=N changes the ABI in general,
> > I don't think it should change it in this particular case.
> > 
> > I have to wonder why GCC even has -fpack-struct= though.  Do you have
> > a specific need for it, or was this caught by option coverage testing?
> > 
> > If there is no specific need, I'd be tempted to make -fpack-struct
> > unsupported for AArch64.  I think there are so many other things
> > that could go wrong if it is used.
> 
> We use the option in some projects. If this option is not supported, the
> impact is relatively large.

Ah, OK.  If it's an issue that affects real use-cases then
let's make it work.

> I think it can make the error reporting more
> friendly, so that this option can still be used normally without sve.
> 
> like:
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index bdb04e8170d..e93e766aba6 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3504,6 +3504,12 @@ handle_arm_sve_h ()
>return;
>  }
> 
> +  if (maximum_field_alignment)
> +{
> +  error ("SVE is incompatible with the use of %qs or %qs",
> "-fpack-struct", "#pragma pack");
> +  return;
> +}
> +
>sve_switcher sve;
> 
>/* Define the vector and tuple types.  */
> 
> Any suggestions?

Could you try setting DECL_USER_ALIGN on the FIELD_DECL?
that should (hopefully) force the field to keep its
natural alignment.

[Bug middle-end/95528] [10/11 Regression] internal compiler error: in emit_move_insn, at expr.c:3814

2020-06-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

--- Comment #6 from rsandifo at gcc dot gnu.org  
---
(In reply to Jakub Jelinek from comment #4)
> I'd say the vectorizer/simplify_vector_constructor just shouldn't attempt to
> use these (e.g. vec_pack*, vec_unpack* optabs) for !VEC_MODE_P unless it is
> VECTOR_BOOLEAN_TYPE_P type.
> For i386 it would be the right thing as the patterns really assume that it
> is vector booleans and have their properties.
> Though, aarch64 seems to have vec_pack_trunc_di and vec_pack_trunc_df
> expanders, it is unclear to me what they are for and if they are really used.
> Other targets seem to only define these for vector modes.
Yeah, I agree those look odd.  The covering note for the patch
that added them was:

  https://gcc.gnu.org/pipermail/gcc-patches/2013-April/361636.html

which talks about fixing gcc.dg/vect failures.  But as James says,
only the 128-bit patterns should be needed for that.  Maybe the
the 64-bit patterns were just added for completeness.

Perhaps one justification for _di is that there is no V1DI mode.
Instead a vector of 1 DImode would itself have mode DImode.
So in principle, vec_pack_trunc_di is probably the right name
for a (V1)DI->V2SI truncate.

The same doesn't apply to _df since we don't use scalar float
modes for V1 vectors.  And (unlike at the time of the patch)
we now have V1DF.  So I agree that the _df one looks dead.

That said, for AArch64 we'd now try to mix 128-bit and 64-bit
vectors instead of vectorising with 2 64-bit vectors.  So the
_di pattern probably isn't useful in practice either.

In summary: from an AArch64 perspective, it's probably fine to
check !VECTOR_MODE_P || VECTOR_BOOLEAN_TYPE_P.  But given the V1
thing, maybe it would be better to add || m == GET_MODE_INNER (m)
as well (unless that defeats the fix).

[Bug middle-end/95528] [10/11 Regression] internal compiler error: in emit_move_insn, at expr.c:3814

2020-06-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
(In reply to rsand...@gcc.gnu.org from comment #6)
> In summary: from an AArch64 perspective, it's probably fine to
> check !VECTOR_MODE_P || VECTOR_BOOLEAN_TYPE_P.  But given the V1
> thing, maybe it would be better to add || m == GET_MODE_INNER (m)
> as well (unless that defeats the fix).
Er, I mean: m == TYPE_MODE (TREE_TYPE (vectype)) or whatever.
m == GET_MODE_INNER (m) is of course always true for scalars :-)

[Bug target/95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug middle-end/95528] [10/11 Regression] internal compiler error: in emit_move_insn, at expr.c:3814

2020-06-05 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

--- Comment #9 from rsandifo at gcc dot gnu.org  
---
(In reply to Jakub Jelinek from comment #8)
> Created attachment 48683 [details]
> gcc11-pr95528.patch
> 
> Untested fix.
The VECTOR_TYPE_P condition should be redundant.
Looks good to me otherwise FWIW.

[Bug target/95523] aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-10 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug tree-optimization/95570] vect: ICE: Segmentation fault in vect_loop_versioning

2020-06-12 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95570

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
 CC||rsandifo at gcc dot gnu.org

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug tree-optimization/95694] [9/10/11 Regression] ICE in trunc_int_for_mode, at explow.c:59 since r9-7156-g33579b59aaf02eb7

2020-06-16 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95694

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2020-06-16

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
OK.

It looks like the mode is readily available in each call to
REDUCE_BIT_FIELD, so I think we should just pass that down.
It would also avoid:

  if (target && GET_MODE (target) != GET_MODE (exp))
target = 0;

creating unnecessary temporaries for constant integers.

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-06-17 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
 CC||rsandifo at gcc dot gnu.org

--- Comment #11 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug c++/95726] ICE with aarch64 __Float32x4_t as template argument

2020-06-17 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95726

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
(In reply to Jason Merrill from comment #1)
> Does the aarch64 port expect __Float32x4_t type to be considered equivalent
> to the GNU vector type or not?  If so, why use build_distinct_type_copy over
> build_variant_type_copy?  If not, they might want to set TYPE_INDIVISIBLE_P
> so that structural_comptypes treats them as different.
They mangle differently, and e.g.:

void f(float32x4_t);
void f(V);

aren't ODR equivalent.  But a lot of code relies on the GNU vector
extensions being available for float32x4_t as well as V, and on V
and float32x4_t being mutually and implicitly interconvertible.

[Bug c++/95726] ICE with aarch64 __Float32x4_t as template argument

2020-06-18 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95726

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

[Bug c++/95726] ICE with aarch64 __Float32x4_t as template argument

2020-06-18 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95726

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
(In reply to Jakub Jelinek from comment #3)
> But if they mangle differently, then structural_comptypes shouldn't treat
> them as same types.
That certainly avoids the ICE, and makes GCC's behaviour consistent
with Clang for things like:

  typedef float vecf __attribute__((vector_size(16)));
  vecf x;
  float32x4_t &y = x;

Previously we accepted this, with the struct_comptypes change
we reject it (like Clang does).  But that might break existing
code, so I'm not sure it would be backportable.

I guess the question then is: what does TYPE_STRUCTURAL_EQUALITY_P
mean for VECTOR_TYPEs in the context of structural_comptypes?
And (if this is a different question) what case is that function's
VECTOR_TYPE handling for?  I.e. when do we want to return true for
a pair of VECTOR_TYPEs whose TYPE_MAIN_VARIANTs are different?

[Bug c++/95726] ICE with aarch64 __Float32x4_t as template argument

2020-06-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95726

--- Comment #8 from rsandifo at gcc dot gnu.org  
---
Thanks for the pointers.  Putting the mangled name in a target-specific
attribute (like we do for SVE) seems to fix it.  It actually also keeps
the testcase in comment 4 “working”, which is unexpected (to me) and
seems a little dubious.  I guess it's good news for backporting though.

[Bug bootstrap/95805] [11 regression] gcc/recog.h:301:30: error: too many arguments to function call, expected 1, have 2

2020-06-22 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95805

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
Fixed by g:20a2e6afa82c165bcd1a8749478e9402a25088a4.

[Bug target/95674] Unnecessary move when doing division-by-multiplication

2020-06-27 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95674

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
(In reply to Vladimir Makarov from comment #3)
> I looked at this problem.
> 
> All assignments are done in IRA (LRA does not change them).  We can not make
> a better assignment because scratches do not permit to store any preferences
> from instruction constraints (pseudo-registers permits to do this). So
> storing and getting this info for scratches is the first step to solving the
> problem.
> 
> LRA changes scratches to pseudo-registers to generate the correct code
> satisfying the insn constraints and turn them back to scratches when the
> corresponding pseudo-registers do not get hard registers.  Moving change of
> scratches to pseudo-regs from LRA to IRA could help but it is a big work.
Yeah, sounds useful!  Could it be done as a prepass before IRA proper?
(Maybe not -- haven't really thought about it much.)

> Another solution is to not use scratches in machine-descriptions and use
> pseudo-registers instead.  Scratches are bad and should be avoided as much
> as possible.  I expressed this several times.  Besides it is not possible to
> keep hard register preferences for them, they also can not be taken into
> account in conflict graph and this results in overoptimistic assignments
> which LRA has to correct.
I can see that scratches make life harder for the register allocators,
at least until the above change is made.  But they're very useful for
pre-RA passes.  They make it possible for any pass to use recog to match
a candidate instruction and have recog automatically add the required
scratches, without the passes having to cope with new registers being
created “spontaneously”.  (Combine is of course the main example,
but others like fwprop benefit too.)

Although it would be possible to use pseudos instead, I'm not sure
the end result would be better than what we have now:
- Each pass would need to cope with recog creating new registers
- We'd need some way of deferring the creation of pseudo registers
  for successful but speculative recogs that might later be rejected
  for cost reasons, or at least have some kind of recycling scheme.
- Even then, we'll have more pseudos before RA, and so most of the
  DF datastructures will be larger and perhaps sparser.

[Bug target/95958] New: [meta-bug] Inefficient arm_neon.h code for AArch64

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958

Bug ID: 95958
   Summary: [meta-bug] Inefficient arm_neon.h code for AArch64
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: meta-bug
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

This meta-bug is to collect AArch64 performance improvements that
require or have a strong relationship with arm_neon.h intrinsics.

This PR overlaps to some extent with PR47562, but that PR is more
general in two respects: it covers general arm_neon.h improvements,
and it covers both AArch32 and AArch64.

[Bug target/95958] [meta-bug] Inefficient arm_neon.h code for AArch64

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-06-29
 Status|UNCONFIRMED |NEW

[Bug target/95958] [meta-bug] Inefficient arm_neon.h code for AArch64

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Depends on||66675, 80283, 88212, 89057,
   ||89606, 89967, 91753, 94442,
   ||95265, 91598, 82074

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
At the time of writing, PR80283 doesn't have an AArch64 testcase,
only an AArch32 one.  However, the underlying issue applies across
targets.  The content probably overlaps a lot with PR91598, which
is AArch64-specific.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66675
[Bug 66675] Could improve vector lane folding style operations.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283
[Bug 80283] [8/9/10/11 Regression] bad SIMD register allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82074
[Bug 82074] [aarch64] vmlsq_f32 compiled into 2 instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88212
[Bug 88212] IRA Register Coalescing not working for the testcase
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057
[Bug 89057] [8/9/10/11 Regression] AArch64 ld3 st4 less optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606
[Bug 89606] Extra mov after structure load instructions on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89967
[Bug 89967] Inefficient code generation for vld2q_lane_u8 under aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598
[Bug 91598] [8/9 regression] 60% speed drop on neon intrinsic loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91753
[Bug 91753] Bad register allocation of multi-register types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94442
[Bug 94442] [10/11 regression] Redundant loads/stores emitted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95265
[Bug 95265] aarch64: suboptimal code generation for common neon intrinsic
sequence involving shrn and mull

[Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

Bug ID: 95962
   Summary: Inefficient code for simple arm_neon.h iota operation
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
  Target Milestone: ---
Target: aarch64*-*-*

For:

#include 

int32x4_t
foo (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return vld1q_s32 (array);
}

we produce:

foo:
.LFB4217:
.cfi_startproc
sub sp, sp, #16
.cfi_def_cfa_offset 16
mov x0, 2
mov x1, 4294967296
movkx0, 0x3, lsl 32
stp x1, x0, [sp]
ldr q0, [sp]
add sp, sp, 16
.cfi_def_cfa_offset 0
ret

In contrast, clang produces essentially perfect code:

adrpx8, .LCPI0_0
ldr q0, [x8, :lo12:.LCPI0_0]
ret

I think the problem is a combination of two things:

- __builtin_aarch64_ld1v4si & co. are treated as general
  functions rather than pure functions, so in principle
  it could write to the given address.  This stops us
  promoting the array to a constant.

- The loads could be reduced to native gimple-level
  operations, at least on little-endian targets.

IMO this a bug rather than an enhancement.  Intrinsics only
exist to optimise code, and what GCC is doing falls short
of what users should reasonably expect.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64

[Bug target/95964] New: AArch64 arm_neon.h arithmetic functions lack appropriate attributes

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95964

Bug ID: 95964
   Summary: AArch64 arm_neon.h arithmetic functions lack
appropriate attributes
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
  Target Milestone: ---
Target: aarch64*-*-*

For:

---
#include 
#include 

std::vector a, b, c;

void
foo (size_t n)
{
  for (size_t i = 0; i < n; ++i)
a[i] = vfmaq_f32(a[i], b[i], c[i]);
}
---

we generate code that loads the start of a, b and c
in every iteration of the loop:

---
.cfi_startproc
cbz x0, .L4
adrpx3, .LANCHOR0
add x3, x3, :lo12:.LANCHOR0
mov x2, 0
.p2align 3,,7
.L6:
ldr x4, [x3]
lsl x1, x2, 4
ldr x6, [x3, 24]
add x2, x2, 1
ldr x5, [x3, 48]
ldr q0, [x4, x1]
ldr q2, [x6, x1]
ldr q1, [x5, x1]
fmlav0.4s, v2.4s, v1.4s
str q0, [x4, x1]
cmp x0, x2
bne .L6
.L4:
ret
.cfi_endproc
---

The problem is that __builtin_aarch64_fmav4sf and similar
operations are treated as general functions that can read
memory, write memory, and call other functions.  If the
intrinsic is replaced by arithmetic then the start addresses
are hoisted, as expected.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64

[Bug target/95964] AArch64 arm_neon.h arithmetic functions lack appropriate attributes

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95964

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #1)
> You could use fnspec attributes to improve things but of course open-coding
> those as GIMPLE is preferable (last resort is to "fold" the calls to
> GIMPLE sequences as powerpc does for select builtins).
Yeah, open-coding is another option, if the operation and
command-line options are right.  Part of the problem though
is that the intrinsics are supposed to behave like the associated
instructions, including in terms of honouring the rounding mode,
etc.  They're also not supposed to inherit C's idea of what's
undefined behaviour for the closest “equivalent” operators or
libm functions.

So yeah, I think adding attributes is the way to go in general.

[Bug target/95967] New: Poor aarch64 vector constructor code when using arm_neon.h

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95967

Bug ID: 95967
   Summary: Poor aarch64 vector constructor code when using
arm_neon.h
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
Depends on: 95962
Blocks: 95958
  Target Milestone: ---
Target: aarch64*-*-*

We generate poor code for the attached functions:

f1:
moviv4.4s, 0
ins v4.s[0], v0.s[0]
ins v4.s[1], v1.s[0]
ins v4.s[2], v2.s[0]
mov v0.16b, v4.16b
ins v0.s[3], v3.s[0]
ret

f2:
dup v0.4s, v0.s[0]
ins v0.s[1], v1.s[0]
ins v0.s[2], v2.s[0]
ins v0.s[3], v3.s[0]
ret

f3:
sub sp, sp, #16
stp s0, s1, [sp]
stp s2, s3, [sp, 8]
ldr q0, [sp]
add sp, sp, 16
ret

g1:
moviv0.4s, 0
ld1 {v0.s}[0], [x0]
ld1 {v0.s}[1], [x1]
ld1 {v0.s}[2], [x2]
ld1 {v0.s}[3], [x3]
ret

g2:
ld1r{v0.4s}, [x0]
ld1 {v0.s}[1], [x1]
ld1 {v0.s}[2], [x2]
ld1 {v0.s}[3], [x3]
ret

g3:
sub sp, sp, #16
ldr s0, [x3]
ldr s3, [x0]
ldr s2, [x1]
ldr s1, [x2]
stp s3, s2, [sp]
stp s1, s0, [sp, 8]
ldr q0, [sp]
add sp, sp, 16
ret

All three f functions should generate:

mov v0.s[1], v1.s[0]
mov v0.s[2], v2.s[0]
mov v0.s[3], v3.s[0]
ret

and all three g functions should generate:

ldr s0, [x0]
ld1 { v0.s }[1], [x1]
ld1 { v0.s }[2], [x2]
ld1 { v0.s }[3], [x3]
ret

which is what current Clang does.

Getting the right code for f3 and g3 depends on the fix for PR95962.
There's a reasonable chance that PR95962 will be enough on its own
to fix f3 and g3, but I included them just in case it isn't.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
[Bug 95962] Inefficient code for simple arm_neon.h iota operation

[Bug target/95967] Poor aarch64 vector constructor code when using arm_neon.h

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95967

--- Comment #1 from rsandifo at gcc dot gnu.org  
---
Created attachment 48802
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48802&action=edit
6 constructor functions

This time with attachment -- not sure what happened last time.

[Bug target/95969] New: Use of __builtin_aarch64_im_lane_boundsi in AArch64 arm_neon.h interferes with gimple optimisation

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95969

Bug ID: 95969
   Summary: Use of __builtin_aarch64_im_lane_boundsi in AArch64
arm_neon.h interferes with gimple optimisation
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
  Target Milestone: ---
Target: aarch64*-*-*

For:


#include 

void
f (float32x4_t **ptr)
{
  float32x4_t res = vsetq_lane_f32 (0.0f, **ptr, 0);
  **ptr = res;
}


the final gimple .optimized dump looks like:


  float32x4_t __vec;
  float32x4_t * _1;
  __Float32x4_t _2;
  float32x4_t * _3;

   [local count: 1073741824]:
  _1 = *ptr_5(D);
  _2 = *_1;
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  __vec_8 = BIT_INSERT_EXPR <_2, 0.0, 0>;
  _3 = *ptr_5(D);
  *_3 = __vec_8;
  return;


where we still have two loads from *ptr.  This is because
__builtin_aarch64_im_lane_boundsi has no attributes:
it's modelled a general function that could do pretty
much anything.

Although we fix this testcase in the RTL optimisers, it's easy
for the issue to cause unoptimised code in larger, more realistic
testcases.

The problem is similar to PR95964.  The difficulty is that
here we have to model the function as having some kind of
side-effect, otherwise it will simply get optimised away.

Ideally we'd fix this by implementing the intrinsics directly
in the compiler and doing the checks in the frontend via
TARGET_CHECK_BUILTIN_CALL.  That's obviously a big change
though.  Until then, we should optimise away calls whose
arguments are already correct so that they don't clog
up the IL.

If not returning a value makes it harder to fold the call
for some reason, perhaps an alternative would be to pass
a vector value through a dummy const function, e.g.:

  _4 = __builtin_aarch64_... (_2, 16, 4, 0);
  __vec_8 = BIT_INSERT_EXPR <_4, 0.0, 0>;

That might not be necessary though -- haven't checked.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64

[Bug target/95974] New: AArch64 arm_neon.h stores interfere with gimple optimisations

2020-06-29 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95974

Bug ID: 95974
   Summary: AArch64 arm_neon.h stores interfere with gimple
optimisations
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
  Target Milestone: ---
Target: aarch64*-*-*

For:

---
#include 
#include 

std::vector a;

void
f (size_t n, float32x4_t v)
{
  for (size_t i = 0; i < n; i += 4)
vst1q_f32 (&a[i], v);
}
---

we generate code that loads the start address of
"a" in every iteration of the loop:

---
cbz x0, .L4
adrpx4, .LANCHOR0
add x4, x4, :lo12:.LANCHOR0
mov x1, 0
.p2align 3,,7
.L6:
ldr x3, [x4]
lsl x2, x1, 2
add x1, x1, 4
str q0, [x3, x2]
cmp x0, x1
bhi .L6
.L4:
ret
---

This is really the store equivalent of PR95962.  The problem is
that __builtin_aarch64_st1v4sf is modelled as a general function
that could read and write from arbitrary memory.  As with PR95962,
one option would be to lower to gimple accesses where possible,
at least for little-endian targets.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64

[Bug c++/92789] Non-obvious ?: behaviour with structurally equivalent types

2020-06-30 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92789

--- Comment #5 from rsandifo at gcc dot gnu.org  
---
Keeping open since the problem also affects arm*-*-*.  I don't
think we should change GCC 10 and earlier though.

[Bug c++/95726] ICE with aarch64 __Float32x4_t as template argument

2020-06-30 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95726

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Target|aarch64-linux   |aarch64*-*-* arm*-*-*

--- Comment #10 from rsandifo at gcc dot gnu.org  
---
Fixed on trunk for aarch64*-*-* only.  Still needs fixing for arm*-*-*
on trunk, and for both targets on branches.

[Bug tree-optimization/95961] ICE: in exact_div, at poly-int.h:2182

2020-07-02 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95961

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug tree-optimization/96075] [8/9/10/11 Regression] bogus alignment for negative step grouped access

2020-07-06 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96075

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #3)
> So we end up calling get_negative_load_store_type for this group which seems
> to only handle contiguous accesses but this one is single element
> interleaving
> aka contiguous with gap.
> 
> vect_supportable_dr_alignment returns dr_aligned which is seemingly OK for
> 
> #(Data Ref: 
> #  bb: 3 
> #  stmt: _4 = y[_2];
> #  ref: y[_2];
> #  base_object: y;
> #  Access function 0: {1022, +, -2}_1
Are you sure we support this?  I think…

>   /* If this is a backward running DR then first access in the larger
>  vectype actually is N-1 elements before the address in the DR.
>  Adjust misalign accordingly.  */
>   if (tree_int_cst_sgn (step) < 0)
> { 
>   tree offset = ssize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
>   /* DR_STEP(dr) is the same as -TYPE_SIZE of the scalar type,
>  otherwise we wouldn't be here.  */
…really was the assumption for negative steps at one time,
and I'm not sure off-hand when/if that changed.

(Of course, it might be that one of my patches changed it.)

[Bug tree-optimization/96091] ICE during dom: tree check: expected integer_cst, have poly_int_cst in to_wide, at tree.h:5911

2020-07-07 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96091

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #2)
> Why should we not have a VECTOR_CST of POLY_INT_CST elements?  If
> POLY_INT_CST
> is not "constant" then it shouldn't be tcc_constant?  Looks like
> 
> tree
> vector_cst_elt (const_tree t, unsigned int i)
> {
> ...
>   /* Otherwise work out the value from the last two encoded elements.  */
>   return wide_int_to_tree (TREE_TYPE (TREE_TYPE (t)),
>vector_cst_int_elt (t, i));
> }
> 
> should be using poly-ints and not wide-ints.  Richard?
Yeah, looks like it.  I think I was worried about cases
in which we could end up with poly_int*poly_int, from a
poly_int-long vector containing a poly_int “stepped”
vector constant.  But that's not a problem here, since
the index is always a plain integer.

[Bug target/95105] Bogus reference-compatibility error for arm_sve_vector_bits

2020-07-08 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95105

--- Comment #3 from rsandifo at gcc dot gnu.org  
---
Fixed.

[Bug target/95105] Bogus reference-compatibility error for arm_sve_vector_bits

2020-07-08 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95105

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
…and closed.

  1   2   3   4   5   6   7   8   9   10   >