[Bug target/89213] Optimize V2DI shifts by a constant on power8 & above systems.

2024-10-07 Thread meissner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89213

Michael Meissner  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Michael Meissner  ---
Patch committed on September 18th, 2024.

[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249

2024-10-07 Thread avieira at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997

--- Comment #1 from avieira at gcc dot gnu.org ---
Had a look at this and I see similar codegen for aarch64 when compiling for
big-endian.

If I disable tree-ifcvt (-fdisable-tree-ifcvt) I end up with:
MEM  [(void *)Ptr.0_1] = 30071062528; 

Which is the behaviour we want to see. This is achieved by store-merging, we
should have a look at how that pass handles this.

When ifcvt is enabled, lower_bitfield generates:
_ifc__24 = Ptr.0_1->D.4418;
_ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;


That gets optimized to:
Ptr.0_1->D.4418 = 3; 
Whereas I expected that to get big-endiannised (not a word I know) to =
0x60.

I was also surprised to see that the front-end already transforms:
 'if (GlobS.f2 != 3)'  into  '(BIT_FIELD_REF  & 4292870144) !=
6291456' 

Anyway that's as far as I got, not sure what the right solution is, should
BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> not fold to 0x6 ?
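
The 0x60 expectation can be checked with plain integer arithmetic. The sketch below is not GCC's folding code. It assumes a 16-bit container (a guess that makes the arithmetic come out to 0x60; the comment does not state the width of D.4418) and it assumes that on a big-endian target bit position 0 is counted from the most significant bit. The helper name insert_bits is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: insert the low WIDTH bits of VALUE into CONTAINER
   at bit position POS.  With BIG_ENDIAN set, POS counts from the most
   significant bit of a CONTAINER_BITS-wide container; otherwise it counts
   from the least significant bit.  */
uint64_t
insert_bits (uint64_t container, uint64_t value, unsigned pos,
             unsigned width, unsigned container_bits, int big_endian)
{
  assert (width > 0 && width < 64);
  assert (container_bits <= 64 && pos + width <= container_bits);

  unsigned shift = big_endian ? container_bits - pos - width : pos;
  uint64_t mask = ((UINT64_C (1) << width) - 1) << shift;
  return (container & ~mask) | ((value << shift) & mask);
}
```

Under these assumptions insert_bits (0, 3, 0, 11, 16, 1) yields 0x60, while the little-endian numbering yields 3, which is the value seen in the optimized dump.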

[Bug middle-end/117003] New: pr104783.c is miscompiled with offloading and results in segmentation fault during host-only execution for -O1 and above

2024-10-07 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117003

Bug ID: 117003
   Summary: pr104783.c is miscompiled with offloading and results
in segmentation fault during host-only execution for
-O1 and above
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
The following test (libgomp/pr104783.c):

int
main (void)
{
  unsigned val = 0;

#pragma omp target map(tofrom: val)
#pragma omp simd
  for (int i = 0 ; i < 1 ; i++)
    {
#pragma omp atomic update
      val = val + 1;
    }

  if (val != 1)
    __builtin_abort ();

  return 0;
}

Compiling with -O1 -fopenmp -foffload=nvptx-none and forcing host-only
execution results in segfault.

The issue here seems to be during omp lowering.

  D.4569 = .GOMP_USE_SIMT ();
  if (D.4569 != 0) goto ; else goto ;
  :
  {
void * simduid.2;
void * .omp_simt.3;
int i;

simduid.2 = .GOMP_SIMT_ENTER (simduid.2);
.omp_simt.3 = .GOMP_SIMT_ENTER_ALLOC (simduid.2);
#pragma omp simd _simduid_(simduid.2) _simt_ linear(i:1)
for (i = 0; i < 1; i = i + 1)
D.4577 = .omp_data_i->val;
#pragma omp atomic_load relaxed
  D.4558 = *D.4577
D.4559 = D.4558 + 1;
#pragma omp atomic_store relaxed (D.4559)
#pragma omp continue (i, i)
.GOMP_SIMT_EXIT (.omp_simt.3);
#pragma omp return(nowait)
  }
 goto ;
 :
  {
int i;

#pragma omp simd linear(i:1)
for (i = 0; i < 1; i = i + 1)
#pragma omp atomic_load relaxed
  D.4558 = *&*D.4577
D.4559 = D.4558 + 1;
#pragma omp atomic_store relaxed (D.4559)
#pragma omp continue (i, i)
#pragma omp return(nowait)
  }
 :
 ...

In the following stmt in simd code-path:
  #pragma omp atomic_load relaxed
D.4558 = *&*D.4577

D.4577 is uninitialized. D.4577 is initialized in the sibling
if-block containing simt code-path:
D.4577 = .omp_data_i->val;

which doesn't reach the use in atomic_load relaxed stmt,
and gets expanded to following in ompexp dump:
   :
  __atomic_fetch_add_4 (D.4590, 1, 0);

and thus we end up passing an uninitialized pointer to __atomic_fetch_add_4,
which results in a segfault.

While in the corresponding simt code-path, it's initialized correctly:
   :
  D.4590 = .omp_data_i->val;
  __atomic_fetch_add_4 (D.4590, 1, 0);
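
The data flow described above can be mimicked in plain C. This is only an illustration, not the lowered GIMPLE: struct omp_data, use_simt and bump are made-up names, and the sketch shows the well-formed variant in which each branch loads the pointer itself before the atomic update.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the .omp_data_i record; the name is hypothetical.  */
struct omp_data
{
  unsigned *val;
};

/* Each branch loads the pointer before using it.  In the reported dump
   only the SIMT branch did so, which left the SIMD branch reading a
   pointer that was never assigned on that path.  */
unsigned
bump (struct omp_data *data, int use_simt)
{
  assert (data != NULL && data->val != NULL);

  unsigned *p;
  if (use_simt)
    {
      p = data->val;   /* SIMT path: pointer loaded here.  */
      __atomic_fetch_add (p, 1, __ATOMIC_RELAXED);
    }
  else
    {
      p = data->val;   /* SIMD path: the load that is missing in the dump.  */
      __atomic_fetch_add (p, 1, __ATOMIC_RELAXED);
    }
  return *data->val;
}
```

Dropping the load in the else branch reproduces the reported shape: the pointer is assigned in one sibling block and used in the other.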

Thanks,
Prathamesh

[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-10-07

--- Comment #2 from Andrew Pinski  ---
>I was also surprised to see that the front-end already transforms:
 'if (GlobS.f2 != 3)'  into  '(BIT_FIELD_REF  & 4292870144) !=
6291456' 


That is from fold. Specifically optimize_bit_field_compare .
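
The two constants in that folded comparison are consistent with an 11-bit field occupying the top bits of a 32-bit word. That layout is an inference from the numbers, not something stated in the comments, and field_mask and field_value are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering a WIDTH-bit field whose lowest bit is at SHIFT within a
   32-bit word.  */
uint32_t
field_mask (unsigned width, unsigned shift)
{
  assert (width > 0 && width < 32 && shift + width <= 32);
  return (uint32_t) (((UINT64_C (1) << width) - 1) << shift);
}

/* VALUE moved into a field whose lowest bit is at SHIFT.  */
uint32_t
field_value (uint32_t value, unsigned shift)
{
  assert (shift < 32);
  return value << shift;
}
```

With width 11 and shift 21, field_mask gives 4292870144 (0xFFE00000) and field_value (3, 21) gives 6291456 (0x600000), matching the mask and the comparison constant quoted above.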

[Bug target/80881] Implement Windows native TLS

2024-10-07 Thread tanksherman27 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881

--- Comment #25 from Julian Waters  ---
Created attachment 59290
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59290&action=edit
Newer patch for TLS support, incomplete

[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997

--- Comment #3 from Andrew Pinski  ---
https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537612.html

I can't remember if the fix was committed or not ...

[Bug d/117002] New: lifetime.d: In function ‘_d_newclassT’: error: size of array element is not a multiple of its alignment with -Warray-bounds and -O2

2024-10-07 Thread a.horodniceanu at proton dot me via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117002

Bug ID: 117002
   Summary: lifetime.d: In function ‘_d_newclassT’: error: size of
array element is not a multiple of its alignment with
-Warray-bounds and -O2
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: a.horodniceanu at proton dot me
  Target Milestone: ---

Reported first at https://bugs.gentoo.org/940750 and reduced to:
--
module object;

extern(C++) class Foo {
ubyte[4] not_multiple_of_8;
}

extern(C)
int main () {
// avoid optimizations
void* p = cast(void*)(0xdeadbeef);
auto init = __traits(initSymbol, Foo);

p[0 .. init.length] = init[];
return 0;
}
--
Compile with:
--
$ gdc repro.d -Warray-bounds -O2
/root/repro.d: In function ‘main’:
/root/repro.d:8:5: error: size of array element is not a multiple of its
alignment
8 | int main () {
  | ^
--

The code is a reduction of _d_newclassT in core/lifetime.d with the original
error being:
--
$ cat repro.d
class Foo {
ubyte[4] not_multiple_of_8;
}

void foo () {
new Foo();
}
$ gdc repro.d -Warray-bounds -O2
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/d/core/lifetime.d: In function
‘_d_newclassT’:
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/d/core/lifetime.d:2725:3: error:
size of array element is not a multiple of its alignment
 2725 | T _d_newclassT(T)() @trusted
  |   ^
--

[Bug c/116735] ICE in build_counted_by_ref

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116735

--- Comment #4 from GCC Commits  ---
The master branch has been updated by Qing Zhao :

https://gcc.gnu.org/g:9a17e6d03c6ed53e3b2dfd2c3ff9b1066ffa97b9

commit r15-4122-g9a17e6d03c6ed53e3b2dfd2c3ff9b1066ffa97b9
Author: qing zhao 
Date:   Mon Sep 30 18:29:29 2024 +

c: ICE in build_counted_by_ref [PR116735]

When handling the counted_by attribute, if the corresponding field
doesn't exist, in addition to issuing an error, we should also remove
the already added non-existing "counted_by" attribute from the
field_decl.

PR c/116735

gcc/c/ChangeLog:

* c-decl.cc (verify_counted_by_attribute): Remove the attribute
when error.

gcc/testsuite/ChangeLog:

* gcc.dg/flex-array-counted-by-9.c: New test.

[Bug middle-end/117002] lifetime.d: In function ‘_d_newclassT’: error: size of array element is not a multiple of its alignment with -Warray-bounds and -O2

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117002

Andrew Pinski  changed:

   What|Removed |Added

   Last reconfirmed||2024-10-07
 Ever confirmed|0   |1
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116481
 Status|UNCONFIRMED |NEW

--- Comment #1 from Andrew Pinski  ---
100% related to PR 116481 which is about arrays of function types.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #13 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:2abd04d01bc4e18158c785e75c91576b836f3ba6

commit r15-4113-g2abd04d01bc4e18158c785e75c91576b836f3ba6
Author: Richard Sandiford 
Date:   Mon Oct 7 13:03:04 2024 +0100

vect: Restructure repeating_p case for SLP permutations [PR116583]

The repeating_p case previously handled the specific situation
in which the inputs have N lanes and the output has N lanes,
where N divides the number of vector elements.  In that case,
every output uses the same permute vector.

The code was therefore structured so that the outer loop only
constructed one permute vector, with an inner loop generating
as many VEC_PERM_EXPRs from it as required.

However, the main patch for PR116583 adds support for cycling
through N permute vectors, rather than just having one.
The current structure doesn't really handle that case well.
(We'd need to interleave the results after generating them,
which sounds a bit fragile.)

This patch instead makes the transform phase calculate each output
vector's permutation explicitly, like for the !repeating_p path.
As a bonus, it gets rid of one use of SLP_TREE_NUMBER_OF_VEC_STMTS.

This arguably undermines one of the justifications for using repeating_p
for constant-length vectors: that the repeating_p path involved less
work than the !repeating_p path.  That justification does still hold for
the analysis phase, though, and that should be the more time-sensitive
part.  And the other justification -- to get more coverage of the code --
still applies.  So I'd prefer that we continue to use repeating_p for
constant-length vectors unless that causes a known missed optimisation.

gcc/
PR tree-optimization/116583
* tree-vect-slp.cc (vectorizable_slp_permutation_1): Remove
the noutputs_per_mask inner loop and instead generate a
separate permute vector for each output.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #11 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:1048ebbbdc98a5928a974356d7f4244603b6bd32

commit r15-4110-g1048ebbbdc98a5928a974356d7f4244603b6bd32
Author: Richard Sandiford 
Date:   Mon Oct 7 13:03:02 2024 +0100

aarch64: Handle SVE modes in aarch64_evpc_reencode [PR116583]

For Advanced SIMD modes, aarch64_evpc_reencode tests whether
a permute in a narrow element mode can be done more cheaply
in a wider mode.  For example, { 0, 1, 8, 9, 4, 5, 12, 13 }
on V8HI is a natural TRN1 on V4SI ({ 0, 4, 2, 6 }).

This patch extends the code to handle SVE data and predicate
modes as well.  This is a prerequisite to getting good results
for PR116583.

gcc/
PR target/116583
* config/aarch64/aarch64.cc (aarch64_coalesce_units): New function,
extending the Advanced SIMD handling from...
(aarch64_evpc_reencode): ...here to SVE data and predicate modes.

gcc/testsuite/
PR target/116583
* gcc.target/aarch64/sve/permute_1.c: New test.
* gcc.target/aarch64/sve/permute_2.c: Likewise.
* gcc.target/aarch64/sve/permute_3.c: Likewise.
* gcc.target/aarch64/sve/permute_4.c: Likewise.
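
The V8HI to V4SI example in the commit message can be reproduced with a small index transformation. This is a sketch of the idea only, not the code in aarch64.cc: coalesce_pairs is a hypothetical name, it handles only a doubling of the element width, and it ignores endianness.

```c
#include <assert.h>
#include <stddef.h>

/* Try to re-express a permute over N narrow elements as a permute over
   N / 2 elements of twice the width.  Returns 1 and fills WIDE on
   success, or 0 if some pair of indices does not select one aligned
   wide element.  */
int
coalesce_pairs (const unsigned *narrow, size_t n, unsigned *wide)
{
  assert (n % 2 == 0);
  for (size_t i = 0; i < n; i += 2)
    {
      unsigned lo = narrow[i];
      unsigned hi = narrow[i + 1];
      /* The pair must start on an even index and be consecutive.  */
      if (lo % 2 != 0 || hi != lo + 1)
        return 0;
      wide[i / 2] = lo / 2;
    }
  return 1;
}
```

Applied to { 0, 1, 8, 9, 4, 5, 12, 13 } this produces { 0, 4, 2, 6 }, the TRN1 selector mentioned in the commit message.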

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #12 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:1732298d51028ae50a802e538df5d7249556255d

commit r15-4112-g1732298d51028ae50a802e538df5d7249556255d
Author: Richard Sandiford 
Date:   Mon Oct 7 13:03:03 2024 +0100

vect: Variable lane indices in vectorizable_slp_permutation_1 [PR116583]

The main patch for PR116583 needs to create variable indices into
an input vector.  This pre-patch changes the types to allow that.

There is no pretty-print format for poly_uint64 because of issues
with passing C++ objects through "...".

gcc/
PR tree-optimization/116583
* tree-vect-slp.cc (vectorizable_slp_permutation_1): Using
poly_uint64 for scalar lane indices.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #14 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:8157f3f2d211bfbf53fbf8dd209b47ce583f4142

commit r15-4114-g8157f3f2d211bfbf53fbf8dd209b47ce583f4142
Author: Richard Sandiford 
Date:   Mon Oct 7 13:03:04 2024 +0100

vect: Support more VLA SLP permutations [PR116583]

This is the main patch for PR116583.  Previously, we only
supported VLA SLP permutations for which the output and inputs
have the same number of lanes, and for which that number of
lanes divides the number of vector elements.

The patch extends this to handle:

(1) "packs" of a single 2N-vector input into an N-vector output
(2) "unpacks" of N-vector inputs into an XN-vector output

Hopefully the comments in the code explain the approach.

The contents of the:

  for (unsigned i = 0; i < ncopies; ++i)

loop do not change; the patch simply adds an outer loop around it.

The patch removes the XFAIL in slp-13.c and also improves
the SVE vect.exp results with vect-force-slp=1.  I haven't
added new tests specifically for this, since presumably the
existing ones will cover it once the SLP switch is flipped.

gcc/
PR tree-optimization/116583
* tree-vect-slp.cc (vectorizable_slp_permutation_1): Handle
variable-length pack and unpack permutations.

gcc/testsuite/
PR tree-optimization/116583
* gcc.dg/vect/slp-13.c: Remove xfail for vect_variable_length.
* gcc.dg/vect/slp-13-big-array.c: Likewise.
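
The bug title mentions even/odd extraction, which can be read as one instance of packing a 2N-lane input into an N-lane output. The scalar sketch below shows only what such a permutation computes. It uses a run-time lane count rather than real variable-length vectors, and extract_even_odd is a hypothetical name, not a function in the vectorizer.

```c
#include <assert.h>
#include <stddef.h>

/* Copy every second lane of a 2 * N_OUT lane input into an N_OUT lane
   output, starting at lane ODD (0 for the even lanes, 1 for the odd
   lanes).  */
void
extract_even_odd (const int *in, size_t n_out, int odd, int *out)
{
  assert (odd == 0 || odd == 1);
  for (size_t i = 0; i < n_out; ++i)
    out[i] = in[2 * i + (size_t) odd];
}
```

Because the selected lane for output i is 2 * i + odd, the selector depends on the vector length only through the number of outputs, which is the kind of pattern that can be described without knowing the length at compile time.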

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #15 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:03299164830e19405b35a5fa862e248df4ea01e2

commit r15-4115-g03299164830e19405b35a5fa862e248df4ea01e2
Author: Richard Sandiford 
Date:   Mon Oct 7 13:03:05 2024 +0100

vect: Add more dump messages for VLA SLP permutation [PR116583]

Taking the !repeating_p route for VLA vectors causes analysis
to fail, but it wasn't clear from the dump files when this
had happened, and which node caused it.

gcc/
PR tree-optimization/116583
* tree-vect-slp.cc (vectorizable_slp_permutation_1): Add more
dump messages.

[Bug c/116735] ICE in build_counted_by_ref

2024-10-07 Thread qinzhao at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116735

qinzhao at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from qinzhao at gcc dot gnu.org ---
Fixed in GCC 15.

[Bug ada/114964] Ada Address_To_Access_Conversions gnat_to_gnu_entity internal error

2024-10-07 Thread ken at pegasoft dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114964

Ken Burtch  changed:

   What|Removed |Added

 Resolution|--- |WONTFIX
 Status|WAITING |RESOLVED

--- Comment #6 from Ken Burtch  ---
Your statement does not make sense to me, as the bug is clearly reproducible
but your process is not adequate.

I will close this bug report.

[Bug middle-end/116983] counted_by not used to identify singleton pointers

2024-10-07 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983

--- Comment #3 from Jakub Jelinek  ---
Plus the useless pointer conversions in GIMPLE can mean that
void *foo (int);
struct counted {
int counter;
int array[] __attribute__((counted_by(counter)));
};
struct notcounted {
int counter;
int array[];
};
int
bar (int x)
{
  void *p = foo (x);
  if (x & 1)
{
  struct counted *q = (struct counted *) p;
  use (q);
  return __builtin_dynamic_object_size (q, 0);
}
  else
{
  struct notcounted *r = (struct notcounted *) p;
  use2 (r);
  return __builtin_dynamic_object_size (r, 0);
}
}
with CSE/SCCVN whether one gets actually struct counted * or struct notcounted
* or void * pointer type is a lottery.

[Bug c++/117004] Unexpected const variable type with decltype of non-type template parameter of deduced type

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004

--- Comment #1 from Andrew Pinski  ---
I think there is a dup of this one around.

[Bug c++/117008] New: -march=native pessimization of 25% with bitset

2024-10-07 Thread mattreecebentley at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

Bug ID: 117008
   Summary: -march=native pessimization of 25% with bitset
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mattreecebentley at gmail dot com
  Target Milestone: ---

Created attachment 59292
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59292&action=edit
ii file

Overview: Found a sequence of code using bitset where using -march=native and
-O2 is 25% slower than just -O2, on an Intel i7-9750H. Repeatable, also
occurs on an Intel i7-3770 but with a much lower decrease in performance
(around 5%).

At -O2 runtime duration is ~15 seconds, -march=native;-O2 it's ~20 seconds.

Have used a PCG-based rand() in the code since regular rand() slows down the
program by 5x (difference in runtime duration between -march=native;-O2 and -O2
is still the same, but percentage change is dramatically influenced by rand
taking all the CPU time).

Code adds values to a total to prevent the code being optimized out. And yes,
.count() would've probably been more typical code. 


GCC version: 13.2.0


System type: x86_64-w64-mingw32, windows, x64


Configured with:

 ../src/configure --enable-languages=c,c++ --build=x86_64-w64-mingw32
--host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --disable-multilib
--prefix=/e/temp/gcc/dest --with-sysroot=/e/temp/gcc/dest
--disable-libstdcxx-pch --disable-libstdcxx-verbose --disable-nls
--disable-shared --disable-win32-registry --enable-threads=posix
--enable-libgomp --with-zstd=/c/mingw


The complete command line that triggers the bug, for -O2 only: 

C:/programming/libraries/nuwen/bin/g++.exe  -c 
"C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp"
-O2 -std=c++23 -s -save-temps -DNDEBUG  -o
build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o
build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt
-L.   -O2 -s

Or for -march=native:

C:/programming/libraries/nuwen/bin/g++.exe  -c 
"C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp"
-O2 -march=native -std=c++23 -s -save-temps -DNDEBUG  -o
build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o
build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt
-L. -s


The compiler output (error messages, warnings, etc.):

No errors or warnings.

[Bug middle-end/117009] New: Wall should be in common.opt rather than the language specific .opt

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117009

Bug ID: 117009
   Summary: Wall should be in common.opt rather than the language
specific .opt
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: internal-improvement
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: pinskia at gcc dot gnu.org
  Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---

While helping jemarch (on IRC) with language specific option handling, it was
noticed that Wall is defined in some .opt files.

It should be defined in common.opt and added to a language .opt file only if
it needs special handling beyond EnabledBy.


Right now it is in:
c-family/c.opt:Wall
d/lang.opt:Wall
fortran/lang.opt:Wall
go/lang.opt:Wall
m2/lang.opt:Wall
rust/lang.opt:Wall


But it should only be in c-family/c.opt, d/lang.opt and m2/lang.opt:
c-family/c-opts.cc:case OPT_Wall:
d/d-lang.cc:case OPT_Wall:
m2/gm2-lang.cc:case OPT_Wall:


Note Wextra does not need to be in any of these either:
c-family/c.opt:Wextra
d/lang.opt:Wextra
fortran/lang.opt:Wextra
Because there is no special handling of OPT_Wextra.

[Bug middle-end/117009] Wall should be in common.opt rather than the language specific .opt

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117009

Andrew Pinski  changed:

   What|Removed |Added

   Last reconfirmed||2024-10-08
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Andrew Pinski  ---
Mine to handle.

[Bug target/117008] -march=native pessimization of 25% with bitset

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

--- Comment #1 from Andrew Pinski  ---
Can you provide the output of invoking g++ with -march=native and when
compiling?

[Bug analyzer/116996] New: Missed Detection of Null Pointer Dereference Issues

2024-10-07 Thread tianxinghe at smail dot nju.edu.cn via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116996

Bug ID: 116996
   Summary: Missed Detection of Null Pointer Dereference Issues
   Product: gcc
   Version: 11.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: analyzer
  Assignee: dmalcolm at gcc dot gnu.org
  Reporter: tianxinghe at smail dot nju.edu.cn
  Target Milestone: ---

$ gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) 

gcc 1.c -fanalyzer

inputs can be ./t 1826809252 0 0

The checker misses the null dereference bug in line: v7 = *v6;
This bug can be detected by GCC 14.2.0.

The bug path: 
main: entry - block2 - block5 
func: entry - block1 - block4 - block5 - block2 - block6


minimal test case:
#include 
#include 
#ifndef __cplusplus
typedef unsigned char bool;
#endif

void init(char**);
uint8_t* malloc(uint64_t);
uint64_t atol(uint8_t*);
void free(uint8_t*);
int main(int, char **);
void func(uint64_t**, uint64_t*, uint64_t**, uint64_t);

uint64_t* args;

void init(char** argv) {
 args = (uint64_t*) malloc(3 * sizeof(uint64_t));
 for (int i = 1; i <= 3; ++i) {
 args[i - 1] = atol(argv[i]);
 }
}

int main(int argc, char ** argv) {
  uint64_t input0;
  uint64_t* v1;
  uint64_t* v2;
  uint64_t v3;
  uint64_t v4;
  bool v5;
  uint64_t v6;

  init(argv);
  input0 = args[0];
  v3 = input0 * 37;
  v2 = (uint64_t*)/*NULL*/0;
  v6 = 1320690439;
  if (input0 == 1826809252) {
goto block2;
  } else {
goto block3;
  }

block5:
  v1 = &v4;
  *v1 = 171952983;
  func(&v1, &v3, &v2, input0);
  return 0;

block2:
  if (input0 >= 2606378767) {
goto block4;
  } else {
goto block5;
  }

block3:
  v2 = &v6;
  goto block2;

block4:
  v2 = &v6;
  goto block5; 
}

void func(uint64_t** a1, uint64_t* a2, uint64_t** a3, uint64_t a4) {
  uint64_t v1;
  uint8_t v2;
  uint64_t v3;
  uint64_t v4;
  uint64_t* v5;
  uint64_t* v6;
  uint64_t v7;
  uint64_t* v8;
  uint64_t** v9;
  uint64_t* v10;
  uint64_t** v11;
  uint64_t* v12;
  uint64_t v13;
  uint64_t* v14;

  v2 = ((uint8_t)a4);
  v3 = *a2;
  v4 = ((int64_t)(int8_t)v2);
  if (v4 >= (int64_t)v3) {
goto block1;
  } else {
goto block2;
  }

block5:
  goto block2;

block4:
  v5 = *v9;
  *v5 = 1;
  goto block5;

block6:
  v6 = *a3;
  v7 = *v6;
  v13 = v7;
  return;

block3:
  *v9 = (&v1);
  goto block4;

block2:
  v8 = *a1;
  *v8 = 1;
  goto block6;

block1:
  v9 = (&v14);
  v14 = &v13;
  if ((int64_t)v3 <= (int64_t)v4) {
goto block3;
  } else {
goto block4;
  }
}

[Bug analyzer/116995] New: Missed Detection of Null Pointer Dereference Issues

2024-10-07 Thread tianxinghe at smail dot nju.edu.cn via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116995

Bug ID: 116995
   Summary: Missed Detection of Null Pointer Dereference Issues
   Product: gcc
   Version: 11.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: analyzer
  Assignee: dmalcolm at gcc dot gnu.org
  Reporter: tianxinghe at smail dot nju.edu.cn
  Target Milestone: ---

$ gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) 

gcc 1.c -fanalyzer

inputs can be ./t 1826809252 0 0

The checker misses the null dereference bug in line: v7 = *v6;
This bug can be detected by GCC 14.2.0.

The bug path: 
main: entry - block2 - block5 
func: entry - block1 - block4 - block5 - block2 - block6


minimal test case:
#include 
#include 
#ifndef __cplusplus
typedef unsigned char bool;
#endif

void init(char**);
uint8_t* malloc(uint64_t);
uint64_t atol(uint8_t*);
void free(uint8_t*);
int main(int, char **);
void func(uint64_t**, uint64_t*, uint64_t**, uint64_t);

uint64_t* args;

void init(char** argv) {
 args = (uint64_t*) malloc(3 * sizeof(uint64_t));
 for (int i = 1; i <= 3; ++i) {
 args[i - 1] = atol(argv[i]);
 }
}

int main(int argc, char ** argv) {
  uint64_t input0;
  uint64_t* v1;
  uint64_t* v2;
  uint64_t v3;
  uint64_t v4;
  bool v5;
  uint64_t v6;

  init(argv);
  input0 = args[0];
  v3 = input0 * 37;
  v2 = (uint64_t*)/*NULL*/0;
  v6 = 1320690439;
  if (input0 == 1826809252) {
goto block2;
  } else {
goto block3;
  }

block5:
  v1 = &v4;
  *v1 = 171952983;
  func(&v1, &v3, &v2, input0);
  return 0;

block2:
  if (input0 >= 2606378767) {
goto block4;
  } else {
goto block5;
  }

block3:
  v2 = &v6;
  goto block2;

block4:
  v2 = &v6;
  goto block5; 
}

void func(uint64_t** a1, uint64_t* a2, uint64_t** a3, uint64_t a4) {
  uint64_t v1;
  uint8_t v2;
  uint64_t v3;
  uint64_t v4;
  uint64_t* v5;
  uint64_t* v6;
  uint64_t v7;
  uint64_t* v8;
  uint64_t** v9;
  uint64_t* v10;
  uint64_t** v11;
  uint64_t* v12;
  uint64_t v13;
  uint64_t* v14;

  v2 = ((uint8_t)a4);
  v3 = *a2;
  v4 = ((int64_t)(int8_t)v2);
  if (v4 >= (int64_t)v3) {
goto block1;
  } else {
goto block2;
  }

block5:
  goto block2;

block4:
  v5 = *v9;
  *v5 = 1;
  goto block5;

block6:
  v6 = *a3;
  v7 = *v6;
  v13 = v7;
  return;

block3:
  *v9 = (&v1);
  goto block4;

block2:
  v8 = *a1;
  *v8 = 1;
  goto block6;

block1:
  v9 = (&v14);
  v14 = &v13;
  if ((int64_t)v3 <= (int64_t)v4) {
goto block3;
  } else {
goto block4;
  }
}

[Bug tree-optimization/116998] [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998

Richard Biener  changed:

   What|Removed |Added

   Keywords||needs-testcase

--- Comment #1 from Richard Biener  ---
As noted in the bug, what PRE considers possibly trapping is quite conservative
(and has to be, as no flow-sensitive info can be used easily).  But the fix was
a correctness one so I'm not sure what we can do.

Needs a testcase showing the 400.perlbench case for analysis.

[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #5 from Richard Biener  ---
(In reply to Andrew Pinski from comment #4)
> x86:
> 
> /app/example.cpp:6:1: note: Cost model analysis for part in loop 0:
>   Vector cost: 172
>   Scalar cost: 184
> 
> 
> 
> aarch64:
> /app/example.cpp:6:1: note: Cost model analysis for part in loop 0:
>   Vector cost: 37
>   Scalar cost: 12
> 
> So yes a cost model issue.

Note the vectorizer only does overall costing vector vs. scalar, it doesn't
cost the variant with no addsub because on the vector side that's inferior
(it would use blend).  The vectorizer doesn't see FMAs, those get introduced
later.  Backend costing would need to anticipate FMA use for scalar
costing and anticipate FMA cannot be used in the vector version.  That's
currently very difficult due to the lack of data dependence info on the
vector cost side.

If you want FMA for precision reasons you should probably use std::fma(..)
directly.

[Bug tree-optimization/116982] [14/15 Regression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
Version|unknown |15.0
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
I will have a look.

[Bug c++/113958] support visibility attribute for typeinfo symbol

2024-10-07 Thread nikolasklauser at berlin dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113958

Nikolas Klauser  changed:

   What|Removed |Added

 CC||nikolasklauser at berlin dot de

--- Comment #4 from Nikolas Klauser  ---
This would also be really useful for libc++ in a slightly different usage
pattern. Generally, we want everything to have hidden visibility to avoid
baking lots of functions into the ABI, but the typeinfo and friends have to
have default visibility to support `dynamic_cast`ing across shared libraries.
If we have the attribute available we can simply have `namespace
__attribute__((visibility("hidden"), type_visibility("default"))) std { ... }`
instead of having to annotate every single class. This would save us north of
800 annotations (plus ones we're missing) across the library, since we can roll
this into our `_LIBCPP_BEGIN_NAMESPACE_STD` macro.

[Bug analyzer/116996] Missed Detection of Null Pointer Dereference Issues

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116996

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Richard Biener  ---
.

*** This bug has been marked as a duplicate of bug 116995 ***

[Bug analyzer/116995] Missed Detection of Null Pointer Dereference Issues

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116995

--- Comment #1 from Richard Biener  ---
*** Bug 116996 has been marked as a duplicate of this bug. ***

[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997

Richard Biener  changed:

   What|Removed |Added

 CC||avieira at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Blocks||53947
   Target Milestone|--- |13.4
Version|unknown |15.0


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/116855] [14/15 Regression] Unsafe early-break vectorization

2024-10-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855

--- Comment #7 from rguenther at suse dot de  ---
On Sun, 6 Oct 2024, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855
> 
> --- Comment #5 from Feng Xue  ---
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > I would suggest to add a STMT_VINFO_ENSURE_NOTRAP or so and delay actual
> > > verification to vectorizable_load when both vector type and VF are fixed.
> > > We'd then likely need a LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P set
> > > conservatively the other way around from
> > > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.
> > > Alignment peeling could then peel if STMT_VINFO_ENSURE_NOTRAP and the
> > > target cannot do full loop masking.
> > 
> > 
> > Yeah the original reported testcase is fine as the alignment makes it safe.
> > For the manually misaligned case that Andrew posted it makes sense to delay
> > and re-evaluate later on.
> > 
> > I don't think we should bother peeling though, I don't think they're that
> > common and alignment peeling breaks some dominators and exposes some
> > existing vectorizer bugs, which is being fixed in Alex's patch.
> > 
> > So at least alignment peeling I'll defer to a later point and instead just
> > reject loops that are loading from structures the user misaligned wrt to the
> > vector mode.
> > 
> > 
> > So mine..
> 
> Actually, what I wish is that we could allow vectorization on early break case
> for arbitrary address pointer (knowing nothing about alignment and bound)
> based on some sort of assumption specified via command option under -Ofast,
> as the mentioned example:

I'd rather not have more command-line options gating "unsafe" transforms
but instead have source-level control per loop via pragma.  It should
probably specify a length like simdlen, specifying that accessing
[start + n * accesslen, start + (n+1)*accesslen - 1] is OK when
the scalar loop accesses [start + n * accesslen] or so.

> char * find(char *string, size_t n, char c)
> {
> for (size_t i = 0; i < n; i++) {
> if (string[i] == c)
> return &string[i];
> }
> return 0;
> }
> 
> and example for which there is no way to do peeling to align more than one
> address pointers:
> 
> int compare(char *string1, char *string2, size_t n)
> {
> for (size_t i = 0; i < n; i++) {
> if (string1[i] != string2[i])
> return string1[i] - string2[i];
> }
> return 0;
> }

[Bug target/116934] [15 Regression] ICE building 526.blender_r

2024-10-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116934

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from ktkachov at gcc dot gnu.org ---
Thanks for the fix!

[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-10-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Alex Coplan  ---
Fixed.

[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

--- Comment #6 from GCC Commits  ---
The master branch has been updated by Alex Coplan :

https://gcc.gnu.org/g:7faadb1f261c6b8ef988c400c39ec7df09839dbe

commit r15-4106-g7faadb1f261c6b8ef988c400c39ec7df09839dbe
Author: Alex Coplan 
Date:   Thu Sep 26 16:36:48 2024 +0100

testsuite: Prevent unrolling of main in LTO test [PR116683]

In r15-3585-g9759f6299d9633cabac540e5c893341c708093ac I added a test which
started failing on PowerPC.  The test checks that we unroll exactly one loop
three times with the following:

// { dg-final { scan-ltrans-rtl-dump-times "Unrolled loop 3 times" 1 "loop2_unroll" } }

which passes on most targets.  However, on PowerPC, the loop in main
gets unrolled too, causing the scan-ltrans-rtl-dump-times check to fail
as the statement now appears twice in the dump.  I think the extra
unrolling is due to different unrolling heuristics in the rs6000 port.

This patch therefore explicitly tries to block the unrolling in main with an
appropriate #pragma.

gcc/testsuite/ChangeLog:

PR testsuite/116683
* g++.dg/ext/pragma-unroll-lambda-lto.C (main): Add #pragma to
prevent unrolling of the setup loop.
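
The commit message does not quote the pragma itself. As an assumption, one
GCC-documented way to block unrolling of a single loop is #pragma GCC unroll
with a value of 0 or 1, sketched here on a hypothetical setup loop (sum_setup
is an illustrative name, not taken from the test):

```cpp
#include <cstddef>

// Hypothetical setup-style loop. The pragma asks GCC not to unroll the loop
// that immediately follows it; other compilers may ignore or warn about it.
int sum_setup(const int* data, std::size_t n)
{
  int total = 0;
#pragma GCC unroll 0
  for (std::size_t i = 0; i < n; ++i)
    total += data[i];
  return total;
}
```

The pragma only constrains the unroller; the function's result is unchanged.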

[Bug tree-optimization/116998] New: [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb

2024-10-07 Thread pheeck at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998

Bug ID: 116998
   Summary: [15 Regression] 5% slowdown of 400.perlbench on AMD
Zen3/4 since r15-3986-g3e1bd6470e4deb
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: pheeck at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-pc-linux-gnu
Target: x86_64-pc-linux-gnu

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=956.10.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=469.10.0

there was a ~5% exec time slowdown of the 400.perlbench SPEC 2006 benchmark
when run with -O2 -flto on AMD Zen3/4 machines (maybe also on other Zen
microarchs).  I bisected the slowdown to r15-3986-g3e1bd6470e4deb (Cc-ing
richi).

See comparison with other branches here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=938.10.0&plot.1=999.10.0&plot.2=971.10.0&plot.3=1010.10.0&plot.4=956.10.0&;


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

[Bug target/116999] New: Fold SVE whilelt/le comparisons with max int value to ptrue

2024-10-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116999

Bug ID: 116999
   Summary: Fold SVE whilelt/le comparisons with max int value to
ptrue
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Example testcase:
#include <arm_sve.h>
#include <limits.h>


svbool_t
foo_s32_le (int32_t x)
{
  return svwhilele_b64_s32 (x, INT_MAX);
}

svbool_t
foo_s64_le (int64_t x)
{
  return svwhilele_b64_s64 (x, LONG_LONG_MAX);
}

can avoid generating the WHILELE instructions and just generate a PTRUE.
This is as per the WHILELE documentation:
"If the second scalar operand is equal to the maximum signed integer value then
a condition which includes an equality test can never fail and the result will
be an all-true predicate."

Note that we probably want to look at the use of the flags from the whilele as
well. If we cannot prove that the NZCV flags are unused then we have to generate
a PTRUES instead, I think.
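
The quoted property can be checked with a small scalar model. This is only a
sketch: whilele_model is a hypothetical name, the lane count is a free
parameter, and the model assumes the per-lane increment wraps around, which is
what the quoted documentation implies.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical scalar model of a 32-bit signed "while less-or-equal"
// predicate: lane i is active while (x + i) <= y held for all lanes so far.
// The increment is done in unsigned arithmetic so that it wraps without
// signed overflow.
std::vector<bool> whilele_model(std::int32_t x, std::int32_t y, unsigned lanes)
{
  std::vector<bool> pred(lanes);
  std::uint32_t cur = static_cast<std::uint32_t>(x);
  bool active = true;
  for (unsigned i = 0; i < lanes; ++i) {
    // Once a lane fails the comparison, all later lanes stay inactive.
    active = active && static_cast<std::int32_t>(cur) <= y;
    pred[i] = active;
    ++cur;  // wraps modulo 2^32
  }
  return pred;
}
```

With y equal to std::numeric_limits<std::int32_t>::max() every lane is active
regardless of x, which is the all-true (PTRUE-like) result; with x = 0 and
y = 1 only the first two lanes are active.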

[Bug tree-optimization/116998] [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb

2024-10-07 Thread pheeck at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998

Filip Kastl  changed:

   What|Removed |Added

   Target Milestone|--- |15.0

[Bug target/116999] Fold SVE whilelt/le comparisons with max int value to ptrue

2024-10-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116999

--- Comment #1 from ktkachov at gcc dot gnu.org ---
This is inspired by the LLVM PR
https://github.com/llvm/llvm-project/pull/83

[Bug middle-end/116896] codegen for <=> compared to hand-written equivalent

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116896

--- Comment #29 from GCC Commits  ---
The master branch has been updated by Jakub Jelinek :

https://gcc.gnu.org/g:37554bacfd38b1466278b529d9e70a44d7b1b909

commit r15-4105-g37554bacfd38b1466278b529d9e70a44d7b1b909
Author: Jakub Jelinek 
Date:   Mon Oct 7 10:50:39 2024 +0200

ssa-math-opts, i386: Improve spaceship expansion [PR116896]

The PR notes that we don't emit optimal code for C++ spaceship
operator if the result is returned as an integer rather than the
result just being compared against different values and different
code executed based on that.
So e.g. for
template <typename T>
auto foo (T x, T y) { return x <=> y; }
for both floating point types, signed integer types and unsigned integer
types.  auto in that case is std::strong_ordering or std::partial_ordering,
which are fancy C++ abstractions around struct with signed char member
which is -1, 0, 1 for the strong ordering and -1, 0, 1, 2 for the partial
ordering (but for -ffast-math 2 is never the case).
I'm afraid functions like that are fairly common and unless they are
inlined, we really need to map the comparison to those -1, 0, 1 or
-1, 0, 1, 2 values.

Now, for floating point spaceship I've in the past already added an
optimization (with tree-ssa-math-opts.cc discovery and named optab, the
optab only defined on x86 though right now), which ensures there is just
a single comparison instruction and then just tests based on flags.
Now, if we have code like:
  auto a = x <=> y;
  if (a == std::partial_ordering::less)
bar ();
  else if (a == std::partial_ordering::greater)
baz ();
  else if (a == std::partial_ordering::equivalent)
qux ();
  else if (a == std::partial_ordering::unordered)
corge ();
etc., that results in decent code generation, the spaceship named pattern
on x86 optimizes for the jumps, so emits comparisons on the flags, followed
by setting the result to -1, 0, 1, 2 and subsequent jump pass optimizes that
well.  But if the result needs to be stored into an integer and just
returned that way or there are no immediate jumps based on it (or turned
into some non-standard integer values like -42, 0, 36, 75 etc.), then CE
doesn't do a good job for that, we end up with say
comiss  %xmm1, %xmm0
jp      .L4
seta    %al
movl    $0, %edx
leal    -1(%rax,%rax), %eax
cmove   %edx, %eax
ret
.L4:
movl    $2, %eax
ret
The jp is good, that is the unlikely case and can't be easily handled in
straight line code due to the layout of the flags, but the rest uses cmov
which often isn't a win and a weird math.
With the patch below we can get instead
xorl    %eax, %eax
comiss  %xmm1, %xmm0
jp      .L2
seta    %al
sbbl    $0, %eax
ret
.L2:
movl    $2, %eax
ret

The patch changes the discovery in the generic code, by detecting if
the future .SPACESHIP result is just used in a PHI with -1, 0, 1 or
-1, 0, 1, 2 values (the latter for HONOR_NANS) and passes that as a flag in
a new argument to .SPACESHIP ifn, so that the named pattern is told whether
it should optimize for branches or for loading the result into a -1, 0, 1
(, 2) integer.  Additionally, it doesn't detect just floating point <=>
anymore, but also integer and unsigned integer, but in those cases only
if an integer -1, 0, 1 is wanted (otherwise == and > or similar comparisons
result in good code).
The backend then can for those integer or unsigned integer <=>s return
effectively (x > y) - (x < y) in a way that is efficient on the target
(so for x86 with ensuring zero initialization first when needed before
setcc; one for floating point and unsigned, where there is just one setcc
and the second one optimized into sbb instruction, two for the signed int
case).  So e.g. for signed int we now emit
xorl    %edx, %edx
xorl    %eax, %eax
cmpl    %esi, %edi
setl    %dl
setg    %al
subl    %edx, %eax
ret
and for unsigned
xorl    %eax, %eax
cmpl    %esi, %edi
seta    %al
sbbb    $0, %al
ret
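
As a source-level illustration of the (x > y) - (x < y) mapping described
above, here is a minimal sketch. The names cmp3 and cmp3_partial are
hypothetical, and the code models only the -1, 0, 1 (, 2) result values, not
the actual .SPACESHIP expansion.

```cpp
#include <limits>

// Integer three-way result: -1 if x < y, 0 if equal, 1 if x > y.
constexpr int cmp3(int x, int y) { return (x > y) - (x < y); }

// Floating-point variant: 2 stands for "unordered" (either operand is NaN).
constexpr int cmp3_partial(double x, double y)
{
  return (x != x || y != y) ? 2 : (x > y) - (x < y);
}

static_assert(cmp3(1, 2) == -1, "less");
static_assert(cmp3(2, 2) == 0, "equal");
static_assert(cmp3(3, 2) == 1, "greater");
static_assert(cmp3_partial(std::numeric_limits<double>::quiet_NaN(), 1.0) == 2,
              "unordered");
```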

Note, I wonder if other targets wouldn't benefit from defining the
named optab too...

2024-10-07  Jakub Jelinek  

PR middle-end/116896
* optabs.def (spaceship_optab): Use spaceship$a4 rather than
spaceship$a3.
* internal-fn.cc (expand_SPACESHIP): Expect 3 call arguments
rather than 2, expand the last one, expect 4 operands of
spaceship_optab.
* tree-ssa-math-opts.cc: Include cfghooks.h.
   

[Bug libstdc++/115585] [12 Regression] --disable-libstdcxx-verbose causes undefined symbol: _ZSt21__glibcxx_assert_failPKciS0_S0_, version GLIBCXX_3.4.30

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115585

--- Comment #15 from GCC Commits  ---
The releases/gcc-12 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:c4d2f51741bbb1771219fbeaaf812fa73c36fc0f

commit r12-10747-gc4d2f51741bbb1771219fbeaaf812fa73c36fc0f
Author: Jonathan Wakely 
Date:   Fri Jun 28 15:14:15 2024 +0100

libstdc++: Define __glibcxx_assert_fail for non-verbose build [PR115585]

When the library is configured with --disable-libstdcxx-verbose the
assertions just abort instead of calling __glibcxx_assert_fail, and so I
didn't export that function for the non-verbose build. However, that
option is documented to not change the library ABI, so we still need to
export the symbol from the library. It could be needed by programs
compiled against the headers from a verbose build.

The non-verbose definition can just call abort so that it doesn't pull
in I/O symbols, which are unwanted in a non-verbose build.

libstdc++-v3/ChangeLog:

PR libstdc++/115585
* src/c++11/assert_fail.cc (__glibcxx_assert_fail): Add
definition for non-verbose builds.

(cherry picked from commit 52370c839edd04df86d3ff2b71fcdca0c7376a7f)

[Bug libstdc++/116641] [12 Regression] std::string move assignment incorrectly depends on POCCA

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116641

--- Comment #4 from GCC Commits  ---
The releases/gcc-12 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:2ab55da5eba0aa7a92e15d8100d51cc977f9aca4

commit r12-10748-g2ab55da5eba0aa7a92e15d8100d51cc977f9aca4
Author: Jonathan Wakely 
Date:   Tue Sep 10 14:25:41 2024 +0100

libstdc++: std::string move assignment should not use POCCA trait
[PR116641]

The changes to implement LWG 2579 (r10-327-gdb33efde17932f) made
std::string::assign use the propagate_on_container_copy_assignment
(POCCA) trait, for consistency with operator=(const basic_string&).
However, this also unintentionally affected operator=(basic_string&&)
which calls assign(str) to make a deep copy when performing a move is
not possible. The fix is for the move assignment operator to call
_M_assign(str) instead of assign(str), as this just does the deep copy
and doesn't check the POCCA trait first.

The bug only affects the unlikely/useless combination of POCCA==true and
POCMA==false, but we should fix it for correctness anyway. It should
also make move assignment slightly cheaper to compile and execute,
because we skip the extra code in assign(const basic_string&).

libstdc++-v3/ChangeLog:

PR libstdc++/116641
* include/bits/basic_string.h (operator=(basic_string&&)): Call
_M_assign instead of assign.
* testsuite/21_strings/basic_string/allocator/116641.cc: New
test.

(cherry picked from commit c07cf418fdde0c192e370a8d76a991cc7215e9c4)

[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399

--- Comment #9 from GCC Commits  ---
The releases/gcc-12 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:1f655ef43621cc022745c3aa9c77e3725b9280cd

commit r12-10753-g1f655ef43621cc022745c3aa9c77e3725b9280cd
Author: Jonathan Wakely 
Date:   Mon Jun 10 14:08:16 2024 +0100

libstdc++: Fix std::tr2::dynamic_bitset shift operations [PR115399]

The shift operations for dynamic_bitset fail to zero out words where the
non-zero bits were shifted to a completely different word.

For a right shift we don't need to sanitize the unused bits in the high
word, because we know they were already clear and a right shift doesn't
change that.

libstdc++-v3/ChangeLog:

PR libstdc++/115399
* include/tr2/dynamic_bitset (operator>>=): Remove redundant
call to _M_do_sanitize.
* include/tr2/dynamic_bitset.tcc (_M_do_left_shift): Zero out
low bits in words that should no longer be populated.
(_M_do_right_shift): Likewise for high bits.
* testsuite/tr2/dynamic_bitset/pr115399.cc: New test.

(cherry picked from commit bd3a312728fbf8c35a09239b9180269f938f872e)
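
The kind of multi-word shift the commit describes can be sketched
independently of libstdc++. This is not the dynamic_bitset implementation;
shift_left is a hypothetical helper over 64-bit words, with word 0 holding the
lowest bits. It shows the point of the fix: words vacated by a whole-word
shift must end up zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: shift a little-endian array of 64-bit words left by
// `shift` bits in place. Bits shifted past the top word are discarded.
void shift_left(std::vector<std::uint64_t>& w, std::size_t shift)
{
  const std::size_t bits = 64;
  const std::size_t n = w.size();
  if (n == 0 || shift == 0)
    return;
  const std::size_t wshift = shift / bits;  // whole words to move
  const std::size_t offset = shift % bits;  // remaining bit offset
  // Walk from the high word down so each source word is read before it is
  // overwritten.
  for (std::size_t i = n; i-- > 0;) {
    std::uint64_t v = 0;  // words below wshift are vacated and stay zero
    if (i >= wshift) {
      v = w[i - wshift] << offset;
      if (offset != 0 && i > wshift)
        v |= w[i - wshift - 1] >> (bits - offset);
    }
    w[i] = v;
  }
}
```

For example, shifting {3, 0} left by 65 bits gives {0, 6}: the low word is
cleared rather than left holding its old contents.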

[Bug libstdc++/116641] [12 Regression] std::string move assignment incorrectly depends on POCCA

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116641

Jonathan Wakely  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Jonathan Wakely  ---
Fixed for 12.5, 13.4, 14.3

[Bug libstdc++/115585] [12 Regression] --disable-libstdcxx-verbose causes undefined symbol: _ZSt21__glibcxx_assert_failPKciS0_S0_, version GLIBCXX_3.4.30

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115585

Jonathan Wakely  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #16 from Jonathan Wakely  ---
Fixed for 12.5, 13.4 and 14.2

[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399

--- Comment #10 from Jonathan Wakely  ---
And 13.4 and 12.5

[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399

Jonathan Wakely  changed:

   What|Removed |Added

   Target Milestone|13.4|12.5

[Bug tree-optimization/116982] [14/15 Regression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982

--- Comment #3 from Richard Biener  ---
The issue is likely that if-conversion produced a vector loop copy with a
different number of exit edges than the original not if-converted version
because of if-conversion doing simple DCE/FRE/DSE but the reporter disabling
all those opts.

[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product

2024-10-07 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #6 from vincenzo Innocente  ---
I'm just taking the product of two complex numbers, cannot call std::fma in the
user code: reimplementing the operator* is not trivial (and is a stdlib job
anyhow)

[Bug libstdc++/116991] FAIL: 26_numerics/complex/ext_c++23.cc -std=gnu++23 (test for excess errors)

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116991

Jonathan Wakely  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |redi at gcc dot gnu.org
   Last reconfirmed||2024-10-07
   Target Milestone|--- |15.0
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Jonathan Wakely  ---
The warning is for the initialization from a long double literal in this
function template:

  template<typename _Tp>
inline std::complex<_Tp>
__complex_acos(const std::complex<_Tp>& __z)
{
  const std::complex<_Tp> __t = std::asin(__z);
  const _Tp __pi_2 = 1.5707963267948966192313216916397514L;
  return std::complex<_Tp>(__pi_2 - __t.real(), -__t.imag());
}

This function template isn't used for _Float32, _Float64 etc. on other targets,
because they define _GLIBCXX_USE_C99_COMPLEX_ARC and so have overloads for each
type, like:

  inline __complex__ _Float32
  __complex_acos(__complex__ _Float32 __z)
  { return __builtin_cacosf(__z); }

We can just use a cast in the generic __complex_acos.
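
A standalone sketch of what such a cast could look like follows. This is an
illustration under assumptions, not the committed libstdc++ change: the name
complex_acos_sketch and the use of static_cast are choices made here.

```cpp
#include <complex>

// Hypothetical stand-in for the generic function template: the long double
// literal is converted explicitly to T instead of implicitly on
// initialization.
template <typename T>
std::complex<T> complex_acos_sketch(const std::complex<T>& z)
{
  const std::complex<T> t = std::asin(z);
  const T pi_2 = static_cast<T>(1.5707963267948966192313216916397514L);
  return std::complex<T>(pi_2 - t.real(), -t.imag());
}
```

For ordinary inputs the result should agree with std::acos(z) to within
rounding error.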

[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990

--- Comment #4 from Richard Biener  ---
Hum.  The following should have prevented that, but ...

  /* Check if we have any control flow that doesn't leave the loop.  */
  class loop *v_loop = loop->inner ? loop->inner : loop;

... not sure why we only look at the inner loop body?!

  basic_block *bbs = get_loop_body (v_loop);
  for (unsigned i = 0; i < v_loop->num_nodes; i++)
if (EDGE_COUNT (bbs[i]->succs) != 1
&& (EDGE_COUNT (bbs[i]->succs) != 2
|| !loop_exits_from_bb_p (bbs[i]->loop_father, bbs[i])))
  {
free (bbs);
return opt_result::failure_at (vect_location,
   "not vectorized:" 
   " unsupported control flow in loop.\n");
  }

[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
Mine.

[Bug libstdc++/116992] FAIL: 30_threads/semaphore/platform_try_acquire_for.cc -std=gnu++20 (test for excess errors)

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116992

Jonathan Wakely  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-10-07
   Assignee|unassigned at gcc dot gnu.org  |redi at gcc dot gnu.org
   Target Milestone|--- |15.0

--- Comment #1 from Jonathan Wakely  ---
This test uses -D_GLIBCXX_USE_POSIX_SEMAPHORE to force the use of POSIX sem_t
as the base class for std::counting_semaphore, but that doesn't work on targets
without sem_t.

[Bug middle-end/116997] New: [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249

2024-10-07 Thread stefansf at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997

Bug ID: 116997
   Summary: [13/14/15 Regression] Wrong bitfield accesses since
r13-3219-g25413fdb2ac249
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: stefansf at gcc dot gnu.org
  Target Milestone: ---
Target: s390x-*-*

struct S0
{
  unsigned f0;
  signed f2 : 11;
  signed : 6;
} GlobS, *Ptr = &GlobS;

const struct S0 Initializer = {7, 3};

int main (void)
{
  for (unsigned i = 0; i <= 2; i++)
*Ptr = Initializer;
  if (GlobS.f2 != 3)
__builtin_abort ();
  return 0;
}

gcc -march=z13 -O2 t.c
(should fail for any arch which supports vector extensions)

During ifcvt we have

Start lowering bitfields
Lowering:
Ptr.0_1->f2 = 3;
to:
_ifc__24 = Ptr.0_1->D.2918;
_ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;
Ptr.0_1->D.2918 = _ifc__25;
Done lowering bitfields
...
Match-and-simplified BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> to 3
RHS BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> simplified to 3
Setting value number of _ifc__25 to 3 (changed)
Replaced BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> with 3 in all uses of
_ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;
Value numbering stmt = Ptr.0_1->D.2918 = _ifc__25;

which in the end leads to the optimized tree output

int main ()
{
  struct S0 * Ptr.0_1;
  unsigned int _2;
  unsigned int _3;

   [local count: 268435458]:
  Ptr.0_1 = Ptr;
  MEM  [(void *)Ptr.0_1] = { 7, 3 };
  _2 = BIT_FIELD_REF ;
  _3 = _2 & 4292870144;
  if (_3 != 6291456)
goto ; [0.00%]
  else
goto ; [100.00%]

   [count: 0]:
  __builtin_abort ();

   [local count: 268435456]:
  return 0;

}

Since bitfields are left aligned on s390, constant 3 is wrong and should rather
be 0x60.
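
The constants in the dump can be reproduced with a little arithmetic. The
sketch below assumes the representative field D.2918 is 16 bits wide, which is
what makes 0x60 come out. That width and the helper name place_left_aligned
are assumptions for illustration, not taken from the dump.

```cpp
#include <cstdint>

// Hypothetical helper: place `value` in a `field_bits`-wide bitfield that
// occupies the most significant bits of a `container_bits`-wide container
// (left-aligned, as on big-endian targets).
constexpr std::uint32_t place_left_aligned(std::uint32_t value,
                                           unsigned field_bits,
                                           unsigned container_bits)
{
  return (value & ((1u << field_bits) - 1u)) << (container_bits - field_bits);
}

// 11-bit field holding 3 in an assumed 16-bit representative: 3 << 5.
static_assert(place_left_aligned(3, 11, 16) == 0x60, "expected store value");
// The same field viewed through the 32-bit BIT_FIELD_REF: 3 << 21.
static_assert(place_left_aligned(3, 11, 32) == 6291456u, "compare constant");
// Mask selecting the top 11 bits of 32.
static_assert(place_left_aligned(0x7ff, 11, 32) == 4292870144u, "mask");
```

A right-aligned placement would give plain 3, which is the value the folded
BIT_INSERT_EXPR ends up storing.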

[Bug target/55212] [SH] Switch to LRA

2024-10-07 Thread kkojima at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55212

--- Comment #384 from Kazumoto Kojima  ---
Created attachment 59289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59289&action=edit
a reduced test case for c#378 (with -O2 -fpic)

[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada

2024-10-07 Thread qinzhao at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933

--- Comment #16 from qinzhao at gcc dot gnu.org ---
(In reply to Eric Botcazou from comment #12)
> > We added one more argument for __builtin_clear_padding to distinguish
> > whether this call is for AUTO_INIT or not. 
> > > 
> > > diff --git a/gcc/tree.cc b/gcc/tree.cc
> > > index bc50afca9a3..095c02c5474 100644
> > > --- a/gcc/tree.cc
> > > +++ b/gcc/tree.cc
> > > @@ -9848,7 +9848,6 @@ build_common_builtin_nodes (void)
> > >   ftype = build_function_type_list (void_type_node,
> > >ptr_type_node,
> > >ptr_type_node,
> > > -   integer_type_node,
> > 
> > This integer_type_node is for the new argument.
> 
> See this assertion in gimple_fold_builtin_clear_padding:
> 
>  gcc_assert (gimple_call_num_args (stmt) == 2);

Okay, searching the history, it looks like the following patch forgot to
update the above routine when merging the 2nd and 3rd parameters for
__builtin_clear_padding:

From b56ad95854f0b007afda60c057f10b04666953c9 Mon Sep 17 00:00:00 2001
From: Jakub Jelinek 
Date: Fri, 11 Feb 2022 19:47:14 +0100
Subject: [PATCH] middle-end: Small __builtin_clear_padding improvements

When looking at __builtin_clear_padding today, I've noticed that
it is quite wasteful to extend the original user one argument to 3,
2 is enough.  We need to encode the original type of the first argument
because pointer conversions are useless in GIMPLE, and we need to record
a boolean whether it is for -ftrivial-auto-var-init=* or not.
But for recording the type we don't need the value (we've always used
zero) and for recording the boolean we don't need the type (we've always
used integer_type_node).
So, this patch merges the two into one.

2022-02-11  Jakub Jelinek  

* tree.cc (build_common_builtin_nodes): Fix up formatting in
__builtin_clear_padding decl creation.
* gimplify.cc (gimple_add_padding_init_for_auto_var): Encode
for_auto_init in the value of 2nd BUILT_IN_CLEAR_PADDING
argument rather than in 3rd argument.
(gimplify_call_expr): Likewise.  Fix up comment formatting.
* gimple-fold.cc (gimple_fold_builtin_clear_padding): Expect
2 arguments instead of 3, take for_auto_init from the value
of 2nd argument.

[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

--- Comment #2 from Andrew Pinski  ---
Works for me on the trunk:
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++
-static t.cc
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++
-static t.cc -O3
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++
-static t.cc -O3 -march=armv8.2-a+sve
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]

[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++
-static t.cc -O3 -march=armv8.2-a+sve -fno-vect-cost-model
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]

[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread Robert.Hardwick at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

--- Comment #5 from Robert Hardwick  ---
Not working on 11.4.0, I'll try 11.5.0 as you suggest.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

 g++ -O3 -march=armv8.2-a+sve test.cpp -o test
$:~/tools/pytorch/pytorch/argmin_test$ ./test
[0, 0, 0, 1, 0, 0, 1, 0]

---

yes apologies, forgot the #include  line in the reproducible example
code.

[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

--- Comment #3 from Andrew Pinski  ---
Note I needed to add the following 2 includes to get the testcase to compile:
```
#include 
#include 
```

[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2024-10-07
 Ever confirmed|0   |1

--- Comment #4 from Andrew Pinski  ---
Since 11.5.0 was the last release in the 11.x series, can you try out 11.5.0 or,
even better, GCC 14.2.0?

[Bug middle-end/117000] Inefficient code for 32-byte struct comparison (ptest missing)

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000

--- Comment #1 from Andrew Pinski  ---
>In GCC 14+ the compilation converges to test1 also in test2.


So what is happening is that in GCC 13 the SLP vectorizer is not able to vectorize
test2, but in GCC 14 it is. The loop vectorizer is able to handle test1 in both.

[Bug sanitizer/116984] -fsanitize=bounds triggers within __builtin_dynamic_object_size()

2024-10-07 Thread qinzhao at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116984

qinzhao at gcc dot gnu.org changed:

   What|Removed |Added

 CC||qinzhao at gcc dot gnu.org

--- Comment #10 from qinzhao at gcc dot gnu.org ---
(In reply to Kees Cook from comment #4)
> (In reply to Andrew Pinski from comment #1)
> > I don't think so since &p->array[negative] is undefined behavior even inside
> > a dynamic boz.
> 
> Without counted_by, that is true. With counted_by all out of bounds
> calculations are defined to result in a 0 bdos.

Negative "counted_by" values are treated as zero, so the
corresponding SIZE of the FAM is zero.

However, the "counted_by" value should NOT affect the array index. Therefore,
for
&p->array[negative]
since the index of the array is NEGATIVE, it is reasonable for the sanitizer to
report the error.

So, from my understanding, the behavior of the test case is correct.
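
The two checks described in this comment can be sketched as a small C++ model. This is only an illustration under stated assumptions: `fam_size_bytes` and `index_in_bounds` are hypothetical helper names, not GCC internals, plain integers stand in for the "counted_by" field and the index, and treating `index == counter` as out of bounds is a simplification.

```cpp
#include <cassert>
#include <cstddef>

// Model of the size derived from a counted_by field (hypothetical helper):
// a negative count is treated as zero, so the FAM has size 0.
constexpr std::size_t fam_size_bytes(long counter, std::size_t elem_size) {
    return counter < 0 ? 0 : static_cast<std::size_t>(counter) * elem_size;
}

// Model of the index check (hypothetical helper): a negative index is out of
// bounds no matter what the counter says.
constexpr bool index_in_bounds(long index, long counter) {
    return index >= 0 && index < (counter < 0 ? 0 : counter);
}

inline void check_counted_by_model() {
    assert(fam_size_bytes(-3, sizeof(int)) == 0);  // negative count -> size 0
    assert(!index_in_bounds(-1, 5));               // negative index -> reported
}
```

The point of keeping the two functions separate is that clamping the count never makes a negative index acceptable.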

[Bug target/117000] Inefficient code for 32-byte struct comparison (ptest missing)

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000

Andrew Pinski  changed:

   What|Removed |Added

  Component|middle-end  |target

--- Comment #2 from Andrew Pinski  ---
Otherwise it is a vector cost model issue.

[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990

--- Comment #5 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:b0b71618157ddac52266909978f331406f98f3a2

commit r15-4108-gb0b71618157ddac52266909978f331406f98f3a2
Author: Richard Biener 
Date:   Mon Oct 7 11:24:12 2024 +0200

tree-optimization/116990 - missed control flow check in
vect_analyze_loop_form

The following fixes checking for unsupported control flow in
vectorization to also cover the outer loop body.

PR tree-optimization/116990
* tree-vect-loop.cc (vect_analyze_loop_form): Check the current
loop body for control flow.

[Bug tree-optimization/116990] [14 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|15.0|14.3
  Known to work||15.0
Summary|[15 Regression] ICE on  |[14 Regression] ICE on
   |valid code at -O3   |valid code at -O3
   |"-fno-tree-ccp  |"-fno-tree-ccp
   |-fno-tree-loop-im   |-fno-tree-loop-im
   |-fno-tree-dse" on   |-fno-tree-dse" on
   |x86_64-linux-gnu: in|x86_64-linux-gnu: in
   |single_pred_edge, at|single_pred_edge, at
   |basic-block.h:342   |basic-block.h:342
   Priority|P3  |P2

--- Comment #6 from Richard Biener  ---
Fixed on trunk, queued for backporting.

[Bug tree-optimization/116982] [14/15 Regregression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist

2024-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982

--- Comment #4 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:9b86efd5210101954bd187c3aa8bb909610a5746

commit r15-4107-g9b86efd5210101954bd187c3aa8bb909610a5746
Author: Richard Biener 
Date:   Mon Oct 7 11:05:17 2024 +0200

tree-optimization/116982 - analyze scalar loop exit early

The following makes sure to discover the scalar loop IV exit during
analysis as failure to do so (if DCE and friends are disabled this
can happen due to if-conversion doing DCE and FRE on the if-converted
loop) would ICE later.

I refrained from larger refactoring to be able to eventually backport.

PR tree-optimization/116982
* tree-vectorizer.h (vect_analyze_loop): Pass in .LOOP_VECTORIZED
call.
(vect_analyze_loop_form): Likewise.
* tree-vect-loop.cc (vect_analyze_loop_form): Reject loops where we
cannot determine a IV exit for the scalar loop.
(vect_analyze_loop): Adjust.
* tree-vectorizer.cc (try_vectorize_loop_1): Likewise.
* tree-parloops.cc (gather_scalar_reductions): Likewise.

[Bug tree-optimization/116982] [14 Regregression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoisting

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982

Richard Biener  changed:

   What|Removed |Added

  Known to work||15.0
Summary|[14/15 Regregression] ICE   |[14 Regregression] ICE on
   |on valid code at -O3 with   |valid code at -O3 with
   |"-fno-tree-dce  |"-fno-tree-dce
   |-fno-tree-dominator-opts|-fno-tree-dominator-opts
   |-fno-tree-pre -fno-tree-dse |-fno-tree-pre -fno-tree-dse
   |-fno-tree-copy-prop |-fno-tree-copy-prop
   |-fno-tree-fre   |-fno-tree-fre
   |-fno-code-hoisting" on  |-fno-code-hoisting" on
   |x86_64-linux-gnu:   |x86_64-linux-gnu:
   |Segmentation fault  |Segmentation fault
   Keywords|needs-bisection |

--- Comment #5 from Richard Biener  ---
Fixed on trunk, queued for backporting.

[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada

2024-10-07 Thread qing.zhao at oracle dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933

--- Comment #18 from Qing Zhao  ---
> On Oct 7, 2024, at 11:34, ebotcazou at gcc dot gnu.org 
>  wrote:
> I see, thanks for investigation!  This was overlooked because the C family of
> compiler do not use the declaration built in common_builtin_nodes, but rather
> that derived from builtins.def:
> 
> DEF_GCC_BUILTIN(BUILT_IN_CLEAR_PADDING, "clear_padding",
> BT_FN_VOID_VAR, ATTR_NOTHROW_NONNULL_TYPEGENERIC_LEAF)
> 
> which accepts any number of arguments.

Oh, I see. That’s the reason this issue was only exposed now.
Thank you for fixing this issue.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-07 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Richard Sandiford  ---
Hopefully fixed.

[Bug tree-optimization/116578] vectorizer SLP transition issues / dependences

2024-10-07 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116578
Bug 116578 depends on bug 116583, which changed state.

Bug 116583 Summary: vectorizable_slp_permutation cannot handle even/odd extract 
from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada

2024-10-07 Thread ebotcazou at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933

--- Comment #17 from Eric Botcazou  ---
> Okay, by searching the history, looks like that the following patch forget
> to update the above routine when merging the 2nd and 3rd parameters for
> __builtin_clear_padding: 

I see, thanks for the investigation!  This was overlooked because the C family of
compilers does not use the declaration built in common_builtin_nodes, but rather
the one derived from builtins.def:

DEF_GCC_BUILTIN(BUILT_IN_CLEAR_PADDING, "clear_padding",
BT_FN_VOID_VAR, ATTR_NOTHROW_NONNULL_TYPEGENERIC_LEAF)

which accepts any number of arguments.

[Bug middle-end/116983] counted_by not used to identify singleton pointers

2024-10-07 Thread qinzhao at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983

qinzhao at gcc dot gnu.org changed:

   What|Removed |Added

 CC||qinzhao at gcc dot gnu.org

--- Comment #1 from qinzhao at gcc dot gnu.org ---
(In reply to Kees Cook from comment #0)
> When counted_by is present in a structure, it means that the object must be
> a singleton.
> 
> For example:
> 
> struct counted {
> int counter;
> int array[] __attribute__((counted_by(counter)));
> };
> 
> struct notcounted {
> int counter;
> int array[];
> };
> 
> void __attribute__((noinline)) emit_length(size_t length)
> {
> printf("%zu\n", length);
> }
> 
> // This correctly cannot know size of p object, and returns SIZE_MAX
> void objsize_notcounted(struct notcounted *p)
> {
> emit_length(__builtin_dynamic_object_size(p, 1));
> } 
> 
> // This must be operating on a singleton, therefor the
> // return must be:
> // max(sizeof(*p),
> // sizeof(*p) + offsetof(typeof(*p), array) * p->counter)
> void objsize_counted(struct counted *p)
> {
> emit_length(__builtin_dynamic_object_size(p, 1));
> }

Could you explicitly explain what's wrong in the current implementation?

[Bug rtl-optimization/117000] New: Inefficient code for 32-byte struct comparison (ptest missing)

2024-10-07 Thread chfast at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000

Bug ID: 117000
   Summary: Inefficient code for 32-byte struct comparison (ptest
missing)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: chfast at gmail dot com
  Target Milestone: ---

I was investigating why in GCC 13.3 the functions test1 and test2 produce
different x86 assembly. They only differ by the placement of the int -> U256
user defined conversion.

This led to the discovery that the x86-64-v2 code generated for all the examples is
not very efficient. E.g. for some reason a shift instruction is used (psrldq).

In GCC 14+ the compilation converges to test1 also in test2.

https://godbolt.org/z/r1vfcPone


using uint64_t = unsigned long;

struct U256
{
    uint64_t words_[4]{};

    U256(uint64_t v)
      : words_{v}
    {}
};

bool eq(const U256& x, const U256& y)
{
    uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}

bool eqi(const U256& x, uint64_t y)
{
    return eq(x, U256(y));
}

auto test1(const U256& x)
{
    return eqi(x, uint64_t(0));
}

bool test2(const U256& x)
{
    return eq(x, U256(0));
}


test1(U256 const&):
        movdqu  xmm1, XMMWORD PTR [rdi+16]
        movdqu  xmm0, XMMWORD PTR [rdi]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 8
        por     xmm0, xmm1
        movq    rax, xmm0
        test    rax, rax
        sete    al
        ret
test2(U256 const&):
        mov     rax, QWORD PTR [rdi]
        or      rax, QWORD PTR [rdi+8]
        or      rax, QWORD PTR [rdi+16]
        or      rax, QWORD PTR [rdi+24]
        sete    al
        ret

[Bug c++/117001] New: O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread Robert.Hardwick at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Bug ID: 117001
   Summary: O3 auto tree loop vectorization produces incorrect
output on armv8.2-a+sve
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: Robert.Hardwick at arm dot com
  Target Milestone: ---

We have seen some incorrect numbers being produced when O3 is enabled on Arm
Neoverse V1 ( armv8.2-a+sve ). I have reduced the problem down to a small
reproducer and identified that adding -fno-tree-loop-vectorize to gcc options
will produce the correct output.

It seems to happen when we have a C-style array contained within a std::array
structure, and it occurs when auto loop vectorization is enabled.

This has been observed on 10.2.1 and 11.4.1

Reproducible example

  #include 

  typedef std::array my_type;

  // helpful to print output to stdout
  std::ostream& operator<<(std::ostream& stream, const my_type& vec) {
stream << "[";
for ( int j = 0; j < 2; j++){
  for (int i = 0; i != 4; i++) {
if (i != 0 || j != 0) {
  stream << ", ";
}
stream << vec[j][i];
  }
}
stream << "]";
return stream;
  }

  int main() {
my_type a = {{0, 0, 0, 1, 0, 0, 1, 0}};
my_type b = {{1, 1, 1, 1, 1, 1, 1, 1}};
my_type mask = {{0, 0, 0, 0, 0, 1, 0, 0}};

my_type result = {{0, 0, 0, 0, 0, 0, 0, 0}};

for (int i = 0; i < 2; i++) {
  for (int j = 0; j < 4; j++) {
if ( mask[i][j] != 0 )
{
  result[i][j] = b[i][j];
} else {
  result[i][j] = a[i][j];
}
  }
}

std::cout << result << std::endl;
  }


Observations

With -O3 -fno-tree-loop-vectorize -march=armv8.2-a+sve  output is INCORRECT

[0, 0, 0, 1, 0, 0, 1, 0]

with -O3 -march=armv8.2-a+sve output is CORRECT

[0, 0, 0, 1, 0, 1, 1, 0]


The operation should be doing the equivalent of 

result[i] = mask[i] ? b[i] : a[i]

So the 6th element ( at i=1, j=1 ) should be 1, not 0.
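
As a cross-check of the expected values, here is a minimal scalar reference for the select operation, written so that no vectorization is needed to reason about it. It is a sketch only: the element type in the report was lost in extraction, so `int` elements in a 2x4 layout are an assumption, and `Grid`, `select_reference` and `expected_for_report_inputs` are names introduced here for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: int elements in a 2x4 layout (the report's element type is unknown).
using Grid = std::array<std::array<int, 4>, 2>;

// Scalar reference: result[i][j] = mask[i][j] ? b[i][j] : a[i][j]
inline Grid select_reference(const Grid& mask, const Grid& a, const Grid& b) {
    Grid result{};
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            result[i][j] = (mask[i][j] != 0) ? b[i][j] : a[i][j];
    return result;
}

// Same input values as in the reproducer above.
inline Grid expected_for_report_inputs() {
    const Grid a    = {{{0, 0, 0, 1}, {0, 0, 1, 0}}};
    const Grid b    = {{{1, 1, 1, 1}, {1, 1, 1, 1}}};
    const Grid mask = {{{0, 0, 0, 0}, {0, 1, 0, 0}}};
    return select_reference(mask, a, b);
}

inline void check_sixth_element() {
    assert(expected_for_report_inputs()[1][1] == 1);  // i=1, j=1 comes from b
}
```

With these inputs the reference yields [0, 0, 0, 1, 0, 1, 1, 0], which matches the value stated above for the 6th element.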

[Bug c++/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve

2024-10-07 Thread Robert.Hardwick at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

--- Comment #1 from Robert Hardwick  ---
Apologies, I've got that the wrong way around.

With -O3 -fno-tree-loop-vectorize -march=armv8.2-a+sve  output is CORRECT

[0, 0, 0, 1, 0, 1, 1, 0]

with -O3 -march=armv8.2-a+sve output is INCORRECT

[0, 0, 0, 1, 0, 0, 1, 0]

[Bug sanitizer/116984] -fsanitize=bounds triggers within __builtin_dynamic_object_size()

2024-10-07 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116984

Jakub Jelinek  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #9 from Jakub Jelinek  ---
(In reply to Kees Cook from comment #8)
> (In reply to Jakub Jelinek from comment #6)
> > counted_by is just another way how to get the initial whole
> > object dynamic size (similarly to fixed size automatic/static vars, malloc
> > etc., alloca, VLA definitions, whatever else provides the size of the whole
> > object).
> 
> I don't understand why the word "initial" is used there. It provides the
> _ongoing_ runtime bounds of the given array. Both the bounds sanitizer and
> __bdos were extended to make use of that information.

It is initial in the workflow of the object size pass, which has some
__builtin_object_size/__builtin_dynamic_object_size calls (explicit in the IL
or implicit e.g. from sanitization) and tracks object sizes through pointer
arithmetics and PHIs back to something which has a known object (or subobject)
size.
In your testcase, p->array has that known size because of counted_by attribute,
in other cases it could be a pointer initialized from malloc or similar calls,
in other cases it could be an object with non-dynamic size.

> > The rest is __builtin_dynamic_object_size dynamic tracking from that size
> > through pointer arithmetics etc.  And that doesn't change depending on what
> > the whole size has been computed with.
> 
> Part of the bounds sanitizer+__bdos work was to make sure that getting the
> size of invalidly indexed array element is 0 (and _not_ "don't know", since
> we *do* know: there is no element at an invalid location, therefore the size
> available at such an "address" is 0 bytes). This is so that the various

You can't get anything safely after invoking UB, there is nothing safe after
that, anything can happen.

The "side-effects imply don't know" special case was done for
__builtin_object_size (and __builtin_dynamic_object_size later just
inherited it) so that one could use the builtin e.g. in macros and not
risk side-effects being evaluated multiple times.
So it is for cases like
__builtin_object_size (&a[i++], 0)
where we just say the function will return "don't know" and will not evaluate
the expression.  If there aren't side-effects in the C/C++ meaning (that
includes e.g. reading volatile vars, calling impure functions etc.), the
expression in the __bos/__bdos argument is evaluated at runtime and if there is
UB in it, the program is still invalid, and the compiler really can't guarantee
anything about that.
Consider if you have
int *pindex;
__builtin_dynamic_object_size (&a[*pindex], 0);
If pindex is NULL or an otherwise invalid pointer, the program will invoke UB
and certainly doesn't guarantee returning 0.  It can be diagnosed by UBSan
(e.g. the NULL case), it might be diagnosed by ASan (invalid pointer, some
cases of it), it might not be diagnosed at all and just crash.
Similarly, if you have
int idx1, idx2;
__builtin_dynamic_object_size (&a[idx1 + idx2], 0);
and idx1 + idx2 evaluation invokes signed integer overflow, again it will be UB
and anything can happen.  And the negative index on array ref is yet another
UB.
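
The side-effects special case described above can be observed with a small probe. This is a sketch, assuming GCC (or a compiler that implements the builtin the same way); `buf` and `object_size_with_side_effect` are names introduced here for illustration, and the behaviour relied on is the documented one: an argument with side effects is not evaluated and the type-0 result is (size_t) -1.

```cpp
#include <cassert>
#include <cstddef>

static char buf[16];

// Calls the builtin with an argument that has a side effect (i++), returns
// the builtin's value, and reports through `i_after` whether i was changed.
inline std::size_t object_size_with_side_effect(int& i_after) {
    int i = 0;
    std::size_t n = __builtin_object_size(&buf[i++], 0);
    i_after = i;
    return n;
}

inline void check_on_gcc() {
    int after = -1;
    std::size_t n = object_size_with_side_effect(after);
    assert(n == static_cast<std::size_t>(-1));  // "don't know" for type 0
    assert(after == 0);                         // i++ was never evaluated
}
```

This only covers the side-effects case; the UB cases listed above (invalid pointer, signed overflow, negative index) have no defined result to probe.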

If __bos/__bdos doesn't have the side-effects which cause immediate folding of
the builtin with its arguments to the "don't know" value, then the expressions
are simply lowered to normal IL like anything else and can be instrumented for
UB like anything else, I don't think LLVM has any other representation for it
where you could safely invoke any kind of UB you'd like and all it would do is
change the return value of the __bos/__bdos call to 0.

[Bug tree-optimization/116974] omp inscan reduction not supported with SLP

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116974

--- Comment #2 from Richard Biener  ---
One issue is that with SLP scheduling we're relying on data dependence to order
vector stmts in the correct order.  With omp scan we have scalar code like

  _12 = .GOMP_SIMD_LANE (simduid.2_6(D), 0);
  _13 = .GOMP_SIMD_LANE (simduid.2_6(D), 1);
  D.2789[_13] = 0;
  _15 = (long unsigned int) i_42;
  _16 = _15 * 4;
  _18 = a_17(D) + _16;
  _19 = *_18;
  r.0_20 = D.2789[_12];
  _21 = _19 + r.0_20;
  D.2789[_12] = _21;
  _23 = .GOMP_SIMD_LANE (simduid.2_6(D), 2);
  _24 = D.2790[_23];
  _25 = D.2789[_23];
  _26 = _24 + _25;
  D.2790[_23] = _26;
  D.2789[_23] = _26;
  _30 = b_29(D) + _16;
  r.0_31 = D.2789[_12];
  *_30 = r.0_31;

where vectorization of the in-scan reduction is currently performed by
vectorizable_scan_store on the D.2790[_23] = _26 store.  But vector
stmt order with respect to the other "SLP instance" defining D.2789
for non-SLP simply relies on us emitting vector stmts where scalar
stmts are but with SLP this only works because in the end we're using
the first scalar stmt as point to emit.

I think it would be preferable iff the temporaries would be elided as
SSA names and thus not appear as loads/stores.

I'm not sure whether this whole inscan / scan stuff would be necessary
if we'd support vectorizing reductions that are used inside of the loop.
Like by forcing them to be in-order and constructing the vector of
reduction values in each iteration.

That we key off the reduction code-gen from the store and not the
add isn't helpful.  Very likely a cleaner solution would at least
make the scan-reduction visible during SLP discovery so we can key
code generation off that stmt.

We'd want the reduction operands (_24 and _25 above) as well as that
initialization value (0 from the D.2789[_13] = 0 store) as children.

I do have a hackish patch to make most cases work with the current scheme
though.
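
For orientation, the scalar meaning of the in-scan reduction in the dump above can be sketched as follows. This is an assumption-laden reading of the GIMPLE, not the vectorizer's code: it takes the reduction to be an inclusive "+" scan (the store through `_30` happens after the update of `r`), and `inclusive_scan_reference` is a name introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar meaning of an inclusive in-scan "+" reduction:
// b[i] = a[0] + ... + a[i].
inline std::vector<int> inclusive_scan_reference(const std::vector<int>& a) {
    std::vector<int> b(a.size());
    int r = 0;                  // initial value, cf. the D.2789[_13] = 0 store
    for (std::size_t i = 0; i < a.size(); ++i) {
        r += a[i];              // input phase, cf. _21 = _19 + r.0_20
        // An OpenMP `scan inclusive(r)` directive would separate the phases here.
        b[i] = r;               // scan phase, cf. *_30 = r.0_31
    }
    return b;
}

inline void check_scan() {
    assert((inclusive_scan_reference({1, 2, 3, 4}) == std::vector<int>{1, 3, 6, 10}));
}
```

The reduction value `r` is consumed inside the loop on every iteration, which is the "reduction used inside of the loop" situation mentioned above.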

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-10-07 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #20 from Vineet Gupta  ---
The model schedule change (at tweak9) seems stable and is showing very promising
results.

The hottest basic block's register pressure drops significantly

;; Pressure summary (bb 206): GR_REGS:313 FP_REGS:946
;; Pressure summary (bb 221): GR_REGS:312 FP_REGS:946

   vs.

;; Pressure summary (bb 206): GR_REGS:269 FP_REGS:285
;; Pressure summary (bb 221): GR_REGS:268 FP_REGS:285


riscv qemu icounts

  2,127,546,200,703 # fix2 tweak9  (~16% improv)
  2,544,112,250,412 # baseline

aarch64 qemu icount

  1,240,904,969,590 # fix2 tweak9 (~10% improv)
  1,371,320,697,809 # baseline
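
The percentages quoted next to the icounts can be reproduced from the raw numbers. A minimal sketch of the arithmetic (`improvement_percent` is a helper name introduced here, not part of any tool):

```cpp
#include <cassert>
#include <cstdint>

// Relative reduction in instruction count, in percent of the baseline.
constexpr double improvement_percent(std::uint64_t baseline, std::uint64_t tuned) {
    return 100.0 * static_cast<double>(baseline - tuned) / static_cast<double>(baseline);
}

inline void check_icounts() {
    const double riscv   = improvement_percent(2544112250412ULL, 2127546200703ULL);
    const double aarch64 = improvement_percent(1371320697809ULL, 1240904969590ULL);
    assert(riscv > 16.3 && riscv < 16.4);     // quoted as ~16%
    assert(aarch64 > 9.5 && aarch64 < 9.6);   // quoted as ~10%
}
```

So the riscv figure is about 16.4% and the aarch64 figure about 9.5%, consistent with the rounded values above.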

[Bug target/80881] Implement Windows native TLS

2024-10-07 Thread lh_mouse at 126 dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881

--- Comment #26 from LIU Hao  ---
Comment on attachment 59290
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59290
Newer patch for TLS support, incomplete

> +  "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR 
> [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR 
> gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}"

For i686 this would be (untested):

```
"mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44,
%1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}"
```

i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS), and
addresses of global symbols are absolute (instead of being RIP-relative).

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-10-07 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #21 from Vineet Gupta  ---
The code is currently pushed to 
   https://github.com/vineetgarc/gcc/commits/topic-sched1/

[Bug target/117008] -march=native pessimization of 25% with bitset popcount

2024-10-07 Thread mattreecebentley at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

--- Comment #4 from Matt Bentley  ---
Yeah, I know, I mentioned that in the report.

It's not a bad benchmark, it's benchmarking access of individual consecutive
bits, not summing. The counting is merely for preventing the compiler from
optimizing out the loop.

I could equally make it benchmark random indices and I imagine the problem
would remain, though I haven't checked.

Still, your point is valid in that most non-benchmark code would likely have
more code around the access. Could potentially lead to misleading benchmark
results in other scenarios though. I haven't tested whether vector/array
indexing triggers the same bad vectorisation.

[Bug target/117008] -march=native pessimization of 25% with bitset popcount

2024-10-07 Thread mattreecebentley at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

--- Comment #5 from Matt Bentley  ---
(In reply to Andrew Pinski from comment #1)
> Can you provide the output of invoking g++ with -march=native and when
> compiling?

The .ii files were identical, so did you mean .o files?

[Bug target/117006] New: [15 regression] GCC trunk generates larger code than GCC 14 at -Os

2024-10-07 Thread dccitaliano at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117006

Bug ID: 117006
   Summary: [15 regression] GCC trunk generates larger code than
GCC 14 at -Os
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: dccitaliano at gmail dot com
  Target Milestone: ---

Similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116994

https://godbolt.org/z/64bxGvnrh

long patatino() {
long x = 0;
for (int i = 0; i < 5; ++i) {
while (x < 10) {
if ((x % 2 == 0 && x % 3 != 0) || (x % 5 == 0 && x > 5)) {
if (x > 7) {
x += 4;
} else {
x += 2;
}
} else {
x += 1;
}
while (x % 4 == 0) {
x += 3;
}
}
}
return x;
}


In particular, trunk generates

.L4:
lea rax, [rcx+4]
lea rdx, [rcx+2]
cmp rcx, 8
cmovl   rax, rdx
mov rcx, rax
jmp .L7


instead of:

.L4:
cmp rcx, 7
jle .L6
add rcx, 4
jmp .L7
.L6:
add rcx, 2
jmp .L7
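
At source level the two sequences compute the same thing; only the shape differs. The following sketch mirrors each form in C++ to make the correspondence explicit. The function names are introduced here for illustration, and the instruction comments are a reading of the listings above, not compiler output.

```cpp
#include <cassert>

// Branchy form, as written in the test case: if (x > 7) x += 4; else x += 2;
inline long step_branchy(long x) {
    if (x > 7)
        return x + 4;
    return x + 2;
}

// Select form mirroring the trunk sequence: compute both sums, then pick one.
inline long step_select(long x) {
    const long plus4 = x + 4;        // lea rax, [rcx+4]
    const long plus2 = x + 2;        // lea rdx, [rcx+2]
    return (x < 8) ? plus2 : plus4;  // cmp rcx, 8 ; cmovl rax, rdx
}

inline void check_steps_agree() {
    for (long x = -5; x < 20; ++x)
        assert(step_branchy(x) == step_select(x));
}
```

Both forms agree for every input, so the regression reported here is about code size at -Os, not about correctness.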

[Bug target/117007] New: Poor optimiation for small vector constants needed for vector shift/rotate/mask genration.

2024-10-07 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Bug ID: 117007
   Summary: Poor optimiation for small vector constants needed for
vector shift/rotate/mask genration.
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59291
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
compile with -m64 -O3 -mcpu=power8 or power9

For vector library codes there is a frequent need to "splat" small integer
constants needed for vector shifts, rotates, and mask generation. The
instructions exist (i.e. vspltisw, xxspltib, xxspltiw) and are supported by intrinsics.

But when these are used to provide constant VRs for other vector operations,
the compiler goes out of its way to convert them to vector loads from .rodata.

This is especially bad for power8/9, as .rodata requires 32-bit offsets and
always generates 3/4 instructions with a best-case (L1 cache hit) latency of 9
cycles. The original splat immediate / shift implementation will run 2-4
instructions (with a good chance for CSE) and 4-6 cycles latency.

For example:

vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

GCC V6 generates:

01c0 :
 1c0:   8c 03 09 10 vspltisw v0,9
 1c4:   8c 03 5f 10 vspltisw v2,-1
 1c8:   84 02 42 10 vsrw    v2,v2,v0
 1cc:   20 00 80 4e blr


While GCC 13.2.1 generates:

01c0 :
 1c0:   00 00 4c 3c addis   r2,r12,0
1c0: R_PPC64_REL16_HA   .TOC.
 1c4:   00 00 42 38 addi    r2,r2,0
1c4: R_PPC64_REL16_LO   .TOC.+0x4
 1c8:   00 00 22 3d addis   r9,r2,0
1c8: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 1cc:   00 00 29 39 addi    r9,r9,0
1cc: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 1d0:   ce 48 40 7c lvx     v2,0,r9
 1d4:   20 00 80 4e blr

This is the same for -mcpu=power8/power9.
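
For reference, the constant that both sequences materialize can be worked out per 32-bit lane with a scalar model. This is a sketch, not the intrinsic implementation: `lane_vsrw` and `mask_sig_lane` are names introduced here, and the model assumes vsrw shifts each word right logically by the low 5 bits of the count.

```cpp
#include <cassert>
#include <cstdint>

// One 32-bit lane of vec_vsrw: logical right shift by the low 5 bits of the count.
constexpr std::uint32_t lane_vsrw(std::uint32_t value, std::uint32_t count) {
    return value >> (count & 31u);
}

// Per-lane value of mask_sig_v2(): splat(-1) shifted right by splat(9).
constexpr std::uint32_t mask_sig_lane() {
    return lane_vsrw(0xFFFFFFFFu, 9u);
}

inline void check_mask_sig() {
    assert(mask_sig_lane() == 0x007FFFFFu);  // low 23 bits: the float significand
}
```

Each lane is 0x007FFFFF, i.e. the 23-bit significand mask of an IEEE single, which is what the .rodata load in the GCC 13.2.1 sequence has to fetch from memory.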

It gets worse for vector functions that require multiple shift/mask constants.

For example:

// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
(vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden bit into the sig bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}
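
What this function computes per lane can be sketched with plain 32-bit integers. This is a scalar model under stated assumptions, not the vector code: `mask_exp_v2` is not shown in the report, so `kExpMask = 0x7F800000` is an assumption (it is consistent with the GCC V6 sequence, which shifts 0xFF left by 23), and all names below are introduced here for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kSigMask = 0xFFFFFFFFu >> 9;        // mask_sig_v2 lane: 0x007FFFFF
constexpr std::uint32_t kExpMask = 0x7F800000u;             // assumed mask_exp_v2 lane
constexpr std::uint32_t kHidden  = kSigMask - 0xFFFFFFFFu;  // vec_sub (sigmask, ones): 0x00800000

// vec_sel (a, b, mask): bits from b where mask is 1, bits from a elsewhere.
constexpr std::uint32_t lane_sel(std::uint32_t a, std::uint32_t b, std::uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

// One lane of test_extsig_v2, operating on the raw bits of a float.
inline std::uint32_t extsig_lane(std::uint32_t bits) {
    const std::uint32_t exp     = bits & kExpMask;
    const std::uint32_t is_max  = (exp == kExpMask) ? 0xFFFFFFFFu : 0u;  // Inf/NaN
    const std::uint32_t is_zero = (exp == 0u) ? 0xFFFFFFFFu : 0u;        // zero/denormal
    const std::uint32_t normal  = ~(is_max | is_zero);                   // vec_nor
    const std::uint32_t sig     = bits & kSigMask;
    return lane_sel(sig, normal, kHidden);  // set the hidden bit only if normal
}

inline void check_extsig() {
    assert(extsig_lane(0x3F800000u) == 0x00800000u);  // 1.0f: hidden bit only
    assert(extsig_lane(0x00000001u) == 0x00000001u);  // denormal: no hidden bit
}
```

The model makes it visible that three constants (0x007FFFFF, 0x7F800000, 0x00800000) are all derivable from splat immediates plus a shift or subtract, which is the point of the report.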

GCC V6 generated:
0310 :
 310:   8c 03 bf 11 vspltisw v13,-1
 314:   8c 03 37 10 vspltisw v1,-9
 318:   8c 03 60 11 vspltisw v11,0
 31c:   06 0a 0d 10 vcmpgtub v0,v13,v1
 320:   84 09 00 10 vslw    v0,v0,v1
 324:   8c 03 29 10 vspltisw v1,9
 328:   17 14 80 f1 xxland  vs44,vs32,vs34
 32c:   84 0a 2d 10 vsrw    v1,v13,v1
 330:   86 00 0c 10 vcmpequw v0,v12,v0
 334:   86 58 8c 11 vcmpequw v12,v12,v11
 338:   80 6c a1 11 vsubuwm v13,v1,v13
 33c:   17 14 41 f0 xxland  vs34,vs33,vs34
 340:   17 65 00 f0 xxlnor  vs32,vs32,vs44
 344:   7f 03 42 f0 xxsel   vs34,vs34,vs32,vs45
 348:   20 00 80 4e blr

While GCC 13.2.1 -mcpu=power8 generates:
360 :
 360:   00 00 4c 3c addis   r2,r12,0
360: R_PPC64_REL16_HA   .TOC.
 364:   00 00 42 38 addi    r2,r2,0
364: R_PPC64_REL16_LO   .TOC.+0x4
 368:   00 00 02 3d addis   r8,r2,0
368: R_PPC64_TOC16_HA   .rodata.cst16+0x30
 36c:   00 00 42 3d addis   r10,r2,0
36c: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 370:   8c 03 a0 11 vspltisw v13,0
 374:   00 00 08 39 addi    r8,r8,0
374: R_PPC64_TOC16_LO   .rodata.cst16+0x30
 378:   00 00 4a 39 addi    r10,r10,0
378: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 37c:   00 00 22 3d addis   r9,r2,0
37c: R_PPC64_TOC16_HA   .rodata.cst16+0x40
 380:   e4 06 4a 79 rldicr  r10,r10,0,59
 384:   ce 40 20 7c lvx     v1,0,r8
 388:   00 00 29 39 addi    r9,r9,0
388: R_PPC64_TOC16_LO   .rodata.cst16+0x40
 38c:   8c 03 17 10 vspltisw v0,-9
 390:   98 56 00 7c lxvd2x  vs0,0,r10
 394:   e4 06 29 79 rldicr  r9,r9,0,59
 398:   98 4e 80 7d lxvd2x  vs12,0,r9
 39c:   84 01 21 10 vslw    v1,v1,v0
 3a0:   50 02 00 f0 xxswapd vs0,vs0
 3a4:   17 14 01 f0 xxland  vs32,vs33,vs34
 3a8:   50 62 8c f1 xxswapd vs12,vs12
 

[Bug target/117007] Poor optimiation for small vector constants needed for vector shift/rotate/mask genration.

2024-10-07 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Peter Bergner  changed:

   What|Removed |Added

 CC||bergner at gcc dot gnu.org,
   ||guojiufu at gcc dot gnu.org,
   ||linkw at gcc dot gnu.org,
   ||meissner at gcc dot gnu.org,
   ||segher at gcc dot gnu.org

--- Comment #1 from Peter Bergner  ---
Jeff, did any of your recent constant patches help with this or is this
something different?

[Bug c++/117004] New: Unexpected const variable type with decltype of non-type template parameter of deduced type

2024-10-07 Thread barry.revzin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004

Bug ID: 117004
   Summary: Unexpected const variable type with decltype of
non-type template parameter of deduced type
   Product: gcc
   Version: 14.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: barry.revzin at gmail dot com
  Target Milestone: ---

This is similar to #99631, but this example deals with scalars and still fails
on trunk. Attempted reduction:

#include 

template  struct integral_constant {
static constexpr int value = V; 
};

template 
using value_type = decltype(V);

void f() {
  [](T) {
// this fails on gcc (which thinks it's "const int")
// passes on clang
static_assert(std::same_as, int>);

// .. but this DOES pass on gcc???
static_assert(__is_same(value_type, int));

  }(integral_constant<0>());
}

value_type<0> is int, but gcc sometimes thinks for more complicated spellings
of 0 that it is const int.
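
The baseline behaviour stated above (value_type<0> is int) can be checked in isolation. This is a minimal sketch, not a reconstruction of the reduced test case, whose template parameter lists were lost in extraction: it assumes `value_type` is an alias template over an `auto` non-type parameter (C++17), and it only uses literals, which are not affected by the problem reported here.

```cpp
#include <cassert>
#include <type_traits>

// Assumed shape of the alias from the report; requires C++17.
template <auto V>
using value_type = decltype(V);

// The template parameter is a prvalue of scalar type, so decltype(V) carries
// no cv-qualification.
static_assert(std::is_same<value_type<0>, int>::value, "0 deduces int");
static_assert(std::is_same<value_type<'a'>, char>::value, "'a' deduces char");
static_assert(std::is_same<value_type<true>, bool>::value, "true deduces bool");

inline bool literal_zero_is_int() {
    return std::is_same<value_type<0>, int>::value;
}

inline void check_value_type() {
    assert(literal_zero_is_int());
}
```

The failing case in the report differs in that the argument is spelled through a dependent expression inside a generic lambda; that spelling is deliberately not reproduced here.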

[Bug libstdc++/117005] New: Parallel Mode algorithms need to qualify all calls

2024-10-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117005

Bug ID: 117005
   Summary: Parallel Mode algorithms need to qualify all calls
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redi at gcc dot gnu.org
  Target Milestone: ---

libstdc++-v3/include/parallel/algo.h is full of unqualified calls like:

  template
inline typename iterator_traits<_IIter>::difference_type
count(_IIter __begin, _IIter __end, const _Tp& __value)
{
  return __count_switch(__begin, __end, __value,
std::__iterator_category(__begin));
}

This performs ADL for __count_switch, which can cause the compiler to attempt
to complete incomplete classes to find associated namespaces (and it's just
slower to do ADL than qualified lookups).

Everything should be qualified except calls to std::swap (which were removed by
g:fc722a0ea442f0 anyway).
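
The effect of qualifying the call can be shown with a small stand-alone sketch. It is not libstdc++ code: the `demo` namespace, `count_switch`, `count`, `Holder` and `Incomplete` are all stand-ins introduced here, modelled on the kind of type the failing test uses. With the qualified call below, the program compiles even though `Holder<Incomplete>` can never be completed; replacing `demo::count_switch(...)` by an unqualified `count_switch(...)` would make the compiler look for associated namespaces of `Holder<Incomplete>*` and try to instantiate `Holder<Incomplete>`.

```cpp
#include <cassert>
#include <cstddef>

struct Incomplete;                             // declared, never defined
template <typename T> struct Holder { T t; };  // Holder<Incomplete> cannot be completed

namespace demo {

template <typename Iter, typename T>
std::ptrdiff_t count_switch(Iter first, Iter last, const T& value) {
    std::ptrdiff_t n = 0;
    for (; first != last; ++first)
        if (*first == value)  // built-in pointer comparison, no overload lookup
            ++n;
    return n;
}

template <typename Iter, typename T>
std::ptrdiff_t count(Iter first, Iter last, const T& value) {
    // Qualified call: ordinary lookup only, so no ADL and no attempt to
    // complete the classes associated with the argument types.
    return demo::count_switch(first, last, value);
}

}  // namespace demo

inline std::ptrdiff_t count_null_holders() {
    Holder<Incomplete>* ptrs[3] = {nullptr, nullptr, nullptr};
    Holder<Incomplete>* const null = nullptr;
    return demo::count(ptrs, ptrs + 3, null);
}

inline void check_qualified_call() {
    assert(count_null_holders() == 3);
}
```

Pointers to `Holder<Incomplete>` are complete types, so iterating over them and comparing them is fine; only the unqualified call forces the instantiation.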

This causes at least one testsuite failure, but should fail elsewhere too if we
tested the other algos similarly:

FAIL: 24_iterators/indirect_callable/projected-adl.cc  -std=gnu++20 (test for
excess errors)
Excess errors:
/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/24_iterators/indirect_callable/projected-adl.cc:23:
error: 'Holder::t' has incomplete type

[Bug middle-end/116983] counted_by not used to identify singleton pointers

2024-10-07 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek  ---
I'm not sure we can derive anything from the pointer type (especially but not
only because pointer conversions are useless in GIMPLE), just as we e.g. try not
to derive alignment from it.  A lot of code will simply cast pointers to other
pointer types; what really matters is which pointer types have been
dereferenced.
In the FAM case, &p->array[0] or similar represents that dereferencing, and the
pointer then has to point to the FAM object, but the mere declaration of a
pointer type on a function argument doesn't mean much; there could be a cast in
the caller and in the callee too.

[Bug c++/97375] Unexpected top-level const retainment when declaring non-type template paramter with decltype(auto)

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97375

Andrew Pinski  changed:

   What|Removed |Added

   Target Milestone|--- |12.0
 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Andrew Pinski  ---
(In reply to Marek Polacek from comment #2)
> Fixed on trunk by r12-1224.  Maybe we want to backport that to 11 too.

GCC 11.5.0 was the final release from the GCC 11 branch so closing as fixed.

[Bug c++/115314] auto template parameter has const qualifier on it even though the original does not

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115314

Andrew Pinski  changed:

   What|Removed |Added

 CC||barry.revzin at gmail dot com

--- Comment #2 from Andrew Pinski  ---
*** Bug 117004 has been marked as a duplicate of this bug. ***

[Bug c++/117004] Unexpected const variable type with decltype of non-type template parameter of deduced type

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004

Andrew Pinski  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #1)
> I think there is a dup of this one around.

Yes PR 115314.

*** This bug has been marked as a duplicate of bug 115314 ***

[Bug target/117006] [15 regression] GCC trunk generates larger code than GCC 14 at -Os

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117006

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization

--- Comment #1 from Andrew Pinski  ---
For the reduced testcase:
```
int f(long a, int b)
{
  if (b > 7)
    return a+4;
  return a+2;
}
```

GCC 14 is smaller by 1 byte than GCC 15.

But for a slightly different testcase:
```
int h(long);
int f(long a, int b)
{
  if (b > 7)
    a+=4;
  else
    a+=2;
  return h(a);
}
```
The trunk (cmov) is smaller by 1 byte.

So this looks like it is just by accident.

-Os sometimes is just heuristics, and a difference of one or two bytes in either
direction here might be a wash overall.

So it is hard to tell if this will cause a real issue for -Os.

[Bug target/117008] -march=native pessimization of 25% with bitset popcount

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

Andrew Pinski  changed:

   What|Removed |Added

 Ever confirmed|1   |0
 Blocks||53947
 Status|NEW |UNCONFIRMED

--- Comment #3 from Andrew Pinski  ---
Note your `total+=values[index];` loop could be reduced to just
`total += values.count();`, and that will be over 10x faster.


I am not sure this is a useful benchmark either, because count uses
popcount directly. Maybe GCC could detect the popcount here but I am not sure.
LLVM does a slightly better job at vectorizing the loop but still messes it up.

Plus once you add other code around values[index], the vectorizer will no
longer kick in so the slow down is only for this bad micro-benchmark.
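
A minimal sketch of the two equivalent formulations, assuming the same
`std::bitset<1280>` as in the testcase. The function names `count_by_loop` and
`count_direct` are made up for illustration; nothing here measures the speed
difference.

```cpp
#include <bitset>
#include <cstddef>

// Bit-by-bit summation, as in the benchmark loop.
inline unsigned int count_by_loop(const std::bitset<1280>& values)
{
  unsigned int total = 0;
  for (std::size_t index = 0; index != values.size(); ++index)
    total += values[index];
  return total;
}

// The same result via the member function that counts set bits.
inline unsigned int count_direct(const std::bitset<1280>& values)
{
  return static_cast<unsigned int>(values.count());
}
```

Both functions return the number of set bits, so the loop can be replaced by
the `count()` call without changing the result.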


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug target/117008] -march=native pessimization of 25% with bitset

2024-10-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-10-08
 Target||x86_64-*-*

--- Comment #2 from Andrew Pinski  ---
Looks like the reduction loop is vectorized and that is causing the slow down.

Semi reduced (unincluded) testcase:
```
#include <bitset>
void g(std::bitset<1280> &);

int f()
{
  unsigned int total = 0;
  std::bitset<1280> values;
  g(values);
  for (unsigned int index = 0; index != 1280; ++index)
    total += values[index];
  return total;
}
```

For Linux, you need `-m32 -O2 -mavx2` (-m32 because it uses long, which is 32
bits on mingw but 64 bits on Linux, and the 64-bit case does not get vectorized).

[Bug target/80881] Implement Windows native TLS

2024-10-07 Thread tanksherman27 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881

--- Comment #27 from Julian Waters  ---
(In reply to LIU Hao from comment #26)
> Comment on attachment 59290 [details]
> Newer patch for TLS support, incomplete
> 
> > +  "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR 
> > [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR 
> > gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}"
> 
> For i686 this would be (untested):
> 
> ```
> "mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44,
> %1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}"
> ```
> 
> i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS),
> and addresses of global symbols are absolute (instead of being RIP-relative).

I think I remember clang using __tls_index instead of _tls_index for 32 bit as
well, but that's the only difference I remember. On another note, Cygwin
doesn't support TLS natively, right? Eric modified the stopgap patch above and
he put some definitions in cygming.h, since he expects it to support Cygwin as
well, but I vaguely remember you saying something about Cygwin not having the
support for this
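
The per-target differences listed in the quoted comment can be summarised in a
small sketch. The struct `TlsAccess`, its field names and `slot_offset` are
hypothetical and exist only for illustration; the constants (8 vs 4, 88 vs 44,
RIP-relative vs absolute) are taken from the quoted instruction sequences. The
choice of segment register (GS vs FS) is only noted in comments.

```cpp
#include <cstddef>

// Hypothetical description of the constants in the access sequence.
struct TlsAccess
{
  std::size_t pointer_size;  // scale used when indexing the TLS array
  std::size_t teb_offset;    // segment-relative offset of the array pointer
  bool rip_relative;         // how the _tls_index symbol is addressed
};

constexpr TlsAccess x86_64_access = { 8, 88, true  };  // %gs:88
constexpr TlsAccess i686_access   = { 4, 44, false };  // %fs:44

// Byte offset, within the TLS array, of the slot for a module index.
constexpr std::size_t slot_offset(const TlsAccess& t, unsigned int tls_index)
{
  return tls_index * t.pointer_size;
}

static_assert(slot_offset(x86_64_access, 3) == 24, "8-byte slots");
static_assert(slot_offset(i686_access, 3) == 12, "4-byte slots");
```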

[Bug target/80881] Implement Windows native TLS

2024-10-07 Thread lh_mouse at 126 dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881

--- Comment #28 from LIU Hao  ---
(In reply to Julian Waters from comment #27)
> I think I remember clang using __tls_index instead of _tls_index for 32 bit
> as well, but that's the only difference I remember. On another note, Cygwin

Yes, you are right. Solely for i686, external symbols have to be prefixed by an
underscore.


> doesn't support TLS natively, right? Eric modified the stopgap patch above
> and he put some definitions in cygming.h, since he expects it to support
> Cygwin as well, but I vaguely remember you saying something about Cygwin not
> having the support for this

Correct, because the Cygwin CRT doesn't have a TLS directory. You can use
`objdump -h` to print PE headers of a Cygwin executable, and there is no `.tls`
section.

An application may provide its own TLS directory, but it's not default.

[Bug target/117010] New: [nvptx] Incorrect ptx code-gen for C++ code with templates

2024-10-07 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117010

Bug ID: 117010
   Summary: [nvptx] Incorrect ptx code-gen for C++ code with
templates
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
For the test-case adapted from pr96390.C:

template<int> struct V
{
  int version_called;

  V ()
  {
    version_called = 1;
  }
};

void foo()
{
  V<0> v;
}

int
main ()
{
#pragma omp target
  {
    foo ();
  }
  return 0;
}

Compiling with -O0 -fopenmp -foffload=nvptx-none -foffload=-malias
-foffload=-mptx=6.3 results in the following error:

ptxas ./a.xnvptx-none.mkoffload.o, line 45; error   : Call to '_ZN1VILi0EEC1Ev'
requires call prototype
ptxas ./a.xnvptx-none.mkoffload.o, line 45; error   : Unknown symbol
'_ZN1VILi0EEC1Ev'
ptxas fatal   : Ptx assembly aborted due to errors
nvptx-as: ptxas returned 255 exit status 

This happens because the PTX code-gen emits a call to _ZN1VILi0EEC1Ev from
_Z3foov, but _ZN1VILi0EEC1Ev is not defined anywhere. Instead, the output
contains the following definition:

// BEGIN FUNCTION DECL: _ZN1VILi0EEC2Ev
.func _ZN1VILi0EEC2Ev (.param.u64 %in_ar0);

where _ZN1VILi0EEC1Ev is an (implicit) alias for _ZN1VILi0EEC2Ev in callgraph.

Thanks,
Prathamesh
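
To see how the two symbol names relate, they can be demangled. This is a small
sketch using `abi::__cxa_demangle` from `<cxxabi.h>`, assuming a toolchain that
follows the Itanium C++ ABI (GCC or Clang); the helper name `demangle` is made
up for illustration.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Returns the demangled form of an Itanium C++ ABI symbol name, or the
// input unchanged if demangling fails.
inline std::string demangle(const char* mangled)
{
  int status = 0;
  char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || text == nullptr)
    return mangled;
  std::string result(text);
  std::free(text);  // the buffer is allocated with malloc
  return result;
}
```

Both `demangle("_ZN1VILi0EEC1Ev")` and `demangle("_ZN1VILi0EEC2Ev")` should give
`V<0>::V()`: C1 is the complete-object constructor and C2 the base-object
constructor of the same source-level constructor, which is why one can be an
alias of the other.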

[Bug middle-end/117003] pr104783.c is miscompiled with offloading and results in segmentation fault during host-only execution for -O1 and above

2024-10-07 Thread tschwinge at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117003

Thomas Schwinge  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Thomas Schwinge  ---
Duplicate of PR105001 "If executing with non-nvptx offloading, but nvptx
offloading compilation is enabled: FAIL: libgomp.c/pr104783.c execution test",
as far as I can tell.

*** This bug has been marked as a duplicate of bug 105001 ***

[Bug middle-end/105001] If executing with non-nvptx offloading, but nvptx offloading compilation is enabled: FAIL: libgomp.c/pr104783.c execution test

2024-10-07 Thread tschwinge at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105001

Thomas Schwinge  changed:

   What|Removed |Added

 CC||prathamesh3492 at gcc dot gnu.org

--- Comment #3 from Thomas Schwinge  ---
*** Bug 117003 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/117000] Inefficient code for 32-byte struct comparison (ptest missing)

2024-10-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000

Richard Biener  changed:

   What|Removed |Added

  Component|target  |tree-optimization
 Status|UNCONFIRMED |ASSIGNED
Version|unknown |13.3.0
   Last reconfirmed||2024-10-08
 Ever confirmed|0   |1

--- Comment #3 from Richard Biener  ---
In particular we miss the fact that

  _29 = .REDUC_IOR (vect_folded_10.32_23);
  _12 = _29 == 0;

could be optimized to

  _12 = vect_folded_10.32_23 == {0, 0, ... };

it's probably too late for RTL to realize this.  Some pattern in match.pd
could handle this, like

(for cmp (eq ne)
 (simplify
  (cmp (IFN_REDUC_IOR @0) integer_zerop)
  (cmp @0 { build_zero_cst (TREE_TYPE (@0)); } )))

which results in

_Z5test1RK4U256:
.LFB5:
        .cfi_startproc
        movdqu  (%rdi), %xmm0
        movdqu  16(%rdi), %xmm1
        por     %xmm1, %xmm0
        ptest   %xmm0, %xmm0
        sete    %al
        ret

_Z5test2RK4U256:
.LFB6:
        .cfi_startproc
        movdqu  16(%rdi), %xmm0
        movdqu  (%rdi), %xmm1
        por     %xmm1, %xmm0
        ptest   %xmm0, %xmm0
        sete    %al
        ret
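
The identity behind the proposed simplification can be checked on a scalar
model: OR-ing all lanes together yields zero exactly when every lane is zero.
This is a sketch only, with a hypothetical four-lane `Vec` type and helper
names; it models the arithmetic, not GIMPLE or match.pd, and assumes C++14.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint64_t, 4>;  // hypothetical 4-lane vector

// Models `.REDUC_IOR (v) == 0`: OR all lanes, then compare with zero.
constexpr bool zero_by_reduction(const Vec& v)
{
  std::uint64_t acc = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    acc |= v[i];
  return acc == 0;
}

// Models `v == {0, 0, ...}` as a whole-vector test: every lane is zero.
constexpr bool zero_by_compare(const Vec& v)
{
  for (std::size_t i = 0; i < v.size(); ++i)
    if (v[i] != 0)
      return false;
  return true;
}

static_assert(zero_by_reduction(Vec{{0, 0, 0, 0}})
              == zero_by_compare(Vec{{0, 0, 0, 0}}), "all-zero case agrees");
static_assert(zero_by_reduction(Vec{{0, 0, 8, 0}})
              == zero_by_compare(Vec{{0, 0, 8, 0}}), "non-zero case agrees");
```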
