[Bug target/89213] Optimize V2DI shifts by a constant on power8 & above systems.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89213 Michael Meissner changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Michael Meissner --- Patch committed on September 18th, 2024.
[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997 --- Comment #1 from avieira at gcc dot gnu.org --- Had a look at this and I see similar codegen for aarch64 when compiling for big-endian. If I disable tree-ifcvt (-fdisable-tree-ifcvt) I end up with:

MEM [(void *)Ptr.0_1] = 30071062528;

which is the behaviour we want to see. This is achieved by store-merging; we should have a look at how that pass handles this. When ifcvt is enabled, lower_bitfield generates:

_ifc__24 = Ptr.0_1->D.4418;
_ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;

That gets optimized to:

Ptr.0_1->D.4418 = 3;

whereas I expected that to get big-endiannised (not a word I know) to = 0x60. I was also surprised to see that the front-end already transforms 'if (GlobS.f2 != 3)' into '(BIT_FIELD_REF & 4292870144) != 6291456'. Anyway, that's as far as I got; not sure what the right solution is. Should BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> not fold to 0x6 ?
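For what it's worth, the big-endian constant the reporter expected can be reproduced with a small sketch. The 16-bit container width and the helper name `insert_be` are assumptions for illustration; the PR does not show the full struct layout.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a big-endian bitfield insert: a field at bit position 0
// occupies the most-significant bits of its container, so inserting
// the value 3 into an 11-bit field of an assumed 16-bit container
// shifts it left by 16 - 11 = 5 bits.
uint16_t insert_be(uint16_t container, uint16_t value, int width)
{
    int shift = 16 - width;                       // field starts at the MSB end
    uint16_t mask = (uint16_t)(((1u << width) - 1u) << shift);
    return (uint16_t)((container & ~mask) | ((uint16_t)(value << shift) & mask));
}
```

Under these assumptions, insert_be(0, 3, 11) comes out as 0x0060, matching the 0x60 the reporter expected in the stored representation.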
[Bug middle-end/117003] New: pr104783.c is miscompiled with offloading and results in segmentation fault during host-only execution for -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117003 Bug ID: 117003 Summary: pr104783.c is miscompiled with offloading and results in segmentation fault during host-only execution for -O1 and above Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: prathamesh3492 at gcc dot gnu.org Target Milestone: --- Hi, The following test (libgomp/pr104783.c):

int main (void)
{
  unsigned val = 0;
  #pragma omp target map(tofrom: val)
  #pragma omp simd
  for (int i = 0 ; i < 1 ; i++)
    {
      #pragma omp atomic update
      val = val + 1;
    }
  if (val != 1)
    __builtin_abort ();
  return 0;
}

Compiling with -O1 -fopenmp -foffload=nvptx-none and forcing host-only execution results in a segfault. The issue here seems to be during omp lowering:

D.4569 = .GOMP_USE_SIMT ();
if (D.4569 != 0) goto ; else goto ;
:
{
  void * simduid.2;
  void * .omp_simt.3;
  int i;
  simduid.2 = .GOMP_SIMT_ENTER (simduid.2);
  .omp_simt.3 = .GOMP_SIMT_ENTER_ALLOC (simduid.2);
  #pragma omp simd _simduid_(simduid.2) _simt_ linear(i:1)
  for (i = 0; i < 1; i = i + 1)
  D.4577 = .omp_data_i->val;
  #pragma omp atomic_load relaxed
  D.4558 = *D.4577
  D.4559 = D.4558 + 1;
  #pragma omp atomic_store relaxed (D.4559)
  #pragma omp continue (i, i)
  .GOMP_SIMT_EXIT (.omp_simt.3);
  #pragma omp return(nowait)
}
goto ;
:
{
  int i;
  #pragma omp simd linear(i:1)
  for (i = 0; i < 1; i = i + 1)
  #pragma omp atomic_load relaxed
  D.4558 = *&*D.4577
  D.4559 = D.4558 + 1;
  #pragma omp atomic_store relaxed (D.4559)
  #pragma omp continue (i, i)
  #pragma omp return(nowait)
}
:
...

In the following stmt in the simd code-path:

#pragma omp atomic_load relaxed
D.4558 = *&*D.4577

D.4577 is uninitialized.
D.4577 is initialized in the sibling if-block containing the simt code-path:

D.4577 = .omp_data_i->val;

which doesn't reach the use in the atomic_load relaxed stmt, and gets expanded to the following in the ompexp dump:

:
__atomic_fetch_add_4 (D.4590, 1, 0);

and thus we end up passing an uninitialized pointer to __atomic_fetch_add_4, which results in the segfault. In the corresponding simt code-path it's initialized correctly:

:
D.4590 = .omp_data_i->val;
__atomic_fetch_add_4 (D.4590, 1, 0);

Thanks, Prathamesh
[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997 Andrew Pinski changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2024-10-07 --- Comment #2 from Andrew Pinski --- >I was also surprised to see that the front-end already transforms: 'if (GlobS.f2 != 3)' into '(BIT_FIELD_REF & 4292870144) != 6291456' That is from fold. Specifically optimize_bit_field_compare .
[Bug target/80881] Implement Windows native TLS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 --- Comment #25 from Julian Waters --- Created attachment 59290 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59290&action=edit Newer patch for TLS support, incomplete
[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997 --- Comment #3 from Andrew Pinski --- https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537612.html I can't remember if the fix was committed or not ...
[Bug d/117002] New: lifetime.d: In function ‘_d_newclassT’: error: size of array element is not a multiple of its alignment with -Warray-bounds and -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117002 Bug ID: 117002 Summary: lifetime.d: In function ‘_d_newclassT’: error: size of array element is not a multiple of its alignment with -Warray-bounds and -O2 Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: a.horodniceanu at proton dot me Target Milestone: --- Reported first at https://bugs.gentoo.org/940750 and reduced to:
--
module object;

extern(C++) class Foo {
    ubyte[4] not_multiple_of_8;
}

extern(C) int main () {
    // avoid optimizations
    void* p = cast(void*)(0xdeadbeef);
    auto init = __traits(initSymbol, Foo);
    p[0 .. init.length] = init[];
    return 0;
}
--
Compile with:
--
$ gdc repro.d -Warray-bounds -O2
/root/repro.d: In function ‘main’:
/root/repro.d:8:5: error: size of array element is not a multiple of its alignment
    8 | int main () {
      |     ^
--
The code is a reduction of _d_newclassT in core/lifetime.d with the original error being:
--
$ cat repro.d
class Foo {
    ubyte[4] not_multiple_of_8;
}

void foo () {
    new Foo();
}

$ gdc repro.d -Warray-bounds -O2
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/d/core/lifetime.d: In function ‘_d_newclassT’:
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/d/core/lifetime.d:2725:3: error: size of array element is not a multiple of its alignment
 2725 | T _d_newclassT(T)() @trusted
      |   ^
--
[Bug c/116735] ICE in build_counted_by_ref
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116735 --- Comment #4 from GCC Commits --- The master branch has been updated by Qing Zhao : https://gcc.gnu.org/g:9a17e6d03c6ed53e3b2dfd2c3ff9b1066ffa97b9 commit r15-4122-g9a17e6d03c6ed53e3b2dfd2c3ff9b1066ffa97b9 Author: qing zhao Date: Mon Sep 30 18:29:29 2024 + c: ICE in build_counted_by_ref [PR116735] When handling the counted_by attribute, if the corresponding field doesn't exist, in addition to issuing an error, we should also remove the already added non-existing "counted_by" attribute from the field_decl. PR c/116735 gcc/c/ChangeLog: * c-decl.cc (verify_counted_by_attribute): Remove the attribute when error. gcc/testsuite/ChangeLog: * gcc.dg/flex-array-counted-by-9.c: New test.
[Bug middle-end/117002] lifetime.d: In function ‘_d_newclassT’: error: size of array element is not a multiple of its alignment with -Warray-bounds and -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117002 Andrew Pinski changed: What|Removed |Added Last reconfirmed||2024-10-07 Ever confirmed|0 |1 See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=116481 Status|UNCONFIRMED |NEW --- Comment #1 from Andrew Pinski --- 100% related to PR 116481 which is about arrays of function types.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #13 from GCC Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:2abd04d01bc4e18158c785e75c91576b836f3ba6 commit r15-4113-g2abd04d01bc4e18158c785e75c91576b836f3ba6 Author: Richard Sandiford Date: Mon Oct 7 13:03:04 2024 +0100 vect: Restructure repeating_p case for SLP permutations [PR116583] The repeating_p case previously handled the specific situation in which the inputs have N lanes and the output has N lanes, where N divides the number of vector elements. In that case, every output uses the same permute vector. The code was therefore structured so that the outer loop only constructed one permute vector, with an inner loop generating as many VEC_PERM_EXPRs from it as required. However, the main patch for PR116583 adds support for cycling through N permute vectors, rather than just having one. The current structure doesn't really handle that case well. (We'd need to interleave the results after generating them, which sounds a bit fragile.) This patch instead makes the transform phase calculate each output vector's permutation explicitly, like for the !repeating_p path. As a bonus, it gets rid of one use of SLP_TREE_NUMBER_OF_VEC_STMTS. This arguably undermines one of the justifications for using repeating_p for constant-length vectors: that the repeating_p path involved less work than the !repeating_p path. That justification does still hold for the analysis phase, though, and that should be the more time-sensitive part. And the other justification -- to get more coverage of the code -- still applies. So I'd prefer that we continue to use repeating_p for constant-length vectors unless that causes a known missed optimisation. gcc/ PR tree-optimization/116583 * tree-vect-slp.cc (vectorizable_slp_permutation_1): Remove the noutputs_per_mask inner loop and instead generate a separate permute vector for each output.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #11 from GCC Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:1048ebbbdc98a5928a974356d7f4244603b6bd32 commit r15-4110-g1048ebbbdc98a5928a974356d7f4244603b6bd32 Author: Richard Sandiford Date: Mon Oct 7 13:03:02 2024 +0100 aarch64: Handle SVE modes in aarch64_evpc_reencode [PR116583] For Advanced SIMD modes, aarch64_evpc_reencode tests whether a permute in a narrow element mode can be done more cheaply in a wider mode. For example, { 0, 1, 8, 9, 4, 5, 12, 13 } on V8HI is a natural TRN1 on V4SI ({ 0, 4, 2, 6 }). This patch extends the code to handle SVE data and predicate modes as well. This is a prerequisite to getting good results for PR116583. gcc/ PR target/116583 * config/aarch64/aarch64.cc (aarch64_coalesce_units): New function, extending the Advanced SIMD handling from... (aarch64_evpc_reencode): ...here to SVE data and predicate modes. gcc/testsuite/ PR target/116583 * gcc.target/aarch64/sve/permute_1.c: New test. * gcc.target/aarch64/sve/permute_2.c: Likewise. * gcc.target/aarch64/sve/permute_3.c: Likewise. * gcc.target/aarch64/sve/permute_4.c: Likewise.
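The wide-mode check described in the commit message boils down to pairing adjacent narrow lanes. A rough sketch of the index arithmetic follows; `coalesce_units` is a hypothetical stand-in for illustration, not the real aarch64_coalesce_units implementation.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Coalesce a narrow-element permute into one with elements twice as
// wide: each pair of narrow indices (2k, 2k+1) becomes the single
// wide index k.  Fails if any pair is not an even-aligned contiguous
// run, in which case no wide-mode equivalent exists.
std::optional<std::vector<int>> coalesce_units(const std::vector<int> &narrow)
{
    if (narrow.size() % 2 != 0)
        return std::nullopt;
    std::vector<int> wide;
    for (std::size_t i = 0; i < narrow.size(); i += 2) {
        int first = narrow[i];
        if (first % 2 != 0 || narrow[i + 1] != first + 1)
            return std::nullopt;          // not a contiguous even-aligned pair
        wide.push_back(first / 2);
    }
    return wide;
}
```

On the example from the commit message, { 0, 1, 8, 9, 4, 5, 12, 13 } coalesces to { 0, 4, 2, 6 }, the V4SI TRN1 pattern.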
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #12 from GCC Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:1732298d51028ae50a802e538df5d7249556255d commit r15-4112-g1732298d51028ae50a802e538df5d7249556255d Author: Richard Sandiford Date: Mon Oct 7 13:03:03 2024 +0100 vect: Variable lane indices in vectorizable_slp_permutation_1 [PR116583] The main patch for PR116583 needs to create variable indices into an input vector. This pre-patch changes the types to allow that. There is no pretty-print format for poly_uint64 because of issues with passing C++ objects through "...". gcc/ PR tree-optimization/116583 * tree-vect-slp.cc (vectorizable_slp_permutation_1): Use poly_uint64 for scalar lane indices.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #14 from GCC Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:8157f3f2d211bfbf53fbf8dd209b47ce583f4142 commit r15-4114-g8157f3f2d211bfbf53fbf8dd209b47ce583f4142 Author: Richard Sandiford Date: Mon Oct 7 13:03:04 2024 +0100 vect: Support more VLA SLP permutations [PR116583] This is the main patch for PR116583. Previously, we only supported VLA SLP permutations for which the output and inputs have the same number of lanes, and for which that number of lanes divides the number of vector elements. The patch extends this to handle: (1) "packs" of a single 2N-vector input into an N-vector output (2) "unpacks" of N-vector inputs into an XN-vector output Hopefully the comments in the code explain the approach. The contents of the: for (unsigned i = 0; i < ncopies; ++i) loop do not change; the patch simply adds an outer loop around it. The patch removes the XFAIL in slp-13.c and also improves the SVE vect.exp results with vect-force-slp=1. I haven't added new tests specifically for this, since presumably the existing ones will cover it once the SLP switch is flipped. gcc/ PR tree-optimization/116583 * tree-vect-slp.cc (vectorizable_slp_permutation_1): Handle variable-length pack and unpack permutations. gcc/testsuite/ PR tree-optimization/116583 * gcc.dg/vect/slp-13.c: Remove xfail for vect_variable_length. * gcc.dg/vect/slp-13-big-array.c: Likewise.
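The pack/unpack cases reduce to lane-index rules that don't depend on the runtime vector length. For instance, the even/odd extract from the bug title can be sketched as below; this is illustrative only, not the tree-vect-slp.cc code.

```cpp
#include <vector>

// "Pack"-style even/odd extract: output lane j reads input lane
// 2*j + parity, whatever the (possibly variable) lane count is,
// which is why the rule carries over to VLA vectors.
std::vector<int> extract_even_odd(int nlanes, int parity)
{
    std::vector<int> sel;
    for (int j = 0; j < nlanes; ++j)
        sel.push_back(2 * j + parity);
    return sel;
}
```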
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #15 from GCC Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:03299164830e19405b35a5fa862e248df4ea01e2 commit r15-4115-g03299164830e19405b35a5fa862e248df4ea01e2 Author: Richard Sandiford Date: Mon Oct 7 13:03:05 2024 +0100 vect: Add more dump messages for VLA SLP permutation [PR116583] Taking the !repeating_p route for VLA vectors causes analysis to fail, but it wasn't clear from the dump files when this had happened, and which node caused it. gcc/ PR tree-optimization/116583 * tree-vect-slp.cc (vectorizable_slp_permutation_1): Add more dump messages.
[Bug c/116735] ICE in build_counted_by_ref
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116735 qinzhao at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from qinzhao at gcc dot gnu.org --- fixed into GCC15
[Bug ada/114964] Ada Address_To_Access_Conversions gnat_to_gnu_entity internal error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114964 Ken Burtch changed: What|Removed |Added Resolution|--- |WONTFIX Status|WAITING |RESOLVED --- Comment #6 from Ken Burtch --- Your statement does not make sense to me, as the bug is clearly reproducible but your process is not adequate. I will close this bug report.
[Bug middle-end/116983] counted_by not used to identify singleton pointers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983 --- Comment #3 from Jakub Jelinek --- Plus the useless pointer conversions in GIMPLE can mean that

void *foo (int);

struct counted {
  int counter;
  int array[] __attribute__((counted_by(counter)));
};

struct notcounted {
  int counter;
  int array[];
};

int bar (int x)
{
  void *p = foo (x);
  if (x & 1)
    {
      struct counted *q = (struct counted *) p;
      use (q);
      return __builtin_dynamic_object_size (q, 0);
    }
  else
    {
      struct notcounted *r = (struct notcounted *) p;
      use2 (r);
      return __builtin_dynamic_object_size (r, 0);
    }
}

with CSE/SCCVN, whether one actually gets a struct counted *, struct notcounted * or void * pointer type is a lottery.
[Bug c++/117004] Unexpected const variable type with decltype of non-type template parameter of deduced type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004 --- Comment #1 from Andrew Pinski --- I think there is a dup of this one around.
[Bug c++/117008] New: -march=native pessimization of 25% with bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 Bug ID: 117008 Summary: -march=native pessimization of 25% with bitset Product: gcc Version: 13.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: mattreecebentley at gmail dot com Target Milestone: --- Created attachment 59292 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59292&action=edit ii file

Overview: Found a sequence of code using bitset where using -march=native and -O2 is 25% slower than just -O2, on an Intel i7-9750H. Repeatable; it also occurs on an Intel i7-3770 but with a much smaller decrease in performance (around ~5%). At -O2, runtime duration is ~15 seconds; with -march=native and -O2 it's ~20 seconds. Have used a PCG-based rand() in the code since regular rand() slows down the program by 5x (the difference in runtime duration between -march=native;-O2 and -O2 is still the same, but the percentage change is dramatically influenced by rand taking all the CPU time). The code adds values to a total to prevent it being optimized out. And yes, .count() would've probably been more typical code.

GCC version: 13.2.0
System type: x86_64-w64-mingw32, windows, x64
Configured with: ../src/configure --enable-languages=c,c++ --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --disable-multilib --prefix=/e/temp/gcc/dest --with-sysroot=/e/temp/gcc/dest --disable-libstdcxx-pch --disable-libstdcxx-verbose --disable-nls --disable-shared --disable-win32-registry --enable-threads=posix --enable-libgomp --with-zstd=/c/mingw

The complete command line that triggers the bug, for -O2 only:

C:/programming/libraries/nuwen/bin/g++.exe -c "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -std=c++23 -s -save-temps -DNDEBUG -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L. -O2 -s

Or for -march=native:

C:/programming/libraries/nuwen/bin/g++.exe -c "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -march=native -std=c++23 -s -save-temps -DNDEBUG -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L. -s

The compiler output (error messages, warnings, etc.): No errors or warnings.
[Bug middle-end/117009] New: Wall should be in common.opt rather than the language specific .opt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117009 Bug ID: 117009 Summary: Wall should be in common.opt rather than the language specific .opt Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: internal-improvement Severity: normal Priority: P3 Component: middle-end Assignee: pinskia at gcc dot gnu.org Reporter: pinskia at gcc dot gnu.org Target Milestone: --- While helping jemarch (on IRC) with language specific option handling, it was noticed that Wall is defined in some .opt files. It should be defined in common.opt and added to a language .opt only if that language needs some special handling beyond EnabledBy for it. Right now it is in:

c-family/c.opt:Wall
d/lang.opt:Wall
fortran/lang.opt:Wall
go/lang.opt:Wall
m2/lang.opt:Wall
rust/lang.opt:Wall

But it should only be in c-family/c.opt, d/lang.opt and m2/lang.opt:

c-family/c-opts.cc:case OPT_Wall:
d/d-lang.cc:case OPT_Wall:
m2/gm2-lang.cc:case OPT_Wall:

Note Wextra does not need to be in any of these either:

c-family/c.opt:Wextra
d/lang.opt:Wextra
fortran/lang.opt:Wextra

Because there is no special handling of OPT_Wextra.
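For readers unfamiliar with the option machinery, a .opt record is a short stanza: the option name, a properties line, then help text. The following is a hypothetical sketch of the shape involved; the exact property list a moved Wall entry would need (including whether it gets a Var), and the Wexample option, are assumptions, not the real GCC entries.

```
; Hypothetical common.opt entry:
Wall
Common Warning
Enable most warning messages.

; A per-warning option can then be switched on by it via EnabledBy:
Wexample
Common Var(warn_example) Warning EnabledBy(Wall)
Warn about example conditions.
```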
[Bug middle-end/117009] Wall should be in common.opt rather than the language specific .opt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117009 Andrew Pinski changed: What|Removed |Added Last reconfirmed||2024-10-08 Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Andrew Pinski --- Mine to handle.
[Bug target/117008] -march=native pessimization of 25% with bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 --- Comment #1 from Andrew Pinski --- Can you provide the output of invoking g++ with -march=native and when compiling?
[Bug analyzer/116996] New: Missed Detection of Null Pointer Dereference Issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116996 Bug ID: 116996 Summary: Missed Detection of Null Pointer Dereference Issues Product: gcc Version: 11.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: analyzer Assignee: dmalcolm at gcc dot gnu.org Reporter: tianxinghe at smail dot nju.edu.cn Target Milestone: --- $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) gcc 1.c -fanalyzer inputs can be ./t 1826809252 0 0 The checker 
misses the null dereference bug at the line: v7 = *v6; This bug can be detected by GCC 14.2.0. The bug path:

main: entry - block2 - block5
func: entry - block1 - block4 - block5 - block2 - block6

minimal test case:

#include
#include
#ifndef __cplusplus
typedef unsigned char bool;
#endif

void init(char**);
uint8_t* malloc(uint64_t);
uint64_t atol(uint8_t*);
void free(uint8_t*);
int main(int, char **);
void func(uint64_t**, uint64_t*, uint64_t**, uint64_t);

uint64_t* args;

void init(char** argv) {
  args = (uint64_t*) malloc(3 * sizeof(uint64_t));
  for (int i = 1; i <= 3; ++i) {
    args[i - 1] = atol(argv[i]);
  }
}

int main(int argc, char ** argv) {
  uint64_t input0;
  uint64_t* v1;
  uint64_t* v2;
  uint64_t v3;
  uint64_t v4;
  bool v5;
  uint64_t v6;
  init(argv);
  input0 = args[0];
  v3 = input0 * 37;
  v2 = (uint64_t*)/*NULL*/0;
  v6 = 1320690439;
  if (input0 == 1826809252) { goto block2; } else { goto block3; }
block5:
  v1 = &v4;
  *v1 = 171952983;
  func(&v1, &v3, &v2, input0);
  return 0;
block2:
  if (input0 >= 2606378767) { goto block4; } else { goto block5; }
block3:
  v2 = &v6;
  goto block2;
block4:
  v2 = &v6;
  goto block5;
}

void func(uint64_t** a1, uint64_t* a2, uint64_t** a3, uint64_t a4) {
  uint64_t v1;
  uint8_t v2;
  uint64_t v3;
  uint64_t v4;
  uint64_t* v5;
  uint64_t* v6;
  uint64_t v7;
  uint64_t* v8;
  uint64_t** v9;
  uint64_t* v10;
  uint64_t** v11;
  uint64_t* v12;
  uint64_t v13;
  uint64_t* v14;
  v2 = ((uint8_t)a4);
  v3 = *a2;
  v4 = ((int64_t)(int8_t)v2);
  if (v4 >= (int64_t)v3) { goto block1; } else { goto block2; }
block5:
  goto block2;
block4:
  v5 = *v9;
  *v5 = 1;
  goto block5;
block6:
  v6 = *a3;
  v7 = *v6;
  v13 = v7;
  return;
block3:
  *v9 = (&v1);
  goto block4;
block2:
  v8 = *a1;
  *v8 = 1;
  goto block6;
block1:
  v9 = (&v14);
  v14 = &v13;
  if ((int64_t)v3 <= (int64_t)v4) { goto block3; } else { goto block4; }
}
[Bug analyzer/116995] New: Missed Detection of Null Pointer Dereference Issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116995 Bug ID: 116995 Summary: Missed Detection of Null Pointer Dereference Issues Product: gcc Version: 11.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: analyzer Assignee: dmalcolm at gcc dot gnu.org Reporter: tianxinghe at smail dot nju.edu.cn Target Milestone: --- $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) gcc 1.c -fanalyzer inputs can be ./t 1826809252 0 0 The checker 
misses the null dereference bug at the line: v7 = *v6; This bug can be detected by GCC 14.2.0. The bug path:

main: entry - block2 - block5
func: entry - block1 - block4 - block5 - block2 - block6

minimal test case:

#include
#include
#ifndef __cplusplus
typedef unsigned char bool;
#endif

void init(char**);
uint8_t* malloc(uint64_t);
uint64_t atol(uint8_t*);
void free(uint8_t*);
int main(int, char **);
void func(uint64_t**, uint64_t*, uint64_t**, uint64_t);

uint64_t* args;

void init(char** argv) {
  args = (uint64_t*) malloc(3 * sizeof(uint64_t));
  for (int i = 1; i <= 3; ++i) {
    args[i - 1] = atol(argv[i]);
  }
}

int main(int argc, char ** argv) {
  uint64_t input0;
  uint64_t* v1;
  uint64_t* v2;
  uint64_t v3;
  uint64_t v4;
  bool v5;
  uint64_t v6;
  init(argv);
  input0 = args[0];
  v3 = input0 * 37;
  v2 = (uint64_t*)/*NULL*/0;
  v6 = 1320690439;
  if (input0 == 1826809252) { goto block2; } else { goto block3; }
block5:
  v1 = &v4;
  *v1 = 171952983;
  func(&v1, &v3, &v2, input0);
  return 0;
block2:
  if (input0 >= 2606378767) { goto block4; } else { goto block5; }
block3:
  v2 = &v6;
  goto block2;
block4:
  v2 = &v6;
  goto block5;
}

void func(uint64_t** a1, uint64_t* a2, uint64_t** a3, uint64_t a4) {
  uint64_t v1;
  uint8_t v2;
  uint64_t v3;
  uint64_t v4;
  uint64_t* v5;
  uint64_t* v6;
  uint64_t v7;
  uint64_t* v8;
  uint64_t** v9;
  uint64_t* v10;
  uint64_t** v11;
  uint64_t* v12;
  uint64_t v13;
  uint64_t* v14;
  v2 = ((uint8_t)a4);
  v3 = *a2;
  v4 = ((int64_t)(int8_t)v2);
  if (v4 >= (int64_t)v3) { goto block1; } else { goto block2; }
block5:
  goto block2;
block4:
  v5 = *v9;
  *v5 = 1;
  goto block5;
block6:
  v6 = *a3;
  v7 = *v6;
  v13 = v7;
  return;
block3:
  *v9 = (&v1);
  goto block4;
block2:
  v8 = *a1;
  *v8 = 1;
  goto block6;
block1:
  v9 = (&v14);
  v14 = &v13;
  if ((int64_t)v3 <= (int64_t)v4) { goto block3; } else { goto block4; }
}
[Bug tree-optimization/116998] [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998 Richard Biener changed: What|Removed |Added Keywords||needs-testcase --- Comment #1 from Richard Biener --- As noted in the bug, what PRE considers possibly trapping is quite conservative (and has to be, as no flow-sensitive info can be used easily). But the fix was a correctness one, so I'm not sure what we can do. Needs a testcase showing the 400.perlbench case for analysis.
[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979 --- Comment #5 from Richard Biener --- (In reply to Andrew Pinski from comment #4) > x86: > > /app/example.cpp:6:1: note: Cost model analysis for part in loop 0: > Vector cost: 172 > Scalar cost: 184 > > > > aarch64: > /app/example.cpp:6:1: note: Cost model analysis for part in loop 0: > Vector cost: 37 > Scalar cost: 12 > > So yes a cost model issue. Note the vectorizer only does overall costing vector vs. scalar, it doesn't cost the variant with no addsub because on the vector side that's inferior (it would use blend). The vectorizer doesn't see FMAs, those get introduced later. Backend costing would need to anticipate FMA use for scalar costing and anticipate FMA cannot be used in the vector version. That's currently very difficult due to the lack of data dependence info on the vector cost side. If you want FMA for precision reasons you should probably use std::fma(..) directly.
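The suggestion at the end of the comment can be made concrete. Below is a sketch of a complex product with the fused operations spelled out via std::fma; this is illustrative only, not the libstdc++ implementation of complex operator*.

```cpp
#include <cmath>
#include <complex>

// (a.re + i*a.im) * (b.re + i*b.im), with each component written as
// one std::fma so the fusion no longer depends on what the
// vectorizer or later RTL passes decide to do.
std::complex<double> fma_mul(std::complex<double> a, std::complex<double> b)
{
    double re = std::fma(a.real(), b.real(), -(a.imag() * b.imag()));
    double im = std::fma(a.real(), b.imag(),   a.imag() * b.real());
    return {re, im};
}
```

For example, fma_mul({1, 2}, {3, 4}) yields (-5, 10), the same value as the unfused product when all intermediates are exact.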
[Bug tree-optimization/116982] [14/15 Regression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982 Richard Biener changed: What|Removed |Added Status|NEW |ASSIGNED Version|unknown |15.0 Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org --- Comment #2 from Richard Biener --- I will have a look.
[Bug c++/113958] support visibility attribute for typeinfo symbol
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113958 Nikolas Klauser changed: What|Removed |Added CC||nikolasklauser at berlin dot de --- Comment #4 from Nikolas Klauser --- This would also be really useful for libc++ in a slightly different usage pattern. Generally, we want everything to have hidden visibility to avoid baking lots of functions into the ABI, but the typeinfo and friends have to have default visibility to support `dynamic_cast`ing across shared libraries. If we have the attribute available we can simply have `namespace __attribute__((visibility("hidden"), type_visibility("default"))) std { ... }` instead of having to annotate every single class. This would save us north of 800 annotations (plus ones we're missing) across the library, since we can roll this into our `_LIBCPP_BEGIN_NAMESPACE_STD` macro.
[Bug analyzer/116996] Missed Detection of Null Pointer Dereference Issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116996 Richard Biener changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Richard Biener --- . *** This bug has been marked as a duplicate of bug 116995 ***
[Bug analyzer/116995] Missed Detection of Null Pointer Dereference Issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116995 --- Comment #1 from Richard Biener --- *** Bug 116996 has been marked as a duplicate of this bug. ***
[Bug middle-end/116997] [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997 Richard Biener changed: What|Removed |Added CC||avieira at gcc dot gnu.org, ||rguenth at gcc dot gnu.org Blocks||53947 Target Milestone|--- |13.4 Version|unknown |15.0 Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/116855] [14/15 Regression] Unsafe early-break vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855 --- Comment #7 from rguenther at suse dot de --- On Sun, 6 Oct 2024, fxue at os dot amperecomputing.com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855 > > --- Comment #5 from Feng Xue --- > (In reply to Tamar Christina from comment #4) > > (In reply to Richard Biener from comment #3) > > > I would suggest to add a STMT_VINFO_ENSURE_NOTRAP or so and delay actual > > > verification to vectorizable_load when both vector type and VF are fixed. > > > We'd then likely need a LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P set > > > conservatively the other way around from > > > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. > > > Alignment peeling could then peel if STMT_VINFO_ENSURE_NOTRAP and the > > > target > > > cannot do full loop masking. > > > > > > Yeah the original reported testcase is fine as the alignment makes it safe. > > For the manually misaligned case that Andrew posted it makes sense to delay > > and re-evaluate later on. > > > > I don't think we should bother peeling though, I don't think they're that > > common and alignment peeling breaks some dominators and exposes some > > existing vectorizer bugs, which is being fixed in Alex's patch. > > > > So at least alignment peeling I'll defer to a later point and instead just > > reject loops that are loading from structures the user misaligned wrt to the > > vector mode. > > > > > > So mine.. > > Actually, what I wish is that we could allow vectorization on early break case > for arbitrary address pointer (knowing nothing about alignment and bound) > based > on some sort of assumption specified via command option under -Ofast, as the > mentioned example: I'd rather not have more command-line options gating "unsafe" transforms but instead have source-level control per loop via pragma. 
It should probably specify a length like simdlen, specifying that accessing
[start + n * accesslen, start + (n+1)*accesslen - 1] is OK when the scalar
loop accesses [start + n * accesslen] or so.

> char * find(char *string, size_t n, char c)
> {
>     for (size_t i = 0; i < n; i++) {
>         if (string[i] == c)
>             return &string[i];
>     }
>     return 0;
> }
>
> and example for which there is no way to do peeling to align more than one
> address pointers:
>
> int compare(char *string1, char *string2, size_t n)
> {
>     for (size_t i = 0; i < n; i++) {
>         if (string1[i] != string2[i])
>             return string1[i] - string2[i];
>     }
>     return 0;
> }
[Bug target/116934] [15 Regression] ICE building 526.blender_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116934 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #5 from ktkachov at gcc dot gnu.org --- Thanks for the fix!
[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683 Alex Coplan changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Alex Coplan --- Fixed.
[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683 --- Comment #6 from GCC Commits --- The master branch has been updated by Alex Coplan : https://gcc.gnu.org/g:7faadb1f261c6b8ef988c400c39ec7df09839dbe commit r15-4106-g7faadb1f261c6b8ef988c400c39ec7df09839dbe Author: Alex Coplan Date: Thu Sep 26 16:36:48 2024 +0100 testsuite: Prevent unrolling of main in LTO test [PR116683] In r15-3585-g9759f6299d9633cabac540e5c893341c708093ac I added a test which started failing on PowerPC. The test checks that we unroll exactly one loop three times with the following: // { dg-final { scan-ltrans-rtl-dump-times "Unrolled loop 3 times" 1 "loop2_unroll" } } which passes on most targets. However, on PowerPC, the loop in main gets unrolled too, causing the scan-ltrans-rtl-dump-times check to fail as the statement now appears twice in the dump. I think the extra unrolling is due to different unrolling heuristics in the rs6000 port. This patch therefore explicitly tries to block the unrolling in main with an appropriate #pragma. gcc/testsuite/ChangeLog: PR testsuite/116683 * g++.dg/ext/pragma-unroll-lambda-lto.C (main): Add #pragma to prevent unrolling of the setup loop.
[Bug tree-optimization/116998] New: [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998 Bug ID: 116998 Summary: [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: pheeck at gcc dot gnu.org CC: rguenth at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-pc-linux-gnu Target: x86_64-pc-linux-gnu As seen here https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=956.10.0 https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=469.10.0 there was a ~5% exec time slowdown of the 400.perlbench SPEC 2006 benchmark when run with -O2 -flto on AMD Zen3/4 machines (maybe also on other Zen microarchs). I bisected the slowdown to r15-3986-g3e1bd6470e4deb (Cc-ing richi). See comparison with other branches here: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=938.10.0&plot.1=999.10.0&plot.2=971.10.0&plot.3=1010.10.0&plot.4=956.10.0&; Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug target/116999] New: Fold SVE whilelt/le comparisons with max int value to ptrue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116999 Bug ID: 116999 Summary: Fold SVE whilelt/le comparisons with max int value to ptrue Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64

Example testcase:

#include <arm_sve.h>
#include <limits.h>

svbool_t foo_s32_le (int32_t x)
{
  return svwhilele_b64_s32 (x, INT_MAX);
}

svbool_t foo_s64_le (int64_t x)
{
  return svwhilele_b64_s64 (x, LONG_LONG_MAX);
}

can avoid generating the WHILELE instructions and just generate a PTRUE.
This is as per the WHILELE documentation: "If the second scalar operand is
equal to the maximum signed integer value then a condition which includes an
equality test can never fail and the result will be an all-true predicate."

Note that we probably want to look at the use of the flags from the whilele
as well. If we cannot prove that the NZCV are unused then we have to generate
a PTRUES instead, I think.
[Bug tree-optimization/116998] [15 Regression] 5% slowdown of 400.perlbench on AMD Zen3/4 since r15-3986-g3e1bd6470e4deb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116998 Filip Kastl changed: What|Removed |Added Target Milestone|--- |15.0
[Bug target/116999] Fold SVE whilelt/le comparisons with max int value to ptrue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116999 --- Comment #1 from ktkachov at gcc dot gnu.org --- This is inspired by the LLVM PR https://github.com/llvm/llvm-project/pull/83
[Bug middle-end/116896] codegen for <=> compared to hand-written equivalent
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116896
--- Comment #29 from GCC Commits ---
The master branch has been updated by Jakub Jelinek :
https://gcc.gnu.org/g:37554bacfd38b1466278b529d9e70a44d7b1b909
commit r15-4105-g37554bacfd38b1466278b529d9e70a44d7b1b909
Author: Jakub Jelinek
Date: Mon Oct 7 10:50:39 2024 +0200

    ssa-math-opts, i386: Improve spaceship expansion [PR116896]

    The PR notes that we don't emit optimal code for C++ spaceship operator
    if the result is returned as an integer rather than the result just being
    compared against different values and different code executed based on
    that.  So e.g. for

    template <typename T>
    auto foo (T x, T y) { return x <=> y; }

    for both floating point types, signed integer types and unsigned integer
    types.  auto in that case is std::strong_ordering or
    std::partial_ordering, which are fancy C++ abstractions around struct
    with signed char member which is -1, 0, 1 for the strong ordering and
    -1, 0, 1, 2 for the partial ordering (but for -ffast-math 2 is never the
    case).  I'm afraid functions like that are fairly common and unless they
    are inlined, we really need to map the comparison to those -1, 0, 1 or
    -1, 0, 1, 2 values.

    Now, for floating point spaceship I've in the past already added an
    optimization (with tree-ssa-math-opts.cc discovery and named optab, the
    optab only defined on x86 though right now), which ensures there is just
    a single comparison instruction and then just tests based on flags.

    Now, if we have code like:

    auto a = x <=> y;
    if (a == std::partial_ordering::less)
      bar ();
    else if (a == std::partial_ordering::greater)
      baz ();
    else if (a == std::partial_ordering::equivalent)
      qux ();
    else if (a == std::partial_ordering::unordered)
      corge ();

    etc., that results in decent code generation, the spaceship named pattern
    on x86 optimizes for the jumps, so emits comparisons on the flags,
    followed by setting the result to -1, 0, 1, 2 and subsequent jump pass
    optimizes that well.
    But if the result needs to be stored into an integer and just returned
    that way or there are no immediate jumps based on it (or turned into some
    non-standard integer values like -42, 0, 36, 75 etc.), then CE doesn't do
    a good job for that, we end up with say

        comiss  %xmm1, %xmm0
        jp      .L4
        seta    %al
        movl    $0, %edx
        leal    -1(%rax,%rax), %eax
        cmove   %edx, %eax
        ret
    .L4:
        movl    $2, %eax
        ret

    The jp is good, that is the unlikely case and can't be easily handled in
    straight line code due to the layout of the flags, but the rest uses cmov
    which often isn't a win and a weird math.
    With the patch below we can get instead

        xorl    %eax, %eax
        comiss  %xmm1, %xmm0
        jp      .L2
        seta    %al
        sbbl    $0, %eax
        ret
    .L2:
        movl    $2, %eax
        ret

    The patch changes the discovery in the generic code, by detecting if the
    future .SPACESHIP result is just used in a PHI with -1, 0, 1 or
    -1, 0, 1, 2 values (the latter for HONOR_NANS) and passes that as a flag
    in a new argument to .SPACESHIP ifn, so that the named pattern is told
    whether it should optimize for branches or for loading the result into a
    -1, 0, 1 (, 2) integer.  Additionally, it doesn't detect just floating
    point <=> anymore, but also integer and unsigned integer, but in those
    cases only if an integer -1, 0, 1 is wanted (otherwise == and > or
    similar comparisons result in good code).  The backend then can for those
    integer or unsigned integer <=>s return effectively (x > y) - (x < y) in
    a way that is efficient on the target (so for x86 with ensuring zero
    initialization first when needed before setcc; one for floating point and
    unsigned, where there is just one setcc and the second one optimized into
    sbb instruction, two for the signed int case).  So e.g. for signed int we
    now emit

        xorl    %edx, %edx
        xorl    %eax, %eax
        cmpl    %esi, %edi
        setl    %dl
        setg    %al
        subl    %edx, %eax
        ret

    and for unsigned

        xorl    %eax, %eax
        cmpl    %esi, %edi
        seta    %al
        sbbb    $0, %al
        ret

    Note, I wonder if other targets wouldn't benefit from defining the named
    optab too...
    2024-10-07  Jakub Jelinek

            PR middle-end/116896
            * optabs.def (spaceship_optab): Use spaceship$a4 rather than
            spaceship$a3.
            * internal-fn.cc (expand_SPACESHIP): Expect 3 call arguments
            rather than 2, expand the last one, expect 4 operands of
            spaceship_optab.
            * tree-ssa-math-opts.cc: Include cfghooks.h.
[Bug libstdc++/115585] [12 Regression] --disable-libstdcxx-verbose causes undefined symbol: _ZSt21__glibcxx_assert_failPKciS0_S0_, version GLIBCXX_3.4.30
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115585 --- Comment #15 from GCC Commits --- The releases/gcc-12 branch has been updated by Jonathan Wakely : https://gcc.gnu.org/g:c4d2f51741bbb1771219fbeaaf812fa73c36fc0f commit r12-10747-gc4d2f51741bbb1771219fbeaaf812fa73c36fc0f Author: Jonathan Wakely Date: Fri Jun 28 15:14:15 2024 +0100 libstdc++: Define __glibcxx_assert_fail for non-verbose build [PR115585] When the library is configured with --disable-libstdcxx-verbose the assertions just abort instead of calling __glibcxx_assert_fail, and so I didn't export that function for the non-verbose build. However, that option is documented to not change the library ABI, so we still need to export the symbol from the library. It could be needed by programs compiled against the headers from a verbose build. The non-verbose definition can just call abort so that it doesn't pull in I/O symbols, which are unwanted in a non-verbose build. libstdc++-v3/ChangeLog: PR libstdc++/115585 * src/c++11/assert_fail.cc (__glibcxx_assert_fail): Add definition for non-verbose builds. (cherry picked from commit 52370c839edd04df86d3ff2b71fcdca0c7376a7f)
[Bug libstdc++/116641] [12 Regression] std::string move assignment incorrectly depends on POCCA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116641 --- Comment #4 from GCC Commits --- The releases/gcc-12 branch has been updated by Jonathan Wakely : https://gcc.gnu.org/g:2ab55da5eba0aa7a92e15d8100d51cc977f9aca4 commit r12-10748-g2ab55da5eba0aa7a92e15d8100d51cc977f9aca4 Author: Jonathan Wakely Date: Tue Sep 10 14:25:41 2024 +0100 libstdc++: std::string move assignment should not use POCCA trait [PR116641] The changes to implement LWG 2579 (r10-327-gdb33efde17932f) made std::string::assign use the propagate_on_container_copy_assignment (POCCA) trait, for consistency with operator=(const basic_string&). However, this also unintentionally affected operator=(basic_string&&) which calls assign(str) to make a deep copy when performing a move is not possible. The fix is for the move assignment operator to call _M_assign(str) instead of assign(str), as this just does the deep copy and doesn't check the POCCA trait first. The bug only affects the unlikely/useless combination of POCCA==true and POCMA==false, but we should fix it for correctness anyway. it should also make move assignment slightly cheaper to compile and execute, because we skip the extra code in assign(const basic_string&). libstdc++-v3/ChangeLog: PR libstdc++/116641 * include/bits/basic_string.h (operator=(basic_string&&)): Call _M_assign instead of assign. * testsuite/21_strings/basic_string/allocator/116641.cc: New test. (cherry picked from commit c07cf418fdde0c192e370a8d76a991cc7215e9c4)
[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399 --- Comment #9 from GCC Commits --- The releases/gcc-12 branch has been updated by Jonathan Wakely : https://gcc.gnu.org/g:1f655ef43621cc022745c3aa9c77e3725b9280cd commit r12-10753-g1f655ef43621cc022745c3aa9c77e3725b9280cd Author: Jonathan Wakely Date: Mon Jun 10 14:08:16 2024 +0100 libstdc++: Fix std::tr2::dynamic_bitset shift operations [PR115399] The shift operations for dynamic_bitset fail to zero out words where the non-zero bits were shifted to a completely different word. For a right shift we don't need to sanitize the unused bits in the high word, because we know they were already clear and a right shift doesn't change that. libstdc++-v3/ChangeLog: PR libstdc++/115399 * include/tr2/dynamic_bitset (operator>>=): Remove redundant call to _M_do_sanitize. * include/tr2/dynamic_bitset.tcc (_M_do_left_shift): Zero out low bits in words that should no longer be populated. (_M_do_right_shift): Likewise for high bits. * testsuite/tr2/dynamic_bitset/pr115399.cc: New test. (cherry picked from commit bd3a312728fbf8c35a09239b9180269f938f872e)
[Bug libstdc++/116641] [12 Regression] std::string move assignment incorrectly depends on POCCA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116641 Jonathan Wakely changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Jonathan Wakely --- Fixed for 12.5, 13.4, 14.3
[Bug libstdc++/115585] [12 Regression] --disable-libstdcxx-verbose causes undefined symbol: _ZSt21__glibcxx_assert_failPKciS0_S0_, version GLIBCXX_3.4.30
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115585 Jonathan Wakely changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #16 from Jonathan Wakely --- Fixed for 12.5, 13.4 and 14.2
[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399 --- Comment #10 from Jonathan Wakely --- And 13.4 and 12.5
[Bug libstdc++/115399] std::tr2::dynamic_bitset shift behaves differently from std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115399 Jonathan Wakely changed: What|Removed |Added Target Milestone|13.4|12.5
[Bug tree-optimization/116982] [14/15 Regregression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982 --- Comment #3 from Richard Biener --- The issue is likely that if-conversion produced a vector loop copy with a different number of exit edges than the original not if-converted version because of if-conversion doing simple DCE/FRE/DSE but the reporter disabling all those opts.
[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979 --- Comment #6 from vincenzo Innocente --- I'm just taking the product of two complex numbers, cannot call std::fma in the user code: reimplementing the operator* is not trivial (and is a stdlib job anyhow)
[Bug libstdc++/116991] FAIL: 26_numerics/complex/ext_c++23.cc -std=gnu++23 (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116991 Jonathan Wakely changed: What|Removed |Added Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org Last reconfirmed||2024-10-07 Target Milestone|--- |15.0 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Jonathan Wakely ---
The warning is for the initialization from a long double literal in this
function template:

  template<typename _Tp>
    inline std::complex<_Tp>
    __complex_acos(const std::complex<_Tp>& __z)
    {
      const std::complex<_Tp> __t = std::asin(__z);
      const _Tp __pi_2 = 1.5707963267948966192313216916397514L;
      return std::complex<_Tp>(__pi_2 - __t.real(), -__t.imag());
    }

This function template isn't used for _Float32, _Float64 etc. on other
targets, because they define _GLIBCXX_USE_C99_COMPLEX_ARC and so have
overloads for each type, like:

  inline __complex__ _Float32
  __complex_acos(__complex__ _Float32 __z)
  { return __builtin_cacosf(__z); }

We can just use a cast in the generic __complex_acos.
[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990
--- Comment #4 from Richard Biener ---
Hum.  The following should have prevented that, but ...

  /* Check if we have any control flow that doesn't leave the loop.  */
  class loop *v_loop = loop->inner ? loop->inner : loop;

... not sure why we only look at the inner loop body?!

  basic_block *bbs = get_loop_body (v_loop);
  for (unsigned i = 0; i < v_loop->num_nodes; i++)
    if (EDGE_COUNT (bbs[i]->succs) != 1
        && (EDGE_COUNT (bbs[i]->succs) != 2
            || !loop_exits_from_bb_p (bbs[i]->loop_father, bbs[i])))
      {
        free (bbs);
        return opt_result::failure_at (vect_location,
                                       "not vectorized:"
                                       " unsupported control flow in loop.\n");
      }
[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990 Richard Biener changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org --- Comment #3 from Richard Biener --- Mine.
[Bug libstdc++/116992] FAIL: 30_threads/semaphore/platform_try_acquire_for.cc -std=gnu++20 (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116992 Jonathan Wakely changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-10-07 Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org Target Milestone|--- |15.0 --- Comment #1 from Jonathan Wakely --- This test uses -D_GLIBCXX_USE_POSIX_SEMAPHORE to force the use of POSIX sem_t as the base class for std::counting_semaphore, but that doesn't work on targets without sem_t.
[Bug middle-end/116997] New: [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116997 Bug ID: 116997 Summary: [13/14/15 Regression] Wrong bitfield accesses since r13-3219-g25413fdb2ac249 Product: gcc Version: unknown Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: stefansf at gcc dot gnu.org Target Milestone: --- Target: s390x-*-*

struct S0 {
  unsigned f0;
  signed f2 : 11;
  signed : 6;
} GlobS, *Ptr = &GlobS;

const struct S0 Initializer = {7, 3};

int main (void)
{
  for (unsigned i = 0; i <= 2; i++)
    *Ptr = Initializer;
  if (GlobS.f2 != 3)
    __builtin_abort ();
  return 0;
}

gcc -march=z13 -O2 t.c (should fail for any arch which supports vector
extensions)

During ifcvt we have

Start lowering bitfields
Lowering:
        Ptr.0_1->f2 = 3;
to:
        _ifc__24 = Ptr.0_1->D.2918;
        _ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;
        Ptr.0_1->D.2918 = _ifc__25;
Done lowering bitfields

...

Match-and-simplified BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> to 3
RHS BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> simplified to 3
Setting value number of _ifc__25 to 3 (changed)
Replaced BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)> with 3 in all uses of
_ifc__25 = BIT_INSERT_EXPR <_ifc__24, 3, 0 (11 bits)>;
Value numbering stmt = Ptr.0_1->D.2918 = _ifc__25;

which in the end leads to the optimized tree output

int main ()
{
  struct S0 * Ptr.0_1;
  unsigned int _2;
  unsigned int _3;

  [local count: 268435458]:
  Ptr.0_1 = Ptr;
  MEM [(void *)Ptr.0_1] = { 7, 3 };
  _2 = BIT_FIELD_REF ;
  _3 = _2 & 4292870144;
  if (_3 != 6291456)
    goto ; [0.00%]
  else
    goto ; [100.00%]

  [count: 0]:
  __builtin_abort ();

  [local count: 268435456]:
  return 0;
}

Since bitfields are left aligned on s390, constant 3 is wrong and should
rather be 0x60.
[Bug target/55212] [SH] Switch to LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55212 --- Comment #384 from Kazumoto Kojima --- Created attachment 59289 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59289&action=edit a reduced test case for c#378 (with -O2 -fpic)
[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933
--- Comment #16 from qinzhao at gcc dot gnu.org ---
(In reply to Eric Botcazou from comment #12)
> > We added one more argument for __builtin_clear_padding to distinguish
> > whether this call is for AUTO_INIT or not.
> >
> > > diff --git a/gcc/tree.cc b/gcc/tree.cc
> > > index bc50afca9a3..095c02c5474 100644
> > > --- a/gcc/tree.cc
> > > +++ b/gcc/tree.cc
> > > @@ -9848,7 +9848,6 @@ build_common_builtin_nodes (void)
> > >    ftype = build_function_type_list (void_type_node,
> > >                                      ptr_type_node,
> > >                                      ptr_type_node,
> > > -                                    integer_type_node,
> >
> > This integer_type_node is for the new argument.
>
> See this assertion in gimple_fold_builtin_clear_padding:
>
>   gcc_assert (gimple_call_num_args (stmt) == 2);

Okay, by searching the history, it looks like the following patch forgot to
update the above routine when merging the 2nd and 3rd parameters of
__builtin_clear_padding:

From b56ad95854f0b007afda60c057f10b04666953c9 Mon Sep 17 00:00:00 2001
From: Jakub Jelinek
Date: Fri, 11 Feb 2022 19:47:14 +0100
Subject: [PATCH] middle-end: Small __builtin_clear_padding improvements

When looking at __builtin_clear_padding today, I've noticed that it is quite
wasteful to extend the original user one argument to 3, 2 is enough.  We need
to encode the original type of the first argument because pointer conversions
are useless in GIMPLE, and we need to record a boolean whether it is for
-ftrivial-auto-var-init=* or not.  But for recording the type we don't need
the value (we've always used zero) and for recording the boolean we don't
need the type (we've always used integer_type_node).  So, this patch merges
the two into one.

2022-02-11  Jakub Jelinek

        * tree.cc (build_common_builtin_nodes): Fix up formatting in
        __builtin_clear_padding decl creation.
        * gimplify.cc (gimple_add_padding_init_for_auto_var): Encode
        for_auto_init in the value of 2nd BUILT_IN_CLEAR_PADDING
        argument rather than in 3rd argument.
        (gimplify_call_expr): Likewise.
Fix up comment formatting. * gimple-fold.cc (gimple_fold_builtin_clear_padding): Expect 2 arguments instead of 3, take for_auto_init from the value of 2nd argument.
[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001
--- Comment #2 from Andrew Pinski ---
Works for me on the trunk:

[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++ -static t.cc
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++ -static t.cc -O3
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++ -static t.cc -O3 -march=armv8.2-a+sve
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[apinski@xeond2 upstream-cross-aarch64]$ ./install/bin/aarch64-linux-gnu-g++ -static t.cc -O3 -march=armv8.2-a+sve -fno-vect-cost-model
[apinski@xeond2 upstream-cross-aarch64]$ ./install-qemu/bin/qemu-aarch64 a.out
[0, 0, 0, 1, 0, 1, 1, 0]
[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001
--- Comment #5 from Robert Hardwick ---
Not working on 11.4.0, I'll try 11.5.0 as you suggest.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

g++ -O3 -march=armv8.2-a+sve test.cpp -o test
$:~/tools/pytorch/pytorch/argmin_test$ ./test
[0, 0, 0, 1, 0, 0, 1, 0]

--- Yes, apologies, I forgot the #include lines in the reproducible example
code.
[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 --- Comment #3 from Andrew Pinski --- Note I needed to add the following 2 includes to get the testcase to compile: ``` #include #include ```
[Bug tree-optimization/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 Andrew Pinski changed: What|Removed |Added Status|UNCONFIRMED |WAITING Last reconfirmed||2024-10-07 Ever confirmed|0 |1 --- Comment #4 from Andrew Pinski --- Since 11.5.0 was the last in the 11.x release, can you try out 11.5.0 or even better GCC 14.2.0?
[Bug middle-end/117000] Inefficient code for 32-byte struct comparison (ptest missing)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000 --- Comment #1 from Andrew Pinski --- >In GCC 14+ the compilation converges to test1 also in test2. So what is happening in GCC 13 is SLP vectorizer is not able to vectorizer test2 but GCC 14 is. The loop vectorizer is able to handle test1 in both.
[Bug sanitizer/116984] -fsanitize=bounds triggers within __builtin_dynamic_object_size()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116984 qinzhao at gcc dot gnu.org changed: What|Removed |Added CC||qinzhao at gcc dot gnu.org --- Comment #10 from qinzhao at gcc dot gnu.org --- (In reply to Kees Cook from comment #4) > (In reply to Andrew Pinski from comment #1) > > I don't think so since &p->array[negative] is undefined behavior even inside > > a dynamic boz. > > Without counted_by, that is true. With counted_by all out of bounds > calculations are defined to result in a 0 bdos. A negative "counted_by" value is treated as zero, so the corresponding size of the FAM is zero. However, the "counted_by" value should NOT affect the array index itself; for &p->array[negative] the index is negative, so it's reasonable for the sanitizer to report the error. So, from my understanding, the behavior of the test case is correct.
[Bug target/117000] Inefficient code for 32-byte struct comparison (ptest missing)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000 Andrew Pinski changed: What|Removed |Added Component|middle-end |target --- Comment #2 from Andrew Pinski --- Otherwise it is a vector cost model issue.
[Bug tree-optimization/116990] [15 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990 --- Comment #5 from GCC Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:b0b71618157ddac52266909978f331406f98f3a2 commit r15-4108-gb0b71618157ddac52266909978f331406f98f3a2 Author: Richard Biener Date: Mon Oct 7 11:24:12 2024 +0200 tree-optimization/116990 - missed control flow check in vect_analyze_loop_form The following fixes checking for unsupported control flow in vectorization to also cover the outer loop body. PR tree-optimization/116990 * tree-vect-loop.cc (vect_analyze_loop_form): Check the current loop body for control flow.
[Bug tree-optimization/116990] [14 Regression] ICE on valid code at -O3 "-fno-tree-ccp -fno-tree-loop-im -fno-tree-dse" on x86_64-linux-gnu: in single_pred_edge, at basic-block.h:342
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116990

Richard Biener changed:

           What        |Removed                |Added
   Target Milestone    |15.0                   |14.3
      Known to work    |                       |15.0
            Summary    |[15 Regression] ICE on |[14 Regression] ICE on
                       |valid code at -O3      |valid code at -O3
                       |"-fno-tree-ccp         |"-fno-tree-ccp
                       |-fno-tree-loop-im      |-fno-tree-loop-im
                       |-fno-tree-dse" on      |-fno-tree-dse" on
                       |x86_64-linux-gnu: in   |x86_64-linux-gnu: in
                       |single_pred_edge, at   |single_pred_edge, at
                       |basic-block.h:342      |basic-block.h:342
           Priority    |P3                     |P2

--- Comment #6 from Richard Biener ---
Fixed on trunk, queued for backporting.
[Bug tree-optimization/116982] [14/15 Regregression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoist
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982 --- Comment #4 from GCC Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:9b86efd5210101954bd187c3aa8bb909610a5746 commit r15-4107-g9b86efd5210101954bd187c3aa8bb909610a5746 Author: Richard Biener Date: Mon Oct 7 11:05:17 2024 +0200 tree-optimization/116982 - analyze scalar loop exit early The following makes sure to discover the scalar loop IV exit during analysis as failure to do so (if DCE and friends are disabled this can happen due to if-conversion doing DCE and FRE on the if-converted loop) would ICE later. I refrained from larger refactoring to be able to eventually backport. PR tree-optimization/116982 * tree-vectorizer.h (vect_analyze_loop): Pass in .LOOP_VECTORIZED call. (vect_analyze_loop_form): Likewise. * tree-vect-loop.cc (vect_analyze_loop_form): Reject loops where we cannot determine a IV exit for the scalar loop. (vect_analyze_loop): Adjust. * tree-vectorizer.cc (try_vectorize_loop_1): Likewise. * tree-parloops.cc (gather_scalar_reductions): Likewise.
[Bug tree-optimization/116982] [14 Regregression] ICE on valid code at -O3 with "-fno-tree-dce -fno-tree-dominator-opts -fno-tree-pre -fno-tree-dse -fno-tree-copy-prop -fno-tree-fre -fno-code-hoisting
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116982

Richard Biener changed:

           What        |Removed                    |Added
      Known to work    |                           |15.0
            Summary    |[14/15 Regregression] ICE  |[14 Regregression] ICE on
                       |on valid code at -O3 with  |valid code at -O3 with
                       |"-fno-tree-dce             |"-fno-tree-dce
                       |-fno-tree-dominator-opts   |-fno-tree-dominator-opts
                       |-fno-tree-pre -fno-tree-dse|-fno-tree-pre -fno-tree-dse
                       |-fno-tree-copy-prop        |-fno-tree-copy-prop
                       |-fno-tree-fre              |-fno-tree-fre
                       |-fno-code-hoisting" on     |-fno-code-hoisting" on
                       |x86_64-linux-gnu:          |x86_64-linux-gnu:
                       |Segmentation fault         |Segmentation fault
           Keywords    |needs-bisection            |

--- Comment #5 from Richard Biener ---
Fixed on trunk, queued for backporting.
[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933
--- Comment #18 from Qing Zhao ---
> On Oct 7, 2024, at 11:34, ebotcazou at gcc dot gnu.org wrote:
> I see, thanks for investigation! This was overlooked because the C family of
> compiler do not use the declaration built in common_builtin_nodes, but
> rather that derived from builtins.def:
>
> DEF_GCC_BUILTIN(BUILT_IN_CLEAR_PADDING, "clear_padding",
>                 BT_FN_VOID_VAR, ATTR_NOTHROW_NONNULL_TYPEGENERIC_LEAF)
>
> which accepts any number of arguments.

Oh, I see. That's the reason this issue was just exposed now. Thank you for
fixing this issue.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #16 from Richard Sandiford --- Hopefully fixed.
[Bug tree-optimization/116578] vectorizer SLP transition issues / dependences
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116578 Bug 116578 depends on bug 116583, which changed state. Bug 116583 Summary: vectorizable_slp_permutation cannot handle even/odd extract from VLA vector https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug middle-end/116933] various issues of -ftrivial-auto-var-init=zero with Ada
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116933 --- Comment #17 from Eric Botcazou --- > Okay, by searching the history, looks like that the following patch forget > to update the above routine when merging the 2nd and 3rd parameters for > __builtin_clear_padding: I see, thanks for investigation! This was overlooked because the C family of compiler do not use the declaration built in common_builtin_nodes, but rather that derived from builtins.def: DEF_GCC_BUILTIN(BUILT_IN_CLEAR_PADDING, "clear_padding", BT_FN_VOID_VAR, ATTR_NOTHROW_NONNULL_TYPEGENERIC_LEAF) which accepts any number of arguments.
[Bug middle-end/116983] counted_by not used to identify singleton pointers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983 qinzhao at gcc dot gnu.org changed: What|Removed |Added CC||qinzhao at gcc dot gnu.org --- Comment #1 from qinzhao at gcc dot gnu.org --- (In reply to Kees Cook from comment #0) > When counted_by is present in a structure, it means that the object must be > a singleton. > > For example: > > struct counted { > int counter; > int array[] __attribute__((counted_by(counter))); > }; > > struct notcounted { > int counter; > int array[]; > }; > > void __attribute__((noinline)) emit_length(size_t length) > { > printf("%zu\n", length); > } > > // This correctly cannot know size of p object, and returns SIZE_MAX > void objsize_notcounted(struct notcounted *p) > { > emit_length(__builtin_dynamic_object_size(p, 1)); > } > > // This must be operating on a singleton, therefore the > // return must be: > // max(sizeof(*p), > // sizeof(*p) + offsetof(typeof(*p), array) * p->counter) > void objsize_counted(struct counted *p) > { > emit_length(__builtin_dynamic_object_size(p, 1)); > } could you explicitly explain what's wrong in the current implementation?
[Bug rtl-optimization/117000] New: Inefficient code for 32-byte struct comparison (ptest missing)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000 Bug ID: 117000 Summary: Inefficient code for 32-byte struct comparison (ptest missing) Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: chfast at gmail dot com Target Milestone: --- I was investigating why in GCC 13.3 the functions test1 and test2 produce different x86 assembly. They only differ by the placement of the int -> U256 user defined conversion. This led to the discovery that the generated x86-64-v2 code for all the examples is not very efficient. E.g. for some reason a shift instruction is used (psrldq). In GCC 14+ the compilation converges to test1 also in test2. https://godbolt.org/z/r1vfcPone

using uint64_t = unsigned long;

struct U256 {
    uint64_t words_[4]{};
    U256(uint64_t v) : words_{v} {}
};

bool eq(const U256& x, const U256& y) {
    uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}

bool eqi(const U256& x, uint64_t y) { return eq(x, U256(y)); }

auto test1(const U256& x) { return eqi(x, uint64_t(0)); }
bool test2(const U256& x) { return eq(x, U256(0)); }

test1(U256 const&):
        movdqu  xmm1, XMMWORD PTR [rdi+16]
        movdqu  xmm0, XMMWORD PTR [rdi]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 8
        por     xmm0, xmm1
        movq    rax, xmm0
        test    rax, rax
        sete    al
        ret
test2(U256 const&):
        mov     rax, QWORD PTR [rdi]
        or      rax, QWORD PTR [rdi+8]
        or      rax, QWORD PTR [rdi+16]
        or      rax, QWORD PTR [rdi+24]
        sete    al
        ret
[Bug c++/117001] New: O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 Bug ID: 117001 Summary: O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: Robert.Hardwick at arm dot com Target Milestone: --- We have seen some incorrect numbers being produced when O3 is enabled on Arm Neoverse V1 ( armv8.2-a+sve ). I have reduced the problem down to a small reproducer and identified that adding -fno-tree-loop-vectorize to gcc options will produce the correct output. It seems to happen when we have a C style array contained within a std::array structure and it occurs when auto loop vectorization is enabled. This has been observed on 10.2.1 and 11.4.1 Reproducible example

#include <array>
#include <iostream>

typedef std::array<int[4], 2> my_type;

// helpful to print output to stdout
std::ostream& operator<<(std::ostream& stream, const my_type& vec)
{
    stream << "[";
    for (int j = 0; j < 2; j++) {
        for (int i = 0; i != 4; i++) {
            if (i != 0 || j != 0) {
                stream << ", ";
            }
            stream << vec[j][i];
        }
    }
    stream << "]";
    return stream;
}

int main()
{
    my_type a = {{0, 0, 0, 1, 0, 0, 1, 0}};
    my_type b = {{1, 1, 1, 1, 1, 1, 1, 1}};
    my_type mask = {{0, 0, 0, 0, 0, 1, 0, 0}};
    my_type result = {{0, 0, 0, 0, 0, 0, 0, 0}};

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 4; j++) {
            if ( mask[i][j] != 0 ) {
                result[i][j] = b[i][j];
            } else {
                result[i][j] = a[i][j];
            }
        }
    }
    std::cout << result << std::endl;
}

Observations With -O3 -fno-tree-loop-vectorize -march=armv8.2-a+sve output is INCORRECT [0, 0, 0, 1, 0, 0, 1, 0] with -O3 -march=armv8.2-a+sve output is CORRECT [0, 0, 0, 1, 0, 1, 1, 0] The operation should be doing the equivalent of result[i] = mask[i] ? b[i] : a[i] So the 6th element ( at i=1, j=1 ) should be 1, not 0.
[Bug c++/117001] O3 auto tree loop vectorization produces incorrect output on armv8.2-a+sve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 --- Comment #1 from Robert Hardwick --- Apologies, I've got that the wrong way around. With -O3 -fno-tree-loop-vectorize -march=armv8.2-a+sve output is CORRECT [0, 0, 0, 1, 0, 1, 1, 0] with -O3 -march=armv8.2-a+sve output is INCORRECT [0, 0, 0, 1, 0, 0, 1, 0]
[Bug sanitizer/116984] -fsanitize=bounds triggers within __builtin_dynamic_object_size()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116984 Jakub Jelinek changed: What|Removed |Added Resolution|--- |INVALID Status|UNCONFIRMED |RESOLVED --- Comment #9 from Jakub Jelinek --- (In reply to Kees Cook from comment #8) > (In reply to Jakub Jelinek from comment #6) > > counted_by is just another way how to get the initial whole > > object dynamic size (similarly to fixed size automatic/static vars, malloc > > etc., alloca, VLA definitions, whatever else provides the size of the whole > > object). > > I don't understand why the word "initial" is used there. It provides the > _ongoing_ runtime bounds of the given array. Both the bounds sanitizer and > __bdos were extended to make use of that information. It is initial in the workflow of the object size pass, which has some __builtin_object_size/__builtin_dynamic_object_size calls (explicit in the IL or implicit e.g. from sanitization) and tracks object sizes through pointer arithmetics and PHIs back to something which has a known object (or subobject) size. In your testcase, p->array has that known size because of counted_by attribute, in other cases it could be a pointer initialized from malloc or similar calls, in other cases it could be an object with non-dynamic size. > > The rest is __builtin_dynamic_object_size dynamic tracking from that size > > through pointer arithmetics etc. And that doesn't change depending on what > > the whole size has been computed with. > > Part of the bounds sanitizer+__bdos work was to make sure that getting the > size of invalidly indexed array element is 0 (and _not_ "don't know", since > we *do* know: there is no element at an invalid location, therefore the size > available at such an "address" is 0 bytes). This is so that the various You can't get anything safely after invoking UB, there is nothing safe after that, anything can happen. 
The side-effects imply don't know special case was done for __builtin_object_size (and later for __builtin_dynamic_object_size just inherited it too) so that one could use the builtin e.g. in macros and don't risk side-effects being evaluated multiple times. So it is for cases like __builtin_object_size (&a[i++], 0) where we just say the function will return "don't know" and will not evaluate the expression. If there aren't side-effects in the C/C++ meaning (that includes e.g. reading volatile vars, calling impure functions etc.), the expression in the __bos/__bdos argument is evaluated at runtime and if there is UB in it, the program is still invalid, and the compiler really can't guarantee anything about that. Consider if you have int *pindex; __builtin_dynamic_object_size (&a[*pindex], 0); If the pindex is NULL or otherwise invalid pointer, the program will invoke UB and certainly doesn't guarantee returning 0. It can be diagnosed by UBSan (e.g. the NULL case), it might be diagnosed by ASan (invalid pointer, some cases of it), it might not be diagnosed at all and just crash. Similarly, if you have int idx1, idx2; __builtin_dynamic_object_size (&a[idx1 + idx2], 0); and idx1 + idx2 evaluation invokes signed integer overflow, again it will be UB and anything can happen. And the negative index on array ref is yet another UB. If __bos/__bdos doesn't have the side-effects which cause immediate folding of the builtin with its arguments to the "don't know" value, then the expressions are simply lowered to normal IL like anything else and can be instrumented for UB like anything else, I don't think LLVM has any other representation for it where you could safely invoke any kind of UB you'd like and all it would do is change the return value of the __bos/__bdos call to 0.
[Bug tree-optimization/116974] omp inscan reduction not supported with SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116974 --- Comment #2 from Richard Biener --- One issue is that with SLP scheduling we're relying on data dependence to order vector stmts in the correct order. With omp scan we have scalar code like _12 = .GOMP_SIMD_LANE (simduid.2_6(D), 0); _13 = .GOMP_SIMD_LANE (simduid.2_6(D), 1); D.2789[_13] = 0; _15 = (long unsigned int) i_42; _16 = _15 * 4; _18 = a_17(D) + _16; _19 = *_18; r.0_20 = D.2789[_12]; _21 = _19 + r.0_20; D.2789[_12] = _21; _23 = .GOMP_SIMD_LANE (simduid.2_6(D), 2); _24 = D.2790[_23]; _25 = D.2789[_23]; _26 = _24 + _25; D.2790[_23] = _26; D.2789[_23] = _26; _30 = b_29(D) + _16; r.0_31 = D.2789[_12]; *_30 = r.0_31; where vectorization of the in-scan reduction is currently performed by vectorizable_scan_store on the D.2790[_23] = _26 store. But vector stmt order with respect to the other "SLP instance" defining D.2789 for non-SLP simply relies on us emitting vector stmts where scalar stmts are but with SLP this only works because in the end we're using the first scalar stmt as point to emit. I think it would be preferable iff the temporaries would be elided as SSA names and thus not appear as loads/stores. I'm not sure whether this whole inscan / scan stuff would be necessary if we'd support vectorizing reductions that are used inside of the loop. Like by forcing them to be in-order and constructing the vector of reduction values in each iteration. That we key off the reduction code-gen from the store and not the add isn't helpful. Very likely a cleaner solution would at least make the scan-reduction visible during SLP discovery so we can key code generation off that stmt. We'd want the reduction operands (_24 and _25 above) as well as that initialization value (0 from the D.2789[_13] = 0 store) as children. I do have a hackish patch to make most cases work with the current scheme though.
[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729 --- Comment #20 from Vineet Gupta --- The model schedule change (at tweak9) seems stable and showing very promising result. The hottest basic block's reg pressure drops down significantly ;; Pressure summary (bb 206): GR_REGS:313 FP_REGS:946 ;; Pressure summary (bb 221): GR_REGS:312 FP_REGS:946 vs. ;; Pressure summary (bb 206): GR_REGS:269 FP_REGS:285 ;; Pressure summary (bb 221): GR_REGS:268 FP_REGS:285 riscv qemu icounts 2,127,546,200,703 # fix2 tweak9 (~16% improv) 2,544,112,250,412 # baseline aarch64 qemu icount 1,240,904,969,590 # fix2 tweak9 (~10% improv) 1,371,320,697,809 # baseline
[Bug target/80881] Implement Windows native TLS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 --- Comment #26 from LIU Hao --- Comment on attachment 59290 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59290 Newer patch for TLS support, incomplete > + "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR > [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR > gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}" For i686 this would be (untested): ``` "mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44, %1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}" ``` i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS), and addresses of global symbols are absolute (instead of being RIP-relative).
[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729 --- Comment #21 from Vineet Gupta --- The code is currently pushed to https://github.com/vineetgarc/gcc/commits/topic-sched1/
[Bug target/117008] -march=native pessimization of 25% with bitset popcount
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 --- Comment #4 from Matt Bentley --- Yeah, I know, I mentioned that in the report. It's not a bad benchmark, it's benchmarking access of individual consecutive bits, not summing. The counting is merely for preventing the compiler from optimizing out the loop. I could equally make it benchmark random indices and I imagine the problem would remain, though I haven't checked. Still, your point is valid in that most non-benchmark code would likely have more code around the access. Could potentially lead to misleading benchmark results in other scenarios though. I haven't tested whether vector/array indexing triggers the same bad vectorisation.
[Bug target/117008] -march=native pessimization of 25% with bitset popcount
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 --- Comment #5 from Matt Bentley --- (In reply to Andrew Pinski from comment #1) > Can you provide the output of invoking g++ with -march=native and when > compiling? The .ii files were identical, so did you mean .o files?
[Bug target/117006] New: [15 regression] GCC trunk generates larger code than GCC 14 at -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117006 Bug ID: 117006 Summary: [15 regression] GCC trunk generates larger code than GCC 14 at -Os Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: dccitaliano at gmail dot com Target Milestone: --- Similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116994 https://godbolt.org/z/64bxGvnrh long patatino() { long x = 0; for (int i = 0; i < 5; ++i) { while (x < 10) { if ((x % 2 == 0 && x % 3 != 0) || (x % 5 == 0 && x > 5)) { if (x > 7) { x += 4; } else { x += 2; } } else { x += 1; } while (x % 4 == 0) { x += 3; } } } return x; } In particular, trunk generates .L4: lea rax, [rcx+4] lea rdx, [rcx+2] cmp rcx, 8 cmovl rax, rdx mov rcx, rax jmp .L7 instead of: .L4: cmp rcx, 7 jle .L6 add rcx, 4 jmp .L7 .L6: add rcx, 2 jmp .L7
[Bug target/117007] New: Poor optimization for small vector constants needed for vector shift/rotate/mask generation.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007 Bug ID: 117007 Summary: Poor optimization for small vector constants needed for vector shift/rotate/mask generation. Product: gcc Version: 13.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: munroesj at gcc dot gnu.org Target Milestone: --- Created attachment 59291 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit compile with -m64 -O3 -mcpu=power8 or power9 For vector library code there is frequent need to "splat" small integer constants needed for vector shifts, rotates, and mask generation. The instructions exist (i.e. vspltisw, xxspltib, xxspltiw), supported by intrinsics. But when these are used to provide constants in VRs for other vector operations, the compiler goes out of its way to convert them to vector loads from .rodata. This is especially bad for power8/9 as .rodata requires 32-bit offsets and always generates 3/4 instructions with a best case (L1 cache hit) latency of 9 cycles. The original splat immediate / shift implementation will run 2-4 instructions (with a good chance for CSE) and 4-6 cycles latency. For example:

vui32_t mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

With GCC V6 generates:

01c0 <mask_sig_v2>:
 1c0: 8c 03 09 10  vspltisw v0,9
 1c4: 8c 03 5f 10  vspltisw v2,-1
 1c8: 84 02 42 10  vsrw     v2,v2,v0
 1cc: 20 00 80 4e  blr

While with GCC 13.2.1 generates:

01c0 <mask_sig_v2>:
 1c0: 00 00 4c 3c  addis    r2,r12,0
   1c0: R_PPC64_REL16_HA .TOC.
 1c4: 00 00 42 38  addi     r2,r2,0
   1c4: R_PPC64_REL16_LO .TOC.+0x4
 1c8: 00 00 22 3d  addis    r9,r2,0
   1c8: R_PPC64_TOC16_HA .rodata.cst16+0x20
 1cc: 00 00 29 39  addi     r9,r9,0
   1cc: R_PPC64_TOC16_LO .rodata.cst16+0x20
 1d0: ce 48 40 7c  lvx      v2,0,r9
 1d4: 20 00 80 4e  blr

This is the same for -mcpu=power8/power9; it gets worse for vector functions that require multiple shift/mask constants.
For example:

// Extract the float sig
vui32_t test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;
  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
                    (vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden-bit into the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}

GCC V6 generated:

0310 <test_extsig_v2>:
 310: 8c 03 bf 11  vspltisw v13,-1
 314: 8c 03 37 10  vspltisw v1,-9
 318: 8c 03 60 11  vspltisw v11,0
 31c: 06 0a 0d 10  vcmpgtub v0,v13,v1
 320: 84 09 00 10  vslw     v0,v0,v1
 324: 8c 03 29 10  vspltisw v1,9
 328: 17 14 80 f1  xxland   vs44,vs32,vs34
 32c: 84 0a 2d 10  vsrw     v1,v13,v1
 330: 86 00 0c 10  vcmpequw v0,v12,v0
 334: 86 58 8c 11  vcmpequw v12,v12,v11
 338: 80 6c a1 11  vsubuwm  v13,v1,v13
 33c: 17 14 41 f0  xxland   vs34,vs33,vs34
 340: 17 65 00 f0  xxlnor   vs32,vs32,vs44
 344: 7f 03 42 f0  xxsel    vs34,vs34,vs32,vs45
 348: 20 00 80 4e  blr

While GCC 13.2.1 -mcpu=power8 generates:

0360 <test_extsig_v2>:
 360: 00 00 4c 3c  addis    r2,r12,0
   360: R_PPC64_REL16_HA .TOC.
 364: 00 00 42 38  addi     r2,r2,0
   364: R_PPC64_REL16_LO .TOC.+0x4
 368: 00 00 02 3d  addis    r8,r2,0
   368: R_PPC64_TOC16_HA .rodata.cst16+0x30
 36c: 00 00 42 3d  addis    r10,r2,0
   36c: R_PPC64_TOC16_HA .rodata.cst16+0x20
 370: 8c 03 a0 11  vspltisw v13,0
 374: 00 00 08 39  addi     r8,r8,0
   374: R_PPC64_TOC16_LO .rodata.cst16+0x30
 378: 00 00 4a 39  addi     r10,r10,0
   378: R_PPC64_TOC16_LO .rodata.cst16+0x20
 37c: 00 00 22 3d  addis    r9,r2,0
   37c: R_PPC64_TOC16_HA .rodata.cst16+0x40
 380: e4 06 4a 79  rldicr   r10,r10,0,59
 384: ce 40 20 7c  lvx      v1,0,r8
 388: 00 00 29 39  addi     r9,r9,0
   388: R_PPC64_TOC16_LO .rodata.cst16+0x40
 38c: 8c 03 17 10  vspltisw v0,-9
 390: 98 56 00 7c  lxvd2x   vs0,0,r10
 394: e4 06 29 79  rldicr   r9,r9,0,59
 398: 98 4e 80 7d  lxvd2x   vs12,0,r9
 39c: 84 01 21 10  vslw     v1,v1,v0
 3a0: 50 02 00 f0  xxswapd  vs0,vs0
 3a4: 17 14 01 f0  xxland   vs32,vs33,vs34
 3a8: 50 62 8c f1  xxswapd  vs12,vs12
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007 Peter Bergner changed: What|Removed |Added CC||bergner at gcc dot gnu.org, ||guojiufu at gcc dot gnu.org, ||linkw at gcc dot gnu.org, ||meissner at gcc dot gnu.org, ||segher at gcc dot gnu.org --- Comment #1 from Peter Bergner --- Jeff, did any of your recent constant patches help with this or is this something different?
[Bug c++/117004] New: Unexpected const variable type with decltype of non-type template parameter of deduced type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004 Bug ID: 117004 Summary: Unexpected const variable type with decltype of non-type template parameter of deduced type Product: gcc Version: 14.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: barry.revzin at gmail dot com Target Milestone: --- This is similar to #99631, but this example deals with scalars and still fails on trunk. Attempted reduction:

#include <concepts>

template <int V>
struct integral_constant {
    static constexpr int value = V;
};

template <auto V>
using value_type = decltype(V);

void f() {
    []<class T>(T) {
        // this fails on gcc (which thinks it's "const int")
        // passes on clang
        static_assert(std::same_as<value_type<T::value>, int>);
        // .. but this DOES pass on gcc???
        static_assert(__is_same(value_type<T::value>, int));
    }(integral_constant<0>());
}

value_type<0> is int, but gcc sometimes thinks for more complicated spellings of 0 that it is const int.
[Bug libstdc++/117005] New: Parallel Mode algorithms need to qualify all calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117005 Bug ID: 117005 Summary: Parallel Mode algorithms need to qualify all calls Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redi at gcc dot gnu.org Target Milestone: --- libstdc++-v3/include/parallel/algo.h is full of unqualified calls like:

template<typename _IIter, typename _Tp>
  inline typename iterator_traits<_IIter>::difference_type
  count(_IIter __begin, _IIter __end, const _Tp& __value)
  {
    return __count_switch(__begin, __end, __value,
                          std::__iterator_category(__begin));
  }

This performs ADL for __count_switch, which can cause the compiler to attempt to complete incomplete classes to find associated namespaces (and it's just slower to do ADL than qualified lookups). Everything should be qualified except calls to std::swap (which were removed by g:fc722a0ea442f0 anyway). This causes at least one testsuite failure, but should fail elsewhere too if we tested the other algos similarly: FAIL: 24_iterators/indirect_callable/projected-adl.cc -std=gnu++20 (test for excess errors) Excess errors: /home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/24_iterators/indirect_callable/projected-adl.cc:23: error: 'Holder::t' has incomplete type
[Bug middle-end/116983] counted_by not used to identify singleton pointers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116983 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #2 from Jakub Jelinek --- I'm not sure we can derive anything from the pointer type (especially but not only because pointer conversions are useless in GIMPLE), like we e.g. try not to derive alignment from it. A lot of code will simply cast pointers to other pointer types, what really matters is which pointer types have been dereferenced. In the FAM case &p->array[0] or similar represents that dereferencing, the pointer then has to point to the FAM object, but mere declaration of pointer type on function arg doesn't mean much, there could be cast in the caller and in the callee too.
[Bug c++/97375] Unexpected top-level const retainment when declaring non-type template paramter with decltype(auto)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97375 Andrew Pinski changed: What|Removed |Added Target Milestone|--- |12.0 Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #3 from Andrew Pinski --- (In reply to Marek Polacek from comment #2) > Fixed on trunk by r12-1224. Maybe we want to backport that to 11 too. GCC 11.5.0 was the final release from the GCC 11 branch so closing as fixed.
[Bug c++/115314] auto template parameter has const qualifier on it even though the original does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115314 Andrew Pinski changed: What|Removed |Added CC||barry.revzin at gmail dot com --- Comment #2 from Andrew Pinski --- *** Bug 117004 has been marked as a duplicate of this bug. ***
[Bug c++/117004] Unexpected const variable type with decltype of non-type template parameter of deduced type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117004 Andrew Pinski changed: What|Removed |Added Resolution|--- |DUPLICATE Status|UNCONFIRMED |RESOLVED --- Comment #2 from Andrew Pinski --- (In reply to Andrew Pinski from comment #1) > I think there is a dup of this one around. Yes PR 115314. *** This bug has been marked as a duplicate of bug 115314 ***
[Bug target/117006] [15 regression] GCC trunk generates larger code than GCC 14 at -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117006 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization --- Comment #1 from Andrew Pinski --- For the reduced testcase:
```
int f(long a, int b)
{
  if (b > 7)
    return a+4;
  return a+2;
}
```
GCC 14 is smaller by 1 byte than GCC 15. But for a slightly different testcase:
```
int h(long);
int f(long a, int b)
{
  if (b > 7)
    a+=4;
  else
    a+=2;
  return h(a);
}
```
The trunk (cmov) is smaller by 1 byte. So this looks like it is just by accident. -Os sometimes is just heuristics, and one or two bytes of difference here in both directions might be a wash overall. So it is hard to tell if this will cause a real issue for -Os.
[Bug target/117008] -march=native pessimization of 25% with bitset popcount
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 Andrew Pinski changed: What|Removed |Added Ever confirmed|1 |0 Blocks||53947 Status|NEW |UNCONFIRMED --- Comment #3 from Andrew Pinski --- Note your `total+=values[index];` loop could be reduced down to just `total += values.count();` and that will be over 10x faster. I am not so sure if this is a useful benchmark either, because count uses popcount directly. Maybe GCC could detect the popcount here but I am not sure. LLVM does a slightly better job at vectorizing the loop but still messes it up. Plus once you add other code around values[index], the vectorizer will no longer kick in so the slow down is only for this bad micro-benchmark. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug target/117008] -march=native pessimization of 25% with bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2024-10-08 Target||x86_64-*-* --- Comment #2 from Andrew Pinski --- Looks like the reduction loop is vectorized and that is causing the slow down. Semi reduced (unincluded) testcase:
```
#include <bitset>

void g(std::bitset<1280> &);

int f()
{
  unsigned int total = 0;
  std::bitset<1280> values;
  g(values);
  for (unsigned int index = 0; index != 1280; ++index)
    total += values[index];
  return total;
}
```
For Linux, you need `-m32 -O2 -mavx2` (-m32 since it uses long and for mingw that is 32bits while for linux it is 64bits and that does not get vectorized).
[Bug target/80881] Implement Windows native TLS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 --- Comment #27 from Julian Waters --- (In reply to LIU Hao from comment #26) > Comment on attachment 59290 [details] > Newer patch for TLS support, incomplete > > > + "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR > > [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR > > gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}" > > For i686 this would be (untested): > > ``` > "mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44, > %1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}" > ``` > > i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS), > and addresses of global symbols are absolute (instead of being RIP-relative). I think I remember clang using __tls_index instead of _tls_index for 32 bit as well, but that's the only difference I remember. On another note, Cygwin doesn't support TLS natively, right? Eric modified the stopgap patch above and he put some definitions in cygming.h, since he expects it to support Cygwin as well, but I vaguely remember you saying something about Cygwin not having the support for this
[Bug target/80881] Implement Windows native TLS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 --- Comment #28 from LIU Hao --- (In reply to Julian Waters from comment #27) > I think I remember clang using __tls_index instead of _tls_index for 32 bit > as well, but that's the only difference I remember. On another note, Cygwin Yes, you are right. Solely for i686, external symbols have to be prefixed by an underscore. > doesn't support TLS natively, right? Eric modified the stopgap patch above > and he put some definitions in cygming.h, since he expects it to support > Cygwin as well, but I vaguely remember you saying something about Cygwin not > having the support for this Correct, because the Cygwin CRT doesn't have a TLS directory. You can use `objdump -h` to print PE headers of a Cygwin executable, and there is no `.tls` section. An application may provide its own TLS directory, but it's not default.
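For context, the three-instruction sequence in the patch walks the per-thread TLS slot array; in pseudocode (offsets taken from the asm above — gs:[88] on x86-64 and fs:[44] on i686 are the ThreadLocalStoragePointer slot of the TEB; not compilable off Windows):

```
// pseudocode for the emitted sequence
index = _tls_index;                       // module's TLS index, set by the loader
slots = TEB->ThreadLocalStoragePointer;   // gs:[0x58] on x86-64, fs:[0x2C] on i686
base  = slots[index];                     // this module's TLS block for this thread
// a thread-local variable lives at base + its offset in the .tls section
```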
[Bug target/117010] New: [nvptx] Incorrect ptx code-gen for C++ code with templates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117010 Bug ID: 117010 Summary: [nvptx] Incorrect ptx code-gen for C++ code with templates Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: prathamesh3492 at gcc dot gnu.org Target Milestone: --- Hi, For the test-case adapted from pr96390.C:

template <int N>
struct V
{
  int version_called;
  V () { version_called = 1; }
};

void foo() { V<0> v; }

int main ()
{
  #pragma omp target
  {
    foo ();
  }
  return 0;
}

Compiling with -O0 -fopenmp -foffload=nvptx-none -foffload=-malias -foffload=-mptx=6.3 results in the following error:

ptxas ./a.xnvptx-none.mkoffload.o, line 45; error : Call to '_ZN1VILi0EEC1Ev' requires call prototype
ptxas ./a.xnvptx-none.mkoffload.o, line 45; error : Unknown symbol '_ZN1VILi0EEC1Ev'
ptxas fatal : Ptx assembly aborted due to errors
nvptx-as: ptxas returned 255 exit status

The reason this happens is that in PTX code-gen we call _ZN1VILi0EEC1Ev from _Z3foov; however, _ZN1VILi0EEC1Ev is not defined anywhere. Instead it contains the following definition:

// BEGIN FUNCTION DECL: _ZN1VILi0EEC2Ev
.func _ZN1VILi0EEC2Ev (.param.u64 %in_ar0);

where _ZN1VILi0EEC1Ev is an (implicit) alias for _ZN1VILi0EEC2Ev in the callgraph. Thanks, Prathamesh
[Bug middle-end/117003] pr104783.c is miscompiled with offloading and results in segmentation fault during host-only execution for -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117003 Thomas Schwinge changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Thomas Schwinge --- Duplicate of PR105001 "If executing with non-nvptx offloading, but nvptx offloading compilation is enabled: FAIL: libgomp.c/pr104783.c execution test", as far as I can tell. *** This bug has been marked as a duplicate of bug 105001 ***
[Bug middle-end/105001] If executing with non-nvptx offloading, but nvptx offloading compilation is enabled: FAIL: libgomp.c/pr104783.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105001 Thomas Schwinge changed: What|Removed |Added CC||prathamesh3492 at gcc dot gnu.org --- Comment #3 from Thomas Schwinge --- *** Bug 117003 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/117000] Inefficient code for 32-byte struct comparison (ptest missing)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000 Richard Biener changed: What|Removed |Added Component|target |tree-optimization Status|UNCONFIRMED |ASSIGNED Version|unknown |13.3.0 Last reconfirmed||2024-10-08 Ever confirmed|0 |1 --- Comment #3 from Richard Biener --- In particular we miss the fact that

  _29 = .REDUC_IOR (vect_folded_10.32_23);
  _12 = _29 == 0;

could be optimized to

  _12 = vect_folded_10.32_23 == {0, 0, ... };

it's probably too late for RTL to realize this. Some pattern in match.pd could handle this, like

(for cmp (eq ne)
 (simplify
  (cmp (IFN_REDUC_IOR @0) integer_zerop)
  (cmp @0 { build_zero_cst (TREE_TYPE (@0)); } )))

results in

_Z5test1RK4U256:
.LFB5:
        .cfi_startproc
        movdqu  (%rdi), %xmm0
        movdqu  16(%rdi), %xmm1
        por     %xmm1, %xmm0
        ptest   %xmm0, %xmm0
        sete    %al
        ret
_Z5test2RK4U256:
.LFB6:
        .cfi_startproc
        movdqu  16(%rdi), %xmm0
        movdqu  (%rdi), %xmm1
        por     %xmm1, %xmm0
        ptest   %xmm0, %xmm0
        sete    %al
        ret