[Bug c++/96535] New: GCC 10 ignoring function __attribute__ optimize for all x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96535

            Bug ID: 96535
           Summary: GCC 10 ignoring function __attribute__ optimize for all x86
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

Hey GCC team!

In GCC 10.x, it seems that any argument to __attribute__((optimize(...))) is ignored at the function level. GCC 9.x and earlier do not have this issue. [Or maybe only -funroll-loops is ignored; not 100% sure.]

Detailed example at: https://gcc.godbolt.org/z/PTK4WE

3 scenarios:

1. [GCC 10.2: -O2 -ffast-math -march=haswell -std=c++2a -fopenmp] + [__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES NOT unroll.
2. [GCC 10.2: -funroll-loops -O2 -ffast-math -march=haswell -std=c++2a -fopenmp] + [__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES unroll.
3. [GCC 9.3: -O2 -ffast-math -march=haswell -std=c++2a -fopenmp] + [__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES unroll.

It seems that in GCC 10.x you have to place -funroll-loops on the command line, and function-level __attribute__s are ignored?

PS: The code in godbolt is a matrix multiplication kernel. It multiplies 1 column * 1 row of a matrix.
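For readers without access to the Godbolt link, here is a minimal sketch of the pattern being reported. It is not the kernel from the link: the function name `dot`, the macro `UNROLL_HINT`, and the loop body are made up for illustration. Only the attribute string is taken from the report. Whether the loop is actually unrolled depends on the GCC version and command line, as described in the three scenarios, so inspect the generated assembly rather than relying on the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the reported kernel. The attribute is only
// applied when compiling with GCC, since other compilers may not accept it.
#if defined(__GNUC__) && !defined(__clang__)
#define UNROLL_HINT __attribute__((optimize("O2", "fast-math", "unroll-loops")))
#else
#define UNROLL_HINT
#endif

// Dot product of one row and one column, each of length n.
UNROLL_HINT
float dot(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];  // candidate loop for unrolling
    }
    return sum;
}
```

Compiling this once with and once without -funroll-loops on the command line, and comparing the assembly for `dot`, is one way to check whether the function-level request was honored.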
[Bug c++/96535] [10/11 Regression] GCC 10 ignoring function __attribute__ optimize for all x86 since r11-1019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96535

--- Comment #3 from Daniel Han-Chen ---
Oh lol, I was just about to add a comment about further experimentation. It seems Jakub and Hongtao have found the root cause of the issue?

Anyway, here is what I was going to write [probably not necessary anymore, so no need to read]:

"""
From more experimentation, it seems that O1, O2 and O3 are not ignored, but unrolling only gets turned on via O3. So if one passes O1 or O2 in __attribute__ but the command line is O3, the function still unrolls.

For example, when the command line is O3, in GCC 9 __attribute__((optimize("O1 / 2"))) causes the code to use VMULPS and VADDPS with an unroll factor of 1. In GCC 10.x, when the command line is O3, VMULPS and VADDPS are still used with optimize("O1/2"), but unrolling is still done??? Passing "no-unroll-loops" in the attribute also does not work.

It seems like the command-line O3 overrides unrolling or something? The resulting assembly does use VMULPS/VADDPS and not VFMADDPS for O1/O2, but O3 causes an unrolling factor of 6 or so [it should be 1].

New example: https://gcc.godbolt.org/z/qb3d5M
"""
[Bug c++/98317] New: Vector Extensions aligned(1) not generating unaligned loads/stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

            Bug ID: 98317
           Summary: Vector Extensions aligned(1) not generating unaligned loads/stores
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

Depending on where aligned(1) appears in the attribute list, GCC generates either movaps or movups.

    typedef float float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
    typedef float float128_tv2 __attribute__ ((vector_size(16), aligned(1)));

float128_tv1 produces MOVAPS.
float128_tv2 produces MOVUPS.

It seems like the ordering of the arguments changes the assembly.

https://gcc.godbolt.org/z/5qs7e7

Both GCC 10.2 and 9.2 seem to have this issue. Unless this is already documented, it can cause serious problems: if the memory is unaligned and an aligned load/store is emitted instead, the access can fault.
[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |danielhanchen at gmail dot com

--- Comment #1 from Daniel Han-Chen ---
https://gcc.godbolt.org/z/sGWevT

I also tried separating the __attribute__s:

    typedef float float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
    typedef float float128_tv2 __attribute__ ((vector_size(16), aligned(1)));
    typedef float float128_tv3 __attribute__((aligned(1))) __attribute__ ((vector_size(16)));
    typedef float float128_tv4 __attribute__ ((vector_size(16))) __attribute__((aligned(1)));

Putting aligned first still fails, whether in one attribute list (tv1) or as a separate __attribute__ (tv3).
[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

--- Comment #3 from Daniel Han-Chen ---
Oh OK then. It's because I was trying to do unaligned loads by following:
https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions

It mentions using

    typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16

which works, though the other ordering does not. But I like your solution of declaring the type as aligned(1) separately.
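A minimal sketch of one way to read "declaring the type as aligned(1) separately": define the vector type first, then add the reduced alignment in a second typedef. The exact form of the suggested fix is not quoted in this thread, so this shape is an assumption. The names `v4sf`, `v4sf_u`, `load_unaligned` and `lane_sum` are hypothetical, and `may_alias` is an addition of mine so that reading a float array through the vector type is permitted under strict aliasing.

```cpp
#include <cassert>

// Step 1: the plain 16-byte vector type, naturally aligned.
typedef float v4sf __attribute__((vector_size(16)));

// Step 2 (assumed form of the fix): a separate typedef that lowers the
// alignment to 1 byte. may_alias is an extra assumption, see lead-in.
typedef v4sf v4sf_u __attribute__((aligned(1), may_alias));

// Loads four floats from an address that need not be 16-byte aligned.
v4sf load_unaligned(const float* p) {
    assert(p != nullptr);
    return *reinterpret_cast<const v4sf_u*>(p);
}

// Sums the four lanes; handy for checking the load.
float lane_sum(v4sf v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

Calling `load_unaligned(buf + 1)` on a 16-byte-aligned float buffer exercises the misaligned case; the generated instruction should be checked in the assembly, since that is what the report is about.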
[Bug c++/98348] New: GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

            Bug ID: 98348
           Summary: GCC 10.2 AVX512 Mask regression from GCC 9
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

In GCC 9, vector comparisons on 128-bit and 256-bit vectors on an AVX512 machine used vpcmpeqd without any masks. In GCC 10, AVX512 mask instructions are used for 128-bit and 256-bit vectors.

https://gcc.godbolt.org/z/1sPzM5

GCC 10 should follow GCC 9 for vector comparisons when a mask is not needed. The reason is that https://uops.info/table.html shows that using mask registers gives 128/256/512-bit operations a throughput of 1 and a latency of 3, whereas a direct vector comparison has a throughput of 2 and a latency of 1.
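A rough sketch of the kind of comparison being discussed, written with GCC's generic vector extensions. The code behind the Godbolt link is not reproduced in this report, so the type `v4si` and the functions `lanes_equal` and `count_equal` are hypothetical. With the vector extensions, `a == b` yields a vector whose lanes are -1 where the inputs match and 0 elsewhere, so no mask register is semantically required.

```cpp
// 128-bit vector of four 32-bit integers.
typedef int v4si __attribute__((vector_size(16)));

// Lane-wise equality: each result lane is -1 (all bits set) if equal, else 0.
v4si lanes_equal(v4si a, v4si b) {
    return a == b;
}

// Counts matching lanes by summing the -1 entries and negating.
int count_equal(v4si a, v4si b) {
    v4si m = lanes_equal(a, b);
    return -(m[0] + m[1] + m[2] + m[3]);
}
```

Compiling `lanes_equal` for an AVX512 target with GCC 9 and GCC 10 and comparing the assembly is one way to look for the difference described above; the output is not asserted here.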
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #1 from Daniel Han-Chen ---
I also just noticed that in GCC 10 an extra movdqa is emitted, which is also not necessary.
[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WORKSFORME
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Daniel Han-Chen ---
Jakub mentioned his solution, so all good now.
[Bug c++/98387] New: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

            Bug ID: 98387
           Summary: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

https://gcc.godbolt.org/z/493ead

Since version 6.1, GCC cannot inline _mm_cmp_ps on targets supporting only SSE (Nehalem, Tremont etc). From SandyBridge onwards, everything inlines fine.

_mm_cmp_ps is called by passing it as a function argument (ie to a function taking an auto parameter). All SSE-only targets emit a jmp to _mm_cmp_ps, but it should be inlined. -O3 -ffast-math is also used, and the function is declared inline.
[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #1 from Daniel Han-Chen ---
Oh, I just noticed _mm_cmp_ps isn't actually supported for SSE targets even in Intel's Intrinsics Guide: [_mm_cmp_ps was first supported in AVX]

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236,827,33,5224,447,456,4085,3864,5224,4179,4118,4115,4115,4121,3864,3870,5579,2030,3319,2809,4127,5156,4179,4201,3536,3539,3533,2184,3505,3533,3542,3505,3533,1606,4174,2809,5576,5578,2063,3895,3893,2484,3864,4076,3864,687,689,689,3544,771,1648,1647,5878,5903,743&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&text=cmpps

    error: inlining failed in call to always_inline '__m128 _mm_cmp_ps(__m128, __m128, int)': target specific option mismatch
      390 | _mm_cmp_ps (__m128 __X, __m128 __Y, const int __P)

_mm_cmp[*]_ps, ie _mm_cmpeq_ps and derivatives, successfully inline.
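As a small illustration of the SSE-level alternative mentioned above, the sketch below uses `_mm_cmpeq_ps` from `<xmmintrin.h>`, which is available on SSE-only targets, instead of the AVX intrinsic `_mm_cmp_ps`. The function name `equal_lane_mask` is hypothetical, and the code only builds on x86 targets with SSE enabled.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_cmpeq_ps, _mm_movemask_ps

// Returns a 4-bit mask; bit i is set when lane i of a equals lane i of b.
int equal_lane_mask(__m128 a, __m128 b) {
    __m128 eq = _mm_cmpeq_ps(a, b);  // all-ones lanes where equal
    return _mm_movemask_ps(eq);      // collect the sign bit of each lane
}
```

For example, comparing `_mm_setr_ps(1, 2, 3, 4)` with `_mm_setr_ps(1, 0, 3, 0)` sets bits 0 and 2 of the result. Code that needs the general `_mm_cmp_ps` form has to be compiled with AVX enabled, as the error message indicates.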
[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #3 from Daniel Han-Chen ---
(In reply to H.J. Lu from comment #2)
> _mm_cmp_ps is an AVX intrinsic.

Yep, noticed _mm_cmp_ps is only in AVX. The weird part is that it actually causes no errors when used on SSE-only targets [ie Nehalem], and GCC continues compiling.

Is this supposed to be normal behavior?
[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #5 from Daniel Han-Chen ---
(In reply to H.J. Lu from comment #4)
> (In reply to Daniel Han-Chen from comment #3)
> > (In reply to H.J. Lu from comment #2)
> > > _mm_cmp_ps is an AVX intrinsic.
> >
> > Yep noticed _mm_cmp_ps is only in AVX. The weird part is it actually causes
> > no errors when used on SSE only targets [ie Nehalem], and GCC continues
> > compiling.
> >
> > Is this supposed to be normal behavior?
>
> GCC treats it like an undefined function.

Thanks! Sorry, I probably asked some really dumb questions. But thanks for taking the time to answer them! :) Appreciate it!