[Bug c++/96535] New: GCC 10 ignoring function __attribute__ optimize for all x86

2020-08-08 Thread danielhanchen at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96535

Bug ID: 96535
   Summary: GCC 10 ignoring function __attribute__ optimize for
all x86
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

Hey GCC team!

In GCC 10.x, it seems like any argument to __attribute__((optimize(...))) is
ignored at the function level. GCC 9.x and earlier do not have this issue. [Or
maybe only -funroll-loops is ignored; not 100% sure.]

Detailed example at: https://gcc.godbolt.org/z/PTK4WE

3 scenarios:

1. [GCC 10.2: -O2 -ffast-math -march=haswell -std=c++2a -fopenmp] +
[__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES NOT unroll.

2. [GCC 10.2: -funroll-loops -O2 -ffast-math -march=haswell -std=c++2a
-fopenmp] + [__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES
unroll.

3. [GCC 9.3:  -O2 -ffast-math -march=haswell -std=c++2a -fopenmp] +
[__attribute__((optimize("O2","fast-math","unroll-loops")))] DOES unroll.

It seems that in GCC 10.x, you have to place -funroll-loops on the command
line itself, and the function-level __attribute__ is ignored?

PS: the code in godbolt is a matrix multiplication kernel. It multiplies 1
column * 1 row of a matrix.

[Bug c++/96535] [10/11 Regression] GCC 10 ignoring function __attribute__ optimize for all x86 since r11-1019

2020-08-11 Thread danielhanchen at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96535

--- Comment #3 from Daniel Han-Chen  ---
Oh lol, I was just about to add a comment about further experimentation.

Seems like Jakub and Hongtao have found the root cause of the issue?

Anyway, here's what I was going to write [probably not necessary anymore, so no need to read]:

"""
Anyway, from more experimentation, it seems like O1, O2, O3 are not ignored,
but unrolling only gets turned on via O3. So if one passes O1 or O2 in
__attribute__ but the command line is -O3, the function still unrolls.

For example, when the command line is -O3, in GCC 9
__attribute__((optimize("O1"))) or ("O2") causes the code to use VMULPS and
VADDPS with an unroll factor of 1.

However, in GCC 10.x with -O3 on the command line, VMULPS and VADDPS are used
with optimize("O1") / ("O2"), but unrolling is still done??? Passing
"no-unroll-loops" in the attribute also does not work.

It seems like the command-line -O3 overrides the unrolling decision or
something? The resulting assembly does use VMULPS/VADDPS and not VFMADDPS for
O1/O2, but O3 causes an unroll factor of 6 or so [it should be 1].

https://gcc.godbolt.org/z/qb3d5M for new example.
"""

[Bug c++/98317] New: Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Bug ID: 98317
   Summary: Vector Extensions aligned(1) not generating unaligned
loads/stores
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

The ordering of aligned(1) within the attribute list determines whether GCC
generates MOVAPS (aligned) or MOVUPS (unaligned).

typedef float float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
typedef float float128_tv2 __attribute__ ((vector_size(16), aligned(1)));

float128_tv1 generates MOVAPS.
float128_tv2 generates MOVUPS.

It seems like the ordering of the arguments changes the assembly.

https://gcc.godbolt.org/z/5qs7e7

GCC 10.2 and 9.2 both seem to have this issue.
Unless this is already documented behavior, it can cause serious problems if
memory is unaligned and an aligned load/store is used instead.

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen  changed:

   What|Removed |Added

 CC||danielhanchen at gmail dot com

--- Comment #1 from Daniel Han-Chen  ---
https://gcc.godbolt.org/z/sGWevT

I also tried separating the __attribute__s


typedef float float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
typedef float float128_tv2 __attribute__ ((vector_size(16), aligned(1)));
typedef float float128_tv3 __attribute__ ((aligned(1)))
__attribute__ ((vector_size(16)));
typedef float float128_tv4 __attribute__ ((vector_size(16)))
__attribute__ ((aligned(1)));


aligned(1) as the first attribute still fails.

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

--- Comment #3 from Daniel Han-Chen  ---
Oh ok then.

It's because I was trying to do unaligned loads by following:
https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions

That answer suggests typedef char __attribute__ ((vector_size (16), aligned
(1))) unaligned_byte16, which works, though the other ordering does not.

But I like your solution of declaring the type as aligned(1) separately.

[Bug c++/98348] New: GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

Bug ID: 98348
   Summary: GCC 10.2 AVX512 Mask regression from GCC 9
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

In GCC 9, vector comparisons on 128- and 256-bit vectors on an AVX512 machine
used VPCMPEQD without any masks.

In GCC 10, for 128-bit and 256-bit vectors, AVX512 mask instructions are used.
https://gcc.godbolt.org/z/1sPzM5

GCC 10 should follow GCC 9 for vector comparisons when a mask is not needed.

The reason is that https://uops.info/table.html shows that compares going
through mask registers give 128/256/512-bit operations a throughput of 1 per
cycle and a latency of 3, whereas a direct vector comparison has a throughput
of 2 per cycle and a latency of 1.

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #1 from Daniel Han-Chen  ---
I also just noticed that in GCC 10 an extra MOVDQA is emitted, which is also
unnecessary.

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen  changed:

   What|Removed |Added

 Resolution|--- |WORKSFORME
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Daniel Han-Chen  ---
Jakub mentioned his solution, so all good now.

[Bug c++/98387] New: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

Bug ID: 98387
   Summary: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

https://gcc.godbolt.org/z/493ead

GCC since version 6.1 cannot inline _mm_cmp_ps on targets supporting only SSE
(Nehalem, Tremont, etc.). From Sandy Bridge onward, everything inlines fine.

_mm_cmp_ps is called by passing it as a function argument (i.e. to a function
taking it via auto).

On all SSE-only targets the code uses a JMP to _mm_cmp_ps, but it should be
inlined.

-O3 -ffast-math is also used, and the function is declared inline.

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #1 from Daniel Han-Chen  ---
Oh, I just noticed _mm_cmp_ps isn't actually supported for SSE targets even in
Intel's Intrinsics Guide. [_mm_cmp_ps was first supported in AVX.]

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236,827,33,5224,447,456,4085,3864,5224,4179,4118,4115,4115,4121,3864,3870,5579,2030,3319,2809,4127,5156,4179,4201,3536,3539,3533,2184,3505,3533,3542,3505,3533,1606,4174,2809,5576,5578,2063,3895,3893,2484,3864,4076,3864,687,689,689,3544,771,1648,1647,5878,5903,743&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&text=cmpps



error: inlining failed in call to always_inline '__m128 _mm_cmp_ps(__m128,
__m128, int)': target specific option mismatch
  390 | _mm_cmp_ps (__m128 __X, __m128 __Y, const int __P)


_mm_cmp[*]_ps, i.e. _mm_cmpeq_ps and derivatives, successfully inline.

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-19 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #3 from Daniel Han-Chen  ---
(In reply to H.J. Lu from comment #2)
> _mm_cmp_ps is an AVX intrinsic.

Yep, I noticed _mm_cmp_ps is only in AVX. The weird part is that it actually
causes no errors when used on SSE-only targets [i.e. Nehalem], and GCC
continues compiling.

Is this supposed to be normal behavior?

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-19 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #5 from Daniel Han-Chen  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Daniel Han-Chen from comment #3)
> > (In reply to H.J. Lu from comment #2)
> > > _mm_cmp_ps is an AVX intrinsic.
> > 
> > Yep, I noticed _mm_cmp_ps is only in AVX. The weird part is that it
> > actually causes no errors when used on SSE-only targets [i.e. Nehalem],
> > and GCC continues compiling.
> > 
> > Is this supposed to be normal behavior?
> 
> GCC treats it like an undefined function.

Thanks! Sorry, I might have asked some really dumb questions. But thanks for
taking the time to answer them! :) Appreciate it!