https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71903

            Bug ID: 71903
           Summary: Wrong opcode using x86 SSE _mm_cmpge_ps intrinsics
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: carlosrafael.prog at gmail dot com
  Target Milestone: ---

I have the following code:

float *previousM = ...;
float *fft = ...;

for (int32_t i = 0; i < 256; i += 8) {
        __m128 m0 = _mm_load_ps(previousM);
        __m128 m1 = _mm_load_ps(previousM + 4);
        previousM += 8;

        __m128 old0 = _mm_load_ps(fft);
        __m128 old1 = _mm_load_ps(fft + 4);

        __m128 geq0 = _mm_cmpge_ps(m0, old0);
        __m128 geq1 = _mm_cmpge_ps(m1, old1);
        ...
}

Since the code was behaving rather strangely, I decided to generate and read
its disassembly (below is the snippet that drew my attention):

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__)) _mm_cmpge_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_cmpgeps ((__v4sf)__A, (__v4sf)__B);
  9f:   0f c2 dd 02             cmpleps %xmm5,%xmm3

Please note that this is not a bug in the disassembler: the Intel docs state
that CMPLEPS xmm1, xmm2 is encoded as CMPPS xmm1, xmm2, 2.
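
The encoding described above can be checked by hand against the raw bytes in the disassembly. The following is a minimal sketch, not part of any toolchain: `decode_cmpps`, `struct cmpps_fields` and `cmpps_predicate_names` are hypothetical names, and the helper assumes only the 4-byte register-to-register form of CMPPS (0F C2 /r ib) with no prefixes.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of a register-to-register CMPPS instruction (0F C2 /r ib). */
struct cmpps_fields {
    int dst;        /* ModRM.reg: destination, also the first operand (xmmN) */
    int src;        /* ModRM.rm: second operand (xmmN) */
    int predicate;  /* imm8: comparison predicate, 0..7 for plain SSE */
};

/* Predicate names selected by imm8 values 0..7. */
static const char *const cmpps_predicate_names[8] = {
    "eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"
};

/* Hypothetical helper: returns 0 on success, -1 if the bytes are not a
   4-byte reg-reg CMPPS. Memory operands and prefixes are not handled. */
int decode_cmpps(const uint8_t *bytes, size_t len, struct cmpps_fields *out)
{
    if (len != 4 || bytes[0] != 0x0f || bytes[1] != 0xc2)
        return -1;

    uint8_t modrm = bytes[2];
    if ((modrm >> 6) != 3)   /* mod must be 11b for a register operand */
        return -1;
    if (bytes[3] > 7)        /* plain SSE defines only predicates 0..7 */
        return -1;

    out->dst = (modrm >> 3) & 7;
    out->src = modrm & 7;
    out->predicate = bytes[3];
    return 0;
}
```

Under these assumptions, the bytes `0f c2 dd 02` decode to destination xmm3, source xmm5 and predicate `le`, meaning xmm3 receives the per-lane result of xmm3 <= xmm5. In the AT&T syntax used by the disassembly, the destination register is written last.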

Also, this is not some weird optimization or anything else, because even if the
compiler had decided to switch m0 with old0, the opposite of >= (ge) is < (lt)
and not <= (le), as the disassembly shows.
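
The operand-swap reasoning above can be tested with a scalar model of a single lane. This is only a sketch: the function names are hypothetical, and it assumes that each lane of the packed compares behaves like the corresponding C comparison on floats, producing all ones when true and zero when false.

```c
#include <stdint.h>

/* One lane of a packed compare result: all ones if true, zero if false. */
static uint32_t lane_mask(int cond)
{
    return cond ? 0xffffffffu : 0u;
}

/* Scalar stand-ins (assumed equivalents) for one lane of the ge, le
   and lt packed compares, taking operands in the order (a, b). */
uint32_t lane_ge(float a, float b) { return lane_mask(a >= b); }
uint32_t lane_le(float a, float b) { return lane_mask(a <= b); }
uint32_t lane_lt(float a, float b) { return lane_mask(a < b); }
```

In this model, `lane_ge(a, b)` and `lane_le(b, a)` return the same mask for any ordered pair of inputs, while `lane_lt(b, a)` differs from them only when `a == b`. Equal inputs are therefore the case that distinguishes a swapped-operand `le` from a swapped-operand `lt`.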

In order to make the code work properly, I manually replaced these two lines in
my code

        __m128 geq0 = _mm_cmpge_ps(m0, old0);
        __m128 geq1 = _mm_cmpge_ps(m1, old1);

with these two lines

        __m128 geq0 = _mm_cmplt_ps(old0, m0);
        __m128 geq1 = _mm_cmplt_ps(old1, m1);

After that change, the disassembly became

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__)) _mm_cmplt_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_cmpltps ((__v4sf)__A, (__v4sf)__B);
  8d:   0f c2 e3 01             cmpltps %xmm3,%xmm4

Some extra information:
- I am using the gcc bundled with the Android build tools, and since there are
two executable files, I do not know for sure whether the version of gcc being
used is "4.8" or "4.9 20140827"
- I am compiling on 64-bit Windows 10, targeting a 32-bit x86 Android app
- Both gcc versions (4.8 and 4.9) are inside the folder windows-x86_64 (which
makes me believe I am using a 64-bit build of gcc)
