https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87455

            Bug ID: 87455
           Summary: sse_packed_single_insn_optimal is suboptimal on Zen
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fanael4 at gmail dot com
  Target Milestone: ---

GCC enables -mtune-ctrl=sse_packed_single_insn_optimal by default for
-mtune=znver1, even though that microarchitecture doesn't like it, for the same
reason Intel's microarchitectures don't: domain-crossing operations incur
additional latency. For example, using xorps on integer data costs one cycle
more than using pxor.
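
To illustrate why the compiler is free to pick either instruction, here is a
minimal sketch showing that a bitwise XOR gives the same bits whether it is
expressed through the integer intrinsic or the single-precision one. The helper
names (xor_int_domain, xor_float_domain, same_bits) are made up for this
illustration and are not part of GCC or the original report. Which instruction
is actually emitted for each helper depends on the compiler and tuning flags.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_xor_si128, casts

// Hypothetical helper: XOR written with the integer intrinsic.
// The natural encoding for this is pxor, but the compiler may choose
// another bitwise-equivalent instruction depending on tuning.
inline __m128i xor_int_domain(__m128i a, __m128i b) {
    return _mm_xor_si128(a, b);
}

// Hypothetical helper: the same XOR written with the single-precision
// intrinsic (natural encoding: xorps). The casts only reinterpret the
// register contents and do not convert values.
inline __m128i xor_float_domain(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_xor_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

// Hypothetical helper: true if all 128 bits of a and b are equal.
inline bool same_bits(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

Both helpers return identical results for any input, so the choice between
xorps and pxor affects only the encoding and the execution domain, not
correctness.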

Example code:

#include <immintrin.h>

int main() {
    auto x = _mm_setr_epi32(1, 2, 3, 4);
    auto y = _mm_setr_epi32(5, 6, 7, 8);
    auto z = _mm_setr_epi32(9, 10, 11, 12);

    for(int i = 0; i < 1000000000; ++i) {
        x = _mm_add_epi32(x, y);
        y = _mm_xor_si128(y, z);
        z = _mm_add_epi32(z, x);
        x = _mm_xor_si128(x, y);
        y = _mm_add_epi32(y, z);
        z = _mm_xor_si128(z, x);
    }

    asm volatile("" :: "m"(x), "m"(y), "m"(z));
}

When compiled with GCC 8.2 using -O3 -mtune=znver1, running it yields the
following perf counters:

$ perf stat -e task-clock,cycles,instructions ./a.out

 Performance counter stats for './a.out':

          1 193,69 msec task-clock:u              #    0,989 CPUs utilized      
     4 040 330 384      cycles:u                  # 3386697,723 GHz             
    10 002 005 027      instructions:u            #    2,48  insn per cycle     

       1,206801245 seconds time elapsed

       1,190625000 seconds user
       0,003995000 seconds sys

However, the code compiled with -O3 -mtune=znver1
-mtune-ctrl=^sse_packed_single_insn_optimal is significantly faster, taking
about 25% fewer cycles:

$ perf stat -e task-clock,cycles,instructions ./a.out

 Performance counter stats for './a.out':

            894,08 msec task-clock:u              #    0,998 CPUs utilized      
     3 012 492 242      cycles:u                  # 3369678,123 GHz             
    10 002 004 492      instructions:u            #    3,32  insn per cycle     

       0,895728255 seconds time elapsed

       0,894688000 seconds user
       0,000000000 seconds sys

This is on a Ryzen 5 2500U.
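
As a rough sanity check, the cycle counts above can be divided by the loop's
iteration count. The sketch below does only that arithmetic; the constant and
function names are made up for this illustration, and the totals include a
small amount of process startup overhead, so the per-iteration figures are
approximate.

```cpp
#include <cstdint>

// Iteration count of the loop in the example code.
constexpr double kIterations = 1e9;

// Hypothetical helper: average cycles spent per loop iteration.
constexpr double cycles_per_iteration(std::uint64_t cycles) {
    return static_cast<double>(cycles) / kIterations;
}

// Cycle counts copied from the two perf runs above.
constexpr double kDefaultTuning  = cycles_per_iteration(4040330384ULL);  // about 4.04
constexpr double kTuningDisabled = cycles_per_iteration(3012492242ULL);  // about 3.01

// Ratio of the two runs (about 1.34).
constexpr double kCycleRatio = kDefaultTuning / kTuningDisabled;
```

The difference works out to roughly one cycle per iteration, which is
consistent with the extra domain-crossing latency described at the top of this
report, although the measurements alone do not show where on the dependency
chain that cycle is lost.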
