https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87455
            Bug ID: 87455
           Summary: sse_packed_single_insn_optimal is suboptimal on Zen
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fanael4 at gmail dot com
  Target Milestone: ---

GCC by default enables -mtune-ctrl=sse_packed_single_insn_optimal on
-mtune=znver1, even though that microarchitecture doesn't like it for the same
reason Intel's microarchitectures don't: there's additional latency for
domain-crossing operations. Using e.g. xorps for integer data costs one cycle
more than using pxor.

Example code:

#include <immintrin.h>

int main()
{
    auto x = _mm_setr_epi32(1, 2, 3, 4);
    auto y = _mm_setr_epi32(5, 6, 7, 8);
    auto z = _mm_setr_epi32(9, 10, 11, 12);

    for(int i = 0; i < 1000000000; ++i)
    {
        x = _mm_add_epi32(x, y);
        y = _mm_xor_si128(y, z);
        z = _mm_add_epi32(z, x);
        x = _mm_xor_si128(x, y);
        y = _mm_add_epi32(y, z);
        z = _mm_xor_si128(z, x);
    }

    asm volatile("" :: "m"(x), "m"(y), "m"(z));
}

When compiled with GCC 8.2 using -O3 -mtune=znver1, running it yields the
following perf counters:

$ perf stat -e task-clock,cycles,instructions ./a.out

 Performance counter stats for './a.out':

          1 193,69 msec task-clock:u      #    0,989 CPUs utilized
     4 040 330 384      cycles:u          # 3386697,723 GHz
    10 002 005 027      instructions:u    #    2,48  insn per cycle

       1,206801245 seconds time elapsed

       1,190625000 seconds user
       0,003995000 seconds sys

However, the code compiled with
-O3 -mtune=znver1 -mtune-ctrl=^sse_packed_single_insn_optimal is significantly
faster:

$ perf stat -e task-clock,cycles,instructions ./a.out

 Performance counter stats for './a.out':

            894,08 msec task-clock:u      #    0,998 CPUs utilized
     3 012 492 242      cycles:u          # 3369678,123 GHz
    10 002 004 492      instructions:u    #    3,32  insn per cycle

       0,895728255 seconds time elapsed

       0,894688000 seconds user
       0,000000000 seconds sys

This is on a Ryzen 5 2500U.
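
For illustration only, the domain crossing at issue can be written out explicitly with intrinsics. The sketch below is not part of the original test case; the helper names (xor_int_domain, xor_float_domain, same_bits, xor_forms_agree) are hypothetical. It only shows that the integer-domain and float-domain forms of XOR produce bit-identical results, so choosing between pxor and xorps is purely a question of latency. Note that the compiler remains free to emit either instruction for either helper, depending on tuning, which is exactly the behavior this report is about.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>      // std::memcmp

// XOR expressed in the integer domain. The intrinsic nominally corresponds
// to pxor, but the compiler may still pick xorps depending on -mtune.
inline __m128i xor_int_domain(__m128i a, __m128i b)
{
    return _mm_xor_si128(a, b);
}

// The same XOR routed through the float domain. The casts only reinterpret
// the bits; _mm_xor_ps nominally corresponds to xorps.
inline __m128i xor_float_domain(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_xor_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

// Hypothetical helper: true when two vectors are bit-for-bit equal.
inline bool same_bits(__m128i a, __m128i b)
{
    return std::memcmp(&a, &b, sizeof a) == 0;
}

// Checks both forms on the y and z values used in the test case above.
inline bool xor_forms_agree()
{
    const __m128i y = _mm_setr_epi32(5, 6, 7, 8);
    const __m128i z = _mm_setr_epi32(9, 10, 11, 12);
    return same_bits(xor_int_domain(y, z), xor_float_domain(y, z));
}
```

Because the results are identical, inspecting the generated assembly (for example with -S) is the only way to tell which instruction was actually selected for a given set of tuning flags.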