https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072
           Bug ID: 67072
          Summary: Slow code generated for getting each byte of a 64bit
                   register as a LUT index.
          Product: gcc
          Version: 4.9.2
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: rtl-optimization
         Assignee: unassigned at gcc dot gnu.org
         Reporter: peter at cordes dot ca
 Target Milestone: ---

gcc produces much slower code than what I wrote by hand, for a function that
does LUT lookups for each byte of the source data.

intrinsics version (bad code):
https://github.com/pcordes/par2-asm-experiments/blob/master/intrin-pinsrw.c
hand-written asm version (good code):
https://github.com/pcordes/par2-asm-experiments/blob/master/asm-pinsrw.s

Both versions maintain two dependency chains to avoid latency bottlenecks.  My
hand-written asm uses movzx from dl and dh, then shifts by 16 to bring the
next two bytes into dl/dh:

    movzx  %dl, %eax
    movzx  %dh, %ebx
    shr    $16, %rdx      # set up for the next block of this dep chain
    pinsrw $1, 0x0000(%rsi, %rax, 4), %xmm0
    pinsrw $1, 0x0000(%rdi, %rbx, 4), %xmm1
    ... repeat for $2, etc.

Here's a cut-down version that doesn't need any extra headers:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

union wordbytes { uint16_t w; uint8_t b[2]; };

void rs_process_pinsrw_intrin(void *dstvoid, const void *srcvoid, size_t size,
                              const uint32_t *LH)
{
    const uint64_t *src = srcvoid;
    __m128i d, l, h;
    __m128i *dst = dstvoid;

    for (size_t i = 0; i < size / sizeof(d); i += 1) {
#define LO(x) ((uint8_t)x)
//#define HI(x) ((uint8_t)((x >> 8) & 0xff))
//#define HI(x) ( (uint8_t) (((uint16_t)x) >> 8) )
// This gets gcc to emit movz %bh, %eax, unlike the & 0xff version:
// u8(((u16) s) >> 8)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])
// fastest code, but still horrible; WAY too much useless mov

        uint64_t s0 = src[i*2 + 0];
        uint64_t s1 = src[i*2 + 1];

        // movd the lookups for the low/high byte of the first word
        l = _mm_cvtsi32_si128( LH[      LO(s0)] );
        h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 4);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 4);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 1);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 1);
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 5);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 5);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 2);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 2);
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 6);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 6);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 3);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 3);
        l = _mm_insert_epi16(l, LH[      LO(s1)], 7);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 7);
#undef LO
#undef HI

        d = _mm_xor_si128(l, h);  // gcc 4.9 uses a 3rd xmm reg for no reason, even with l instead of d
        d = _mm_xor_si128(d, dst[i]);
        _mm_store_si128(&dst[i], d);
    }
}

Compiled with:

    gcc -std=gnu11 -O3 -o- -S intrin-pinsrw-simple.c 2>&1 | less

gcc 4.9.2's output for this includes a lot of extra mov instructions.  Instead
of shifting s0 and s1 in place in the same register, it copies them to another
register and then shifts that copy by 16, 32, or 48.  Maybe gcc would have
done better if it had decided to put s0 and s1 into registers with an
addressable high-8 half (%ah, %bh, %ch, %dh) instead of r9 and r14.

It runs 35% slower than my hand-tuned asm version on Intel Sandybridge.  Uop
throughput is the bottleneck here (the limit is 4 fused-domain uops per
cycle): the instruction mix is loads plus ALU uops, and the dep chains are
short enough not to be the limiting factor.
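In case it helps, here is an even smaller scalar test case (my own reduction,
not one of the attachments; the name lut_bytes is made up) that isolates just
the byte-extraction pattern without any SSE.  The codegen question is the same
as above: does gcc keep s in one register and shift it right by 16 between
chunks, so it can movzbl straight from the low/high byte, or does it copy s to
a scratch register and shift that copy by 16/32/48 for every extraction?  I
haven't verified which of the two gcc 4.9.2 picks for this exact function.

#include <stddef.h>
#include <stdint.h>

/* LO/HI use the plain shift form here; the union-based HI() from the full
 * version above can be swapped in to compare the byte-extraction sequence. */
#define LO(x) ((uint8_t)(x))
#define HI(x) ((uint8_t)(((uint16_t)(x)) >> 8))

uint32_t lut_bytes(const uint64_t *src, size_t n, const uint32_t *LH)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = src[i];
        /* four 16-bit chunks per 64-bit word; shift s in place between them */
        acc ^= LH[LO(s)] ^ LH[256 + HI(s)];   s >>= 16;
        acc ^= LH[LO(s)] ^ LH[256 + HI(s)];   s >>= 16;
        acc ^= LH[LO(s)] ^ LH[256 + HI(s)];   s >>= 16;
        acc ^= LH[LO(s)] ^ LH[256 + HI(s)];
    }
    return acc;
}
#undef LO
#undef HI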
My asm version runs more PINSRW xmm dep chains in parallel and merges them at
the end with punpcklqdq (sketched below), but that change only gained a couple
percent, IIRC.
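For reference, an intrinsics sketch of that punpcklqdq merge (this is not the
actual code in asm-pinsrw.s, and the helper names lookup_lo / merge_halves are
made up): each chain builds its four low-byte lookups into word slots 0..3 of
its own register, and _mm_unpacklo_epi64 (punpcklqdq) glues two such halves
into one 128-bit vector only at the end, so the chains stay independent until
the merge.  The high-byte (LH[256 + ...]) chains would be built the same way.

#include <emmintrin.h>
#include <stdint.h>

/* Sketch only: low-byte lookups for the four 16-bit chunks of one 64-bit
 * source word, packed into word slots 0..3 of an independent register. */
static inline __m128i lookup_lo(const uint32_t *LH, uint64_t s)
{
    __m128i v = _mm_cvtsi32_si128( LH[(uint8_t)s] );   /* slot 0 (movd) */
    s >>= 16;
    v = _mm_insert_epi16(v, LH[(uint8_t)s], 1);        /* slot 1 */
    s >>= 16;
    v = _mm_insert_epi16(v, LH[(uint8_t)s], 2);        /* slot 2 */
    s >>= 16;
    v = _mm_insert_epi16(v, LH[(uint8_t)s], 3);        /* slot 3 */
    return v;
}

/* punpcklqdq: low qword of each half-vector combined into one 128-bit result. */
static inline __m128i merge_halves(__m128i half0, __m128i half1)
{
    return _mm_unpacklo_epi64(half0, half1);
}

/* usage: l = merge_halves(lookup_lo(LH, s0), lookup_lo(LH, s1)); */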