https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275
Bug ID: 66275
Summary: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Created attachment 35615
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35615&action=edit
main.c, asm-test.h, and less-stripped-down intrin-pinsrw.c

gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2

w64 mingw gcc is mis-compiling functions declared with __attribute__((sysv_abi)). Registers holding args are getting clobbered with temp values, but then used as if they still held function args.

This code is mis-compiled by

  x86_64-w64-mingw32-gcc -g -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c ... main.c

cat > intrin-pinsrw.c <<"EOF"
#include <emmintrin.h>
#include <stdint.h>
#define SYSV_ABI __attribute__((sysv_abi))

union wordbytes { uint16_t w; uint8_t b[2]; };

void SYSV_ABI rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid, size_t size, const uint32_t* LH)
{
    const uint64_t *src = srcvoid;
    __m128i d, l, h;
    __m128i *dst = dstvoid;

    for (size_t i = 0; i < size/sizeof(d); i += 1) {
        uint64_t s0 = src[i*2 + 0];

#define LO(x) ((uint8_t)x)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])
        l = _mm_cvtsi32_si128( LH[      LO(s0)] );
        h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
        s0 >>= 16;
        // commented-out code that wasn't needed to trigger the bug

        d = _mm_xor_si128(l, h);
        d = _mm_xor_si128(d, dst[i]);
        _mm_store_si128(&dst[i], d);
    }
}
EOF

x86_64-w64-mingw32-gcc -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c -S

Search for "si" in the output, and notice how %rsi is used as a pointer to load src data, but is also used as a tmp:

.L5:
        movq    (%rsi,%r10), %rdx
        movzbl  %dl, %esi               ## clobbered here
        movzbl  %dh, %eax
        movd    (%rcx,%rsi,4), %xmm8
        movd    1024(%rcx,%rax,4), %xmm9
        pxor    %xmm8, %xmm9
        pxor    (%rdi,%r10), %xmm9
        movaps  %xmm9, (%rdi,%r10)
        movq    16(%rsi,%r10), %rdx     ## not restored after use as a tmp
        ...

%rdi is also clobbered and then used as a pointer again, later in the loop. In the AMD64 SysV ABI, %rdi holds the 1st arg, %rsi the 2nd, %rdx the 3rd, and %rcx the 4th. (See the attachment for a less-stripped-down version and a main.c to call it.)

-funroll-loops is needed to trigger the problem on this reduced testcase. It might not be needed with a bigger function that would run out of registers without unrolling. Even the less-stripped-down version in the attachment runs fine under WINE when compiled without -funroll-loops.

If testing by actually running the code with WINE, note that wine and wine64 seem to have incompatible requirements for the contents of ~/.wine. I eventually made a wrapper like:

  #!/bin/sh
  WINEPREFIX=~/.wine64 /usr/bin/wine64 "$@"

For anyone curious why I had this code in the first place: it is from par2 (the Reed-Solomon inner loop that multiplies a buffer of GF16 values by a constant factor, using a pre-computed lookup table). See https://github.com/pcordes/par2-asm-experiments

I have some functions written directly in asm. They're written for the Linux SysV ABI. I could make an alternate entry point for the Windows ABI, with extra mov instructions, and push/pop the extra regs that are callee-saved in the MS ABI but not in SysV. Or I could just declare their prototypes with __attribute__((sysv_abi)), and gcc will call them properly. I tried making a version of the function using intrinsics, so of course my test harness needs to call it, too.
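For illustration, this is roughly what those declarations look like; the function name and typedef below are placeholders, not the actual names from my code:

  #include <stddef.h>
  #include <stdint.h>

  #define SYSV_ABI __attribute__((sysv_abi))

  /* Placeholder prototype for a routine implemented in a separate .s file,
   * written for the SysV convention: with sysv_abi, x86_64-w64-mingw32-gcc
   * passes args in %rdi, %rsi, %rdx, %rcx instead of %rcx, %rdx, %r8, %r9. */
  void SYSV_ABI rs_process_pinsrw_asm(void *dst, const void *src, size_t size,
                                      const uint32_t *LH);

  /* A function pointer used to select an implementation has to carry the same
   * attribute, or calls through it would use the default (MS) convention. */
  typedef void (SYSV_ABI *rs_process_fn)(void *dst, const void *src,
                                         size_t size, const uint32_t *LH);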
Since I call via a function pointer, I had to declare it the same way. It was working fine until a recent change to the code of the function. Old version that didn't tickle this bug (partial diff):

-    uint64_t s0, s1;
-    for (size_t i = 0; i < size ; i+=16) {
-        s0 = *((uint64_t*) ((char*)srcvoid   + i));   // byte address math, 64bit load
-        s1 = *((uint64_t*) ((char*)srcvoid+8 + i));   // byte address math, 64bit load
-        d = _mm_xor_si128(d, *(dst+i/16));
-        _mm_store_si128(dst+i/16, d);

I wrote it in that ugly way initially because I was basically porting my asm code to intrinsics.

BTW, the results were terrible. gcc generates ridiculously bad code for getting the src bytes into zero-extended 64-bit regs, for use as scaled offsets in an address, compared to:

        movzx   %dl, %eax
        movzx   %dh, %ebx
        shr     $16, %rdx
        use rax/rbx
        movzx   %dl, %eax
        movzx   %dh, %ebx
        shr     $16, %rdx
        use rax/rbx
        ...

gcc never just shifts the reg holding the src data. Instead it copies, and shifts the copy by $16, $32, or $48. gcc's code is about 35% slower than the hand-written version, even letting it use AVX so it doesn't emit useless movdqa instructions when it doesn't realize that an old value is no longer needed. (Just un-comment the mostly-commented loop body in the testcase, attachment version.) Anyway, slow code is off-topic; this bug is about wrong code!
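To be concrete about the pattern I mean, here is a rough C sketch of the per-qword extraction (placeholder names; the xor-accumulate is only a stand-in for "use rax/rbx" — the real loop feeds the looked-up values into the SSE xor as in the testcase above):

  #include <stdint.h>

  /* Sketch only: consume one 64-bit source word as four 16-bit GF16 words,
   * using the low and high byte of each as lookup-table indices, and a single
   * shift per word, instead of gcc's copy-and-shift-by-$16/$32/$48. */
  static uint32_t extract_and_lookup(uint64_t s, const uint32_t *LH)
  {
      uint32_t acc = 0;
      for (int w = 0; w < 4; w++) {
          unsigned lo = (uint8_t) s;         /* wanted: movzx %dl, %eax */
          unsigned hi = (uint8_t)(s >> 8);   /* wanted: movzx %dh, %ebx */
          acc ^= LH[lo] ^ LH[256 + hi];      /* stand-in use of rax/rbx */
          s >>= 16;                          /* wanted: shr $16, %rdx   */
      }
      return acc;
  }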