https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
Bug ID: 80833
Summary: 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i?86-*-*

This affects 64-bit atomic loads/stores as well as _mm_set_epi64x intrinsics.
gcc -m32 copies int64_t data into xmm registers with the worst possible
strategy: two 32-bit stores followed by a 64-bit vector load, causing a
store-forwarding stall.

Similarly, gcc's store/reload strategy for getting 64-bit integers back out of
xmm registers may not be optimal on Intel CPUs (but it doesn't cause a
store-forwarding stall).

#include <immintrin.h>
__m128i combine64(int64_t a, int64_t b) {
  return _mm_set_epi64x(b, a|1);
}

b is loaded directly from memory, but the low half of `a` is modified so it
can't be, letting us observe gcc's int64 -> xmm strategy.

gcc8-snapshot -m32 -march=haswell -mno-avx -O3 emits:

        subl    $28, %esp
        movl    32(%esp), %eax
        movl    36(%esp), %edx   # the untouched upper half of `a`
        orl     $1, %eax         # `a` is in integer regs.  The next three lines are gcc's typical pattern for int64 -> xmm
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        movq    8(%esp), %xmm0   # guaranteed store-forwarding stall, except with -mtune=atom
        movhps  40(%esp), %xmm0  # store-forwarding stall if the caller used scalar stores
        addl    $28, %esp
        ret

A slight improvement would be to orl $1, 4(%esp) to do a|1 in place instead of
copying it.  But that still has an extra load-store round trip, so it's not
good even on Atom, where it wouldn't cause a store-forwarding stall.

-----

For data coming from integer registers, we should clearly use whatever
strategy is optimal for _mm_set_epi32, as discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820.  movd / pinsrd should be
good for -mtune=intel and for Ryzen.  (But in a case like this, with one
simple integer operation on data starting in memory, avoiding integer
registers entirely by doing that operation with a vector-integer instruction
is worth considering; see below.)

----

For integers coming from memory, it matters a lot whether we expect them to
have been written recently with narrow stores.  The exception is -mtune=atom,
which doesn't stall for narrow stores -> wide load.  e.g. -mtune=atom could
use:

        mov     $1, %eax
        movd    %eax, %xmm0      # or load the vector constant from memory
        por     4(%esp), %xmm0
        ret

If the two integers aren't adjacent, movq / movhps is good.  Other CPUs can
use this too, when reading memory that we expect wasn't recently written with
narrow stores.

Doing 64-bit integer ops in vector regs is an even bigger win for ops with
carry from low to high, like shift or add.  For example, if it was `a++`
instead of `a|1`:

        movq    4(%esp), %xmm0   # load `a`
        pcmpeqd %xmm1, %xmm1
        psubq   %xmm1, %xmm0     # a -= -1.  1 uop vs. 3 for add/adc on Haswell
        movhps  12(%esp), %xmm0  # merge in `b`
        ret

If we were worried about store-forwarding stalls, we would get `a` into xmm0
in two halves before the psubq.  Scalar 64-bit integer ops in vector regs may
be useful in 32-bit code in some other cases too, especially when they help
with register pressure.

----

For function args, especially near the start of a function, we should assume
that anything other than separate 32-bit loads will cause a store-forwarding
stall, incurring an extra 10 to 12 cycles of latency beyond the usual
store-forwarding latency, depending on uarch.  i.e. very expensive, and worth
spending a lot of instructions to avoid if it's at all likely, maybe even when
it's not part of a long dep chain (since out-of-order execution may have
trouble hiding that much latency).
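To make the function-arg case concrete, here is a hypothetical caller (the
name and signature are made up for illustration, not part of the testcase
above) where the halves of `a` are produced separately in integer registers.
Typical -m32 code writes the 64-bit argument to the outgoing arg space with
two 32-bit stores right before the call, so a movq load of that argument
inside combine64 hits exactly this stall:

#include <stdint.h>
#include <immintrin.h>

__m128i combine64(int64_t a, int64_t b);  // the testcase above

// Hypothetical caller: `lo` and `hi` arrive as separate 32-bit values, so the
// 64-bit `a` argument naturally gets written to the stack as two 32-bit
// stores.  A 64-bit vector load of that arg inside combine64 then has to wait
// for the failed store-forwarding to resolve.
__m128i call_combine64(int32_t lo, int32_t hi, int64_t b) {
  int64_t a = ((int64_t)hi << 32) | (uint32_t)lo;
  return combine64(a, b);
}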
clang4.0 -O3 -march=sandybridge -mno-avx emits this, which is very good if we
have to assume that the function args were recently stored in 32-bit chunks:

        movl    4(%esp), %eax
        orl     $1, %eax
        movd    %eax, %xmm0
        pinsrd  $1, 8(%esp), %xmm0   # taking the unmodified half of `a` directly from the original memory location is a good optimization
        pinsrd  $2, 12(%esp), %xmm0
        pinsrd  $3, 16(%esp), %xmm0
        retl

pinsrd with a memory source is 2 fused-domain uops, but only one of them is an
ALU uop (for the shuffle port); the other is the load.  It never micro-fuses.
gcc's usual _mm_set_epi32 strategy of doing two 64-bit halves and merging with
a shuffle would also work well with memory source data.

For CPUs where int->xmm is not fast, doing the OR (or whatever other ALU
operation) with a vector instruction is even more attractive than on Intel,
even if we still have to load 32 bits at a time.  It's also good on Intel CPUs
if we can hoist the vector constant out of a loop, since int->xmm needs a
port5 uop, which competes with shuffles, especially on Haswell and later,
which have only one shuffle unit.

        movd    4(%esp), %xmm0       # a (low half)
        movd    12(%esp), %xmm2      # b
        pinsrd  $1, 16(%esp), %xmm2
        pcmpeqd %xmm1, %xmm1
        psrld   $31, %xmm1           # or load a constant from memory
        por     %xmm1, %xmm0         # a |= 1
        pinsrd  $1, 8(%esp), %xmm0   # then merge the high half of `a`, replacing the garbage in element 1
        punpcklqdq %xmm2, %xmm0
        retl

This has only 3 port5 uops, and a latency on Haswell of 3 cycles from the
first two loads being ready.  Since out-of-order CPUs typically run uops in
oldest-ready order (with adjustments for higher-latency uops to avoid
writeback conflicts), I scheduled this so the two movd loads come first,
allowing the three port5 uops to run in three consecutive cycles.  (First the
pinsrd of the high half of `b`, since the two load uops from the pinsrd
instructions should be ready the cycle after the two movd uops.)  I have no
idea whether this really helps avoid extra resource-conflict latency on the
critical path, but it can't hurt.

----------

This also affects 64-bit atomic loads and stores.

#include <atomic>
#include <stdint.h>
int64_t load64(std::atomic<int64_t> *p) {
  return p->load(std::memory_order_acquire) + 1;
}

gcc8 -m32 -mno-avx -march=haswell -O3 emits:

        subl    $12, %esp
        movl    16(%esp), %eax   # function arg
        movq    (%eax), %xmm0    # 64-bit atomic load
        movq    %xmm0, (%esp)    # gcc's store/reload strategy
        movl    (%esp), %eax
        movl    4(%esp), %edx
        addl    $1, %eax         # a++ in integer regs
        adcl    $0, %edx
        addl    $12, %esp
        ret

It would be cheaper to do the +1 with paddq or psubq while we still have the
value in xmm0, even counting the cost of generating the constant on the fly.
That takes one ALU uop for pcmpeqd, and it's off the critical path.  ADC is 2
uops on Intel CPUs before Broadwell.

A lower-latency xmm->int strategy would be:

        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx

Or without SSE4, for -mtune=sandybridge (or anything else that excludes
Nehalem and other CPUs where an FP shuffle has a bypass delay between integer
ops):

        movd      %xmm0, %eax
        movshdup  %xmm0, %xmm0   # saves 1B of code-size vs. psrldq, I think
        movd      %xmm0, %edx

Or without SSE3:

        movd    %xmm0, %eax
        psrldq  $4, %xmm0        # 1 m-op cheaper than pshufd on K8
        movd    %xmm0, %edx

movd xmm->r32 is efficient on K10 (1 m-op with 3c latency), unlike movd
r32->xmm.  On Bulldozer-family it's only 1 m-op, but has 8c latency (or 4 on
Steamroller).  Store-forwarding latency is high on Bulldozer, so
movd %xmm0, %eax / pextrd is probably a win there.

So avoiding a store/reload is probably a good strategy for -mtune=generic and
-mtune=bdver*.  -mtune=k8 should probably store/reload for this direction,
too, because movd %xmm0, %eax is 3 m-ops with 2c latency, K8 can do 2 loads
per clock, and I think forwarding both halves of an 8-byte store to two 4-byte
loads works.
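As a rough sketch of the sequence suggested above for load64 (written with
intrinsics just to illustrate the instruction selection; the function name is
made up, and this sidesteps the std::atomic API rather than implementing it):

#include <immintrin.h>
#include <atomic>
#include <stdint.h>

int64_t load64_sketch(std::atomic<int64_t> *p) {
  // movq: the 64-bit load (the same instruction gcc already uses for the
  // atomic load itself).
  __m128i v = _mm_loadl_epi64((const __m128i *)p);
  __m128i all_ones = _mm_cmpeq_epi32(v, v);        // pcmpeqd: -1, off the critical path
  v = _mm_sub_epi64(v, all_ones);                  // psubq: v -= -1, i.e. the +1
  uint32_t lo = (uint32_t)_mm_cvtsi128_si32(v);    // movd   xmm -> eax
  uint32_t hi = (uint32_t)_mm_extract_epi32(v, 1); // pextrd xmm -> edx (SSE4)
  return ((int64_t)hi << 32) | lo;
}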
------

Atomic stores are more of a problem, since two 32-bit stores can't
store-forward to a 64-bit load (except on Atom).

int64_t store64(std::atomic<int64_t> *p, int64_t a) {
  p->store(a, std::memory_order_release);
  return a;   // tempt gcc into loading into integer regs instead of movq
}

gcc -m32 -march=sandybridge -O3 emits the following, for gcc4.7 through the
gcc8 snapshot (other than a regression to fildq/fistpq with gcc4.8):

        subl    $12, %esp
        movl    20(%esp), %eax
        movl    24(%esp), %edx
        movl    16(%esp), %ecx   # at this point, the int64 and the pointer are in integer regs, as would be typical in a real function
        # The next three lines are the issue
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        vmovq   (%esp), %xmm0    # store-forwarding failure
        vmovq   %xmm0, (%ecx)
        addl    $12, %esp
        ret

If the function is void, gcc uses a movq load of the function arg (causing a
store-forwarding stall if the caller used narrow stores).  If it weren't for
that, it would probably be optimal to load the value twice:

        movq    8(%esp), %xmm0
        movl    4(%esp), %ecx
        movl    8(%esp), %eax    # return value
        movl    12(%esp), %edx
        vmovq   %xmm0, (%ecx)
        ret

Loads are cheap: AMD since K8 and Intel since SnB can execute two loads per
clock.  Unless we bottleneck on load uops (not the same thing as memory
bandwidth), other CPUs like Silvermont and Nehalem will probably do well with
this too.  (Again, except for the store-forwarding issue when the caller wrote
the args with narrow stores.)
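For completeness, the store side of the suggestion looks like this as an
intrinsics sketch (again only illustrating the data movement; the name is made
up, and a real std::atomic release store also has to prevent the compiler from
reordering earlier memory operations past it):

#include <immintrin.h>
#include <atomic>
#include <stdint.h>

void store64_sketch(std::atomic<int64_t> *p, int64_t a) {
  // movq load of the 64-bit value (or keep it in an xmm reg if it's already
  // there), then one movq store: no narrow stores for a later 64-bit load to
  // stall on.
  __m128i v = _mm_loadl_epi64((const __m128i *)&a);
  _mm_storel_epi64((__m128i *)p, v);
}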