https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #5)
> There's some related bugs. I think there is no part of the compiler that
> specifically tries to avoid store forwarding issues.

Ideally the compiler would keep track of how stores were done (or how they were likely done, for code that's not visible), but without that:

For data coming from integer registers, a pure ALU strategy (movd/movq plus punpck or pinsrd/pinsrq) should be a win on all CPUs over narrow-store -> wide-load. The exceptions are maybe setup for long loops, where small code / fewer uops is a win, or cases where other work hides the latency of a store-forwarding stall. Another exception is maybe -mtune=atom without SSE4. (But ALU isn't bad there either, so adding a special case just for old in-order Atom might not make sense.)

---

For data starting in memory, a simple heuristic might be: vector loads wider than a single integer register are good for arrays, or for anything other than scalar locals / args on the stack that happen to be contiguous. We need to make sure such a heuristic never results in auto-vectorizing with movd/pinsrd loads from arrays instead of movdqu. However, it might be appropriate to use a movd/pinsrd strategy for _mm_set_epi32 even when the data happens to be contiguous in memory; in that case, the programmer can use _mm_loadu_si128 instead (with a struct or array to guarantee adjacency). See the example at the end of this comment.

It's less clear what to do about int64_t in 32-bit mode, though, without a good mechanism to track how it was recently written. Always using movd/pinsrd for locals / args is not horrible, but would suck for structs in memory when the programmer is assuming they'll get an efficient MOVQ/MOVHPS. A function that takes a read-write int64_t *arg might often get called right after the pointed-to data is written. In 32-bit code we need it in integer registers to do anything but copy it; if we're just copying it somewhere else, hopefully a store-forwarding stall isn't part of the critical path.

I'm not sure how long it takes for a store to complete, after which it no longer needs to be forwarded. The store buffer can't commit stores to L1 until they retire (and then it has to commit them in program order, to preserve x86 memory ordering), so even passing a pointer on the stack (a store/reload with successful forwarding) probably isn't nearly enough latency for a pair of stores in the caller to actually be committed to L1.

A store-forwarding "stall" doesn't actually stall the whole pipeline, or even unrelated memory ops, AFAIK. My understanding is that it just adds latency to the affected load while out-of-order execution continues as normal. There may be throughput limits on how many failed-store-forwarding loads can be in flight at once: I think the slow path works by scanning the store buffer for all overlapping stores, when the last store that wrote any of the loaded bytes can't use the forwarding fast case (either because of sub-alignment restrictions on how the load lines up inside the store, or because of only partial overlap). It doesn't have to drain the store buffer, though. Obviously every uarch can have its own quirks, but this seems the most likely explanation for a latency penalty that's a constant number of cycles.

AFAIK the store-forwarding stall penalty can't start until the load address is ready, since no major x86 CPUs do address prediction for loads. So the 6 + 10c latency for an SSE load on SnB with failed store-forwarding would be counted from when the address becomes ready to when the value becomes ready. I might be mistaken, though.
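To make the read-write int64_t* case above concrete, here's a minimal sketch of the pattern I have in mind (hypothetical names, -m32 assumed; whether the callee's copy gets done with an xmm MOVQ load or with two 32-bit integer loads is exactly the codegen choice in question):

#include <stdint.h>

int64_t sink;

/* The callee only sees an int64_t*.  If the copy is done with an 8-byte
   xmm MOVQ load, that load overlaps both of the caller's 4-byte stores
   and forwarding fails; two 32-bit integer loads would each forward fine. */
__attribute__((noinline)) void consume(int64_t *p)
{
    sink = *p;
}

void produce_then_consume(int32_t lo, int32_t hi)
{
    /* In -m32 the value is assembled from two integer registers and
       written to the stack with two 4-byte stores... */
    int64_t tmp = (uint32_t)lo | ((int64_t)hi << 32);
    /* ...and the callee's 8-byte reload happens almost immediately
       after them, with the pointer passed as a stack arg. */
    consume(&tmp);
}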
Maybe it helps if the store executed several cycles before the load address was ready; in that case, 32-bit code using a MOVQ xmm load on an int64_t* won't suffer as badly if it got the pointer from a stack arg and did some other work before the load.
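And going back to the _mm_set_epi32 point from earlier: a sketch of the source-level distinction (the struct and function names are just for illustration), assuming the programmer knows whether the four ints really are adjacent in memory:

#include <emmintrin.h>

struct four_ints { int a, b, c, d; };

__m128i gather_scalars(int e0, int e1, int e2, int e3)
{
    /* Elements start out in integer registers: an ALU strategy
       (movd + punpck, or movd + pinsrd with SSE4.1) avoids any
       narrow-store -> wide-load store-forwarding risk. */
    return _mm_set_epi32(e3, e2, e1, e0);
}

__m128i load_adjacent(const struct four_ints *p)
{
    /* Elements are known to be contiguous in memory: say so with a
       single (possibly unaligned) 16-byte load instead of
       _mm_set_epi32(p->d, p->c, p->b, p->a). */
    return _mm_loadu_si128((const __m128i *)p);
}

In other words, if the code really means "load 16 contiguous bytes", _mm_loadu_si128 expresses that directly, and _mm_set_epi32 stays available for the truly scattered case where an insert chain is the right answer.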