https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #5)
> There's some related bugs. I think there is no part of the compiler that
> specifically tries to avoid store forwarding issues.

Ideally the compiler would keep track of how stores were done (or how they were likely done, for code that's not visible), but without that:

For data coming from integer registers, a pure ALU strategy (movd/movq plus punpck or pinsrd/pinsrq) should be a win on all CPUs over narrow-store -> wide-load. The exceptions are maybe setup for long loops, where small code / fewer uops is a win, or cases where other work hides the latency of a store-forwarding stall. Another exception is maybe -mtune=atom without SSE4. (But ALU isn't bad there either, so adding a special case just for old in-order Atom might not make sense.)

---

For data starting in memory, a simple heuristic might be: vector loads wider than a single integer register are good for arrays, or for anything other than scalar locals / args on the stack that happen to be contiguous. We need to make sure such a heuristic never results in auto-vectorizing with movd/pinsrd loads from arrays instead of movdqu. However, it might be appropriate to use a movd/pinsrd strategy for _mm_set_epi32 even when the data happens to be contiguous in memory; in that case, the programmer can use _mm_loadu_si128 instead (with a struct or array to guarantee adjacency). See the example at the end of this comment.

It's less clear what to do about int64_t in 32-bit mode, though, without a good mechanism to track how it was recently written. Always using movd/pinsrd for locals / args is not horrible, but would suck for structs in memory when the programmer is assuming they'll get an efficient MOVQ/MOVHPS. A function that takes a read-write int64_t *arg might often get called right after the pointed-to data is written. In 32-bit code we need it in integer registers to do anything but copy it; if we're just copying it somewhere else, hopefully a store-forwarding stall isn't part of the critical path.

I'm not sure how long it takes for a store to complete, after which it no longer needs to be forwarded. The store buffer can't commit stores to L1 until they retire (and then it has to commit them in program order, to preserve x86 memory ordering), so even passing a pointer on the stack (a store/reload with successful forwarding) probably isn't nearly enough latency for a pair of stores in the caller to actually be committed to L1.

A store-forwarding "stall" doesn't actually stall the whole pipeline, or even unrelated memory ops, AFAIK. My understanding is that it just adds latency to the affected load while out-of-order execution continues as normal. There may be throughput limits on how many failed-store-forwarding loads can be in flight at once: I think the slow path works by scanning the store buffer for all overlapping stores, when the last store that wrote any of the loaded bytes can't use the forwarding fast case (either because of sub-alignment restrictions on how the load lines up inside the store, or because of only partial overlap). It doesn't have to drain the store buffer, though. Obviously every uarch can have its own quirks, but this seems the most likely explanation for a latency penalty that's a constant number of cycles.

AFAIK the store-forwarding stall penalty can't start until the load address is ready, since no major x86 CPUs do address prediction for loads. So the 6 + 10c latency for an SSE load on SnB with failed store-forwarding would be counted from when the address becomes ready to when the value becomes ready. I might be mistaken, though.
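To make the read-write int64_t* case above concrete, here's a minimal sketch of the pattern I have in mind (hypothetical names, -m32 assumed; whether the callee's copy gets done with an xmm MOVQ load or with two 32-bit integer loads is exactly the codegen choice in question):

#include <stdint.h>

int64_t sink;

/* The callee only sees an int64_t*.  If the copy is done with an 8-byte
   xmm MOVQ load, that load overlaps both of the caller's 4-byte stores
   and forwarding fails; two 32-bit integer loads would each forward fine. */
__attribute__((noinline)) void consume(int64_t *p)
{
    sink = *p;
}

void produce_then_consume(int32_t lo, int32_t hi)
{
    /* In -m32 the value is assembled from two integer registers and
       written to the stack with two 4-byte stores... */
    int64_t tmp = (uint32_t)lo | ((int64_t)hi << 32);
    /* ...and the callee's 8-byte reload happens almost immediately
       after them, with the pointer passed as a stack arg. */
    consume(&tmp);
}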
Maybe it helps if the store executed several cycles before the load address was ready; in that case, 32-bit code using a MOVQ xmm load on an int64_t* won't suffer as badly if it got the pointer from a stack arg and did some other work before the load.
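And going back to the _mm_set_epi32 point from earlier: a sketch of the source-level distinction (the struct and function names are just for illustration), assuming the programmer knows whether the four ints really are adjacent in memory:

#include <emmintrin.h>

struct four_ints { int a, b, c, d; };

__m128i gather_scalars(int e0, int e1, int e2, int e3)
{
    /* Elements start out in integer registers: an ALU strategy
       (movd + punpck, or movd + pinsrd with SSE4.1) avoids any
       narrow-store -> wide-load store-forwarding risk. */
    return _mm_set_epi32(e3, e2, e1, e0);
}

__m128i load_adjacent(const struct four_ints *p)
{
    /* Elements are known to be contiguous in memory: say so with a
       single (possibly unaligned) 16-byte load instead of
       _mm_set_epi32(p->d, p->c, p->b, p->a). */
    return _mm_loadu_si128((const __m128i *)p);
}

In other words, if the code really means "load 16 contiguous bytes", _mm_loadu_si128 expresses that directly, and _mm_set_epi32 stays available for the truly scattered case where an insert chain is the right answer.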