https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

            Bug ID: 80833
           Summary: 32-bit x86 causes store-forwarding stalls for int64_t
                    -> xmm
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i?86-*-*

This affects 64-bit atomic loads/stores, as well as _mm_set_epi64x intrinsics.

gcc -m32 copies int64_t data into xmm registers with the worst possible
strategy: two 32-bit stores and a 64-bit vector load, causing a
store-forwarding stall.

Similarly, for getting 64-bit integers back out of xmm registers, gcc's
store/reload strategy may not be optimal on Intel CPUs.  (But it doesn't cause
a store-forwarding stall: narrow reloads of a wider store do forward.)


#include <immintrin.h>
#include <stdint.h>
__m128i combine64(int64_t a, int64_t b) {
  return _mm_set_epi64x(b, a|1);
}

b is loaded directly from memory, but the low half of `a` is modified so it
can't be, letting us observe gcc's int64->xmm strategy.

gcc8-snapshot -m32 -march=haswell -mno-avx -O3 emits:

        subl    $28, %esp
        movl    32(%esp), %eax
        movl    36(%esp), %edx   # the untouched upper half of `a`
        orl     $1, %eax

        # `a` is in integer regs.  The next three lines are gcc's typical pattern for int64 -> xmm
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        movq   8(%esp), %xmm0    # guaranteed store-forwarding stall, except with -mtune=atom

        movhps 40(%esp), %xmm0   # store-forwarding stall if the caller used scalar stores.
        addl    $28, %esp
        ret

A slight improvement would be to  orl $1, 4(%esp)  to do a|1 in-place, instead
of copying it.  But that still has an extra load-store round-trip, so it's not
good even on Atom where it wouldn't cause a store-forwarding stall.
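
For reference, a minimal sketch of that in-place variant (offsets assuming no
stack frame; it still has the load-store round trip, and the SF stall
everywhere except Atom):

        orl     $1, 4(%esp)      # a |= 1 in place, in our own incoming arg slot
        movq    4(%esp), %xmm0   # reloads the narrow store: store-forwarding stall except on Atom
        movhps  12(%esp), %xmm0  # merge in `b`
        ret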

-----

For data coming from integer registers, clearly we should use whatever strategy
is optimal for _mm_set_epi32, as discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820.  movd / pinsrd should be
good for -mtune=intel and for Ryzen.
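
As a sketch, with the halves of a|1 in eax:edx as in the code above (SSE4.1
pinsrd assumed), the low qword could be built without touching memory:

        movd    %eax, %xmm0           # low 32 bits of a|1
        pinsrd  $1, %edx, %xmm0       # high 32 bits: no store/reload, no SF stall
        # then merge `b`, e.g. with movhps or pinsrd $2 / pinsrd $3 from memory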

(But in a case like this, with one simple integer operation on data starting in
memory, avoiding integer registers entirely by doing that operation with a
vector-integer instruction is worth considering, see below).

----

For integers that are coming from memory, it matters a lot whether we expect
them to have been written recently with narrow stores, except with
-mtune=atom, which doesn't stall on narrow stores -> wide load.

e.g. -mtune=atom could use

        mov        $1, %eax
        movd     %eax, %xmm0   # or load the vector constant from memory
        por   4(%esp), %xmm0
        ret

If the two integers aren't adjacent, movq / movhps is good.  Other CPUs can use
this, too, when reading memory that we expect wasn't recently written with
narrow stores.
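
A sketch of that movq / movhps pattern (it doesn't care whether the two qwords
are adjacent), ignoring the a|1 step:

        movq    4(%esp), %xmm0    # low qword:  `a` (64-bit load, fine if not recently stored in halves)
        movhps  12(%esp), %xmm0   # high qword: `b`
        ret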

Doing 64-bit integer ops in vector regs is an even bigger win for ops with
carry from low to high, like shift or add.  For example, if it were `a++`
instead of `a|1`:

        movq    4(%esp), %xmm0   # load `a`
        pcmpeqd %xmm1,%xmm1
        psubq   %xmm1,%xmm0      # a -= -1.  1 uop vs. 3 for add/adc on Haswell
        movhps  12(%esp), %xmm0  # merge in `b`
        ret

If we were worried about store-forwarding stalls, then get `a` into xmm0 in two
halves before the psubq.
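
A sketch of that variant, building `a` from two 32-bit loads so no wide load
touches recently-stored halves (SSE4.1 assumed for pinsrd):

        movd    4(%esp), %xmm0       # low half of `a`
        pinsrd  $1, 8(%esp), %xmm0   # high half of `a`: two narrow loads, no SF stall
        pcmpeqd %xmm1, %xmm1
        psubq   %xmm1, %xmm0         # a -= -1, i.e. a++
        movhps  12(%esp), %xmm0      # merge `b` (or build it the same way if it might also stall)
        ret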

Scalar 64-bit integer ops in vector regs may be useful in general in 32-bit
code in some cases, especially if it helps with register pressure.

----

For function args, especially near the start of a function, we should assume
that anything other than separate 32-bit loads will cause a store-forwarding
stall, incurring an extra 10 to 12 cycles of latency beyond the usual
store-forwarding latency, depending on uarch.  i.e. very expensive, and worth
spending a lot of instructions to avoid if it's at all likely, maybe even if
not part of a long dep chain (since out-of-order execution may have trouble
hiding that much latency).

clang4.0 -O3 -march=sandybridge -mno-avx emits this, which is very good if we
have to assume that the function args were recently stored in 32-bit chunks:

        movl    4(%esp), %eax
        orl     $1, %eax
        movd    %eax, %xmm0
        pinsrd  $1, 8(%esp), %xmm0   # taking the unmodified half of `a` directly from the original memory location is a good optimization
        pinsrd  $2, 12(%esp), %xmm0
        pinsrd  $3, 16(%esp), %xmm0
        retl

pinsrd with a memory source is 2 fused-domain uops, but only one of them is an
ALU uop (for the shuffle port).  The other is the load.  It never micro-fuses.

gcc's usual _mm_set_epi32 strategy of doing two 64-bit halves and merging with
a shuffle would also work well with memory source data.
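
A sketch of that pattern with memory sources (shown for plain
_mm_set_epi64x(b, a), leaving out the a|1 step; SSE4.1 assumed for pinsrd):

        movd    4(%esp), %xmm0
        pinsrd  $1, 8(%esp), %xmm0    # `a` from two 32-bit loads
        movd    12(%esp), %xmm1
        pinsrd  $1, 16(%esp), %xmm1   # `b` from two 32-bit loads
        punpcklqdq %xmm1, %xmm0       # merge the two 64-bit halves
        ret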



For CPUs where int->xmm is not fast, doing the OR (or whatever other ALU
operation) with a vector instruction is even more attractive than on Intel,
even if we still have to load 32 bits at a time.

It's also good on Intel CPUs if we can hoist the vector constant out of a loop,
since int->xmm needs a port5 uop, which competes with shuffles, especially on
Haswell and later, which have only 1 shuffle unit.


        movd    4(%esp), %xmm0       # a (low half)

        movd    12(%esp), %xmm2      # b
        pinsrd  $1, 16(%esp), %xmm2

        pcmpeqd %xmm1,%xmm1
        psrld   $31, %xmm1           # or load a constant from memory
        por     %xmm1, %xmm0         # a |= 1

        pinsrd  $1,  8(%esp), %xmm0  # then merge the high half of `a`, replacing the garbage in element 1

        punpcklqdq %xmm2,%xmm0
        retl

This only has 3 port5 uops, and has a latency on Haswell of 3 cycles from the
first 2 loads being ready.  Since out-of-order CPUs typically run uops in
oldest-ready order (with adjustments for higher-latency uops to avoid writeback
conflicts), I scheduled this so the two movd loads are first, allowing the
three port5 uops to run in three consecutive cycles.  (First the pinsrd of the
high half of b, since the two load uops from the pinsrd instructions should be
ready the cycle after the two movd uops.)  I have no idea if this really would
help avoid extra resource-conflict latency for the critical path, but it can't
hurt.



----------


This also affects 64-bit atomic stores and loads.

#include <atomic>
#include <stdint.h>

int64_t load64(std::atomic<int64_t> *p) {
  return p->load(std::memory_order_acquire) + 1;
}

gcc8 -m32 -mno-avx -march=haswell -O3

        subl    $12, %esp
        movl    16(%esp), %eax   # function arg
        movq    (%eax), %xmm0    # 64-bit atomic load

        movq    %xmm0, (%esp)    # gcc's store/reload strategy
        movl    (%esp), %eax
        movl    4(%esp), %edx

        addl    $1, %eax         # a++ in integer regs
        adcl    $0, %edx
        addl    $12, %esp
        ret

It would be cheaper to do a++ with paddq or psubq while we still have the value
in xmm0, even counting the cost of generating the constant on the fly.  That
takes 1 ALU uop for pcmpeqd, and is off the critical path.  ADC is 2 uops on
Intel CPUs before Broadwell.
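
A sketch of that for load64, assuming SSE4.1 and using the lower-latency
xmm->int extraction discussed next:

        movl    4(%esp), %eax    # p
        movq    (%eax), %xmm0    # 64-bit atomic load
        pcmpeqd %xmm1, %xmm1     # all-ones constant, off the critical path
        psubq   %xmm1, %xmm0     # a -= -1, i.e. a + 1
        movd    %xmm0, %eax      # low half of the return value
        pextrd  $1, %xmm0, %edx  # high half (SSE4; see alternatives below)
        ret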

A lower-latency xmm->int strategy would be:

        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx

Or without SSE4, for -mtune=sandybridge (or anything that excludes Nehalem and
other CPUs where an FP shuffle has a bypass delay between integer ops):

        movd     %xmm0, %eax
        movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
        movd     %xmm0, %edx

Or without SSE3,

        movd     %xmm0, %eax
        psrldq   $4,  %xmm0    # 1 m-op cheaper than pshufd on K8
        movd     %xmm0, %edx


movd xmm->r32 is efficient on K10 (1 m-op with 3c latency), unlike movd
r32->xmm.

On Bulldozer-family, it's only 1 m-op, and has 8c latency (or 4 on
Steamroller).  Store-forwarding latency is high on Bulldozer, so movd %xmm0,
%eax / pextrd is probably a win.

So avoiding a store/reload is probably a good strategy for -mtune=generic, and
-mtune=bdver*.

-mtune=k8 should probably store/reload for this direction, too, because movd
%xmm0,%eax is 3 m-ops with 2c latency.  And it can do 2 loads per clock, and I
think store-forwarding both halves of an 8-byte load works.


------

Atomic stores are more of a problem, since two 32b stores can't store-forward
to a 64b load (except on Atom).


int64_t store64(std::atomic<int64_t> *p, int64_t a) {
  p->store(a, std::memory_order_release);
  return a;  // tempt gcc into loading into integer regs instead of movq
}

gcc -m32 -march=sandybridge -O3 emits the following, for gcc4.7 through
gcc8-snapshot (other than a regression to fildq/fistpq with gcc4.8):

        subl    $12, %esp
        movl    20(%esp), %eax
        movl    24(%esp), %edx
        movl    16(%esp), %ecx

        # at this point, the int64 and the pointer are in integer regs, as would be typical in a real function.
        # The next three lines are the issue
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        vmovq   (%esp), %xmm0   # store-forwarding failure

        vmovq   %xmm0, (%ecx)
        addl    $12, %esp
        ret

If the function is void, then gcc uses a movq load of the function arg (causing
a store-forwarding stall if the caller used narrow stores).

If it weren't for that, it would probably be optimal to load it twice:

        movq    8(%esp), %xmm0
        movl    4(%esp), %ecx

        movl    8(%esp), %eax   # return value
        movl    12(%esp), %edx

        vmovq   %xmm0, (%ecx)
        ret

Loads are cheap: AMD since K8 and Intel since SnB can execute two loads per
clock.  Unless we bottleneck on load uops (not the same thing as memory
bandwidth), other CPUs like Silvermont and Nehalem will probably do well with
this, too.  (Again, except for the store-forwarding issue from the caller
writing the args).
