https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #14 from Peter Cordes <peter at cordes dot ca> ---
I happened to look at this old bug again recently.

Re: extracting the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4, -mtune=sandybridge (anything that excludes Nehalem and
> > other CPUs where an FP shuffle has a bypass delay between integer ops):
> >
> >     movd      %xmm0, %eax
> >     movshdup  %xmm0, %xmm0   # saves 1B of code-size vs. psrldq, I think.
> >     movd      %xmm0, %edx
> >
> > Or without SSE3:
> >
> >     movd      %xmm0, %eax
> >     psrldq    $4, %xmm0      # 1 m-op cheaper than pshufd on K8
> >     movd      %xmm0, %edx
>
> The above two proposals are not suitable for generic moves.  We should
> not clobber the input value, and we are not allowed to use a temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit
elements, so

    movshdup  %xmm0, %xmm1
    movd      %xmm1, %eax

saves a byte of code size vs. pshufd / movd, and saves a uop on Merom and
avoids a flt->int bypass delay.  (According to Agner Fog's tables, pshufd
is flt->int domain, i.e. it wants its input in the float domain, while
movshdup ironically is only an integer shuffle.)  Both non-clobbering
sequences are spelled out at the end of this comment.  Probably not worth
looking for that optimization, though, because it's not worth using
universally (Nehalem has worse latency for float shuffles between int
instructions).

With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom /
K8 (slow-shuffle CPUs where PSHUFD is multiple uops); see the sketch
below.  It's not slower on any current CPU.  I could imagine some future
CPU having better throughput for shuffles with 32-bit elements than with
16-bit elements, though.  That's already the case for wider lane-crossing
shuffles (VPERMW ymm is multiple uops on Skylake-AVX512).

This would be a definite win for -mtune=core2 or -mtune=k8, and Pentium M,
but those are so old it's probably not worth adding extra code to look for
it.  I think it's pretty future-proof, though, unless Intel or AMD add an
extra shuffle unit for element sizes of 32 bits or wider on another port.
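
For concreteness, a minimal sketch of the non-clobbering SSE3 sequence vs.
the SSE2 pshufd version, assuming the input arrives in %xmm0 (and must be
preserved) and %xmm1 is available as a destination:

    # SSE3: movshdup takes no imm8, so it's 4 bytes (F3 0F 16 /r)
    movd      %xmm0, %eax            # element 0 -> eax
    movshdup  %xmm0, %xmm1           # element 1 -> low dword of xmm1
    movd      %xmm1, %edx            # element 1 -> edx

    # SSE2: pshufd needs an imm8, so it's 5 bytes (66 0F 70 /r ib)
    movd      %xmm0, %eax
    pshufd    $0xe5, %xmm0, %xmm1    # elements {1,1,2,3}: element 1 -> low dword
    movd      %xmm1, %edx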
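
And the SSE2-only PSHUFLW variant, the same 5-byte size as PSHUFD (F2 0F 70
/r ib) but fewer uops on slow-shuffle CPUs.  Imm8 0xee copies words 2 and 3
down to words 0 and 1, putting element 1 in the low dword:

    movd      %xmm0, %eax            # element 0 -> eax
    pshuflw   $0xee, %xmm0, %xmm1    # low words {2,3,2,3}: element 1 -> low dword
    movd      %xmm1, %edx            # element 1 -> edx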