On Fri, Aug 05, 2022 at 03:04:34PM -0700, Andres Freund wrote:
> But mainly I'd expect to find a difference if the SIMD code were optimized a
> further on the basis of not needing to return the offset. E.g. by
> replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper.
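
For readers following along, the reductions under discussion can be sketched roughly as below. This is a minimal illustration, not the code from the patch: the function names (cmp4, contains_packs, contains_or, contains_mixed) are hypothetical, and it assumes an x86-64 target with SSE2 and a block of exactly 16 uint32 values.

```c
#include <emmintrin.h>			/* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: compare four uint32 values at p against the key.
 * Each 32-bit lane of the result is all-ones on a match, zero otherwise. */
static inline __m128i
cmp4(__m128i keys, const uint32_t *p)
{
	return _mm_cmpeq_epi32(keys, _mm_loadu_si128((const __m128i *) p));
}

/* Variant 1: narrow the four compare results with saturating packs. */
bool
contains_packs(uint32_t key, const uint32_t *base)
{
	const __m128i keys = _mm_set1_epi32((int) key);
	__m128i		t1 = _mm_packs_epi32(cmp4(keys, base), cmp4(keys, base + 4));
	__m128i		t2 = _mm_packs_epi32(cmp4(keys, base + 8), cmp4(keys, base + 12));
	__m128i		r = _mm_packs_epi16(t1, t2);

	return _mm_movemask_epi8(r) != 0;
}

/* Variant 2: fold the four compare results together with bitwise OR. */
bool
contains_or(uint32_t key, const uint32_t *base)
{
	const __m128i keys = _mm_set1_epi32((int) key);
	__m128i		t1 = _mm_or_si128(cmp4(keys, base), cmp4(keys, base + 4));
	__m128i		t2 = _mm_or_si128(cmp4(keys, base + 8), cmp4(keys, base + 12));
	__m128i		r = _mm_or_si128(t1, t2);

	return _mm_movemask_epi8(r) != 0;
}

/* Variant 3: keep the 32-bit packs, replace only the final 16-bit pack. */
bool
contains_mixed(uint32_t key, const uint32_t *base)
{
	const __m128i keys = _mm_set1_epi32((int) key);
	__m128i		t1 = _mm_packs_epi32(cmp4(keys, base), cmp4(keys, base + 4));
	__m128i		t2 = _mm_packs_epi32(cmp4(keys, base + 8), cmp4(keys, base + 12));
	__m128i		r = _mm_or_si128(t1, t2);

	return _mm_movemask_epi8(r) != 0;
}
```

All three return the same yes/no answer. Only the first leaves one mask bit per input element in the _mm_movemask_epi8 result, which matters only if the caller needs the offset of the match.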

I haven't been able to find a significant difference between the two.  If
anything, the _mm_packs_epi* approach actually seems to be slightly faster
in some cases.  For something marginally more concrete, I compared the two
in perf-top and saw the following for the relevant instructions:

_mm_packs_epi*:
    0.19 │       packssdw   %xmm1,%xmm0
    0.62 │       packssdw   %xmm1,%xmm0
    7.14 │       packsswb   %xmm1,%xmm0

_mm_or_si128:
    1.52 │       por        %xmm1,%xmm0
    2.05 │       por        %xmm1,%xmm0
    5.66 │       por        %xmm1,%xmm0

I also tried a combined approach where I replaced _mm_packs_epi16 with
_mm_or_si128:
    1.16 │       packssdw   %xmm1,%xmm0
    1.47 │       packssdw   %xmm1,%xmm0
    8.17 │       por        %xmm1,%xmm0

Of course, this simplistic analysis leaves out the impact of the
surrounding instructions, but it seems to support the idea that the
_mm_packs_epi* approach might have a slight edge.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com