Issue 136519
Summary [X86] Use `vpinsrq` in building 2-element vector of 64-bit int loads
Labels new issue
Assignees
Reporter dzaima
When building a vector of two 64-bit elements from two loads, clang currently emits two separate loads and then packs them together, e.g. this code:

```c
#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
    return (u64x2){*a, *b};
}

__m128i intrinsics(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    return _mm_insert_epi64(lo, *b, 1);
}

__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    __m128i t = _mm_insert_epi64(lo, *b, 1);
    return _mm_add_epi64(t, t);
}
```
compiled with `-O3 -march=haswell` produces:
```asm
generic_int:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics_int_domain:
        vmovq   xmm0, qword ptr [rsi]
        vmovq   xmm1, qword ptr [rdi]
        vpunpcklqdq     xmm0, xmm1, xmm0
        vpaddq  xmm0, xmm0, xmm0
        ret
```

even though the load of `b` could be folded into the packing instruction: `vpinsrq` for the integer domain, or `vmovhps` when the domain is unspecified and preferring float is desired, i.e.:
```asm
vmovq  xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1
```

Additionally, per uops.info data, post-Ice Lake, `vpinsrq` has higher throughput than `vmovhps`, and in some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either in any direction, so it could make sense to always use `vpinsrq` and never `vmovhps` (or at least on the applicable targets).
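
For comparison with the `vpinsrq` form, the `vmovhps` alternative mentioned above can be sketched with intrinsics roughly as follows. This is only an illustration: the names `build_u64x2_movhps` and `lanes_match` are made up for this example, and nothing here guarantees that a given compiler version actually selects `movhps`/`vmovhps` for it.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_loadh_pi */

/* Hypothetical helper: build {*a, *b} with a 64-bit load for the low lane
 * and a load-into-high-half for the high lane. */
static inline __m128i build_u64x2_movhps(const uint64_t *a, const uint64_t *b) {
    /* 64-bit load of *a into the low lane, upper lane zeroed (movq form). */
    __m128i lo = _mm_loadl_epi64((const __m128i *)a);
    /* _mm_loadh_pi is the intrinsic counterpart of movhps: it keeps the low
     * 64 bits and loads 64 bits from memory into the high half. */
    __m128 merged = _mm_loadh_pi(_mm_castsi128_ps(lo), (const __m64 *)b);
    /* The casts only reinterpret bits; they generate no instructions. */
    return _mm_castps_si128(merged);
}

/* Hypothetical helper: check both lanes of v against expected values. */
static inline int lanes_match(__m128i v, uint64_t lo, uint64_t hi) {
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0] == lo && out[1] == hi;
}
```

Calling `build_u64x2_movhps(&x, &y)` should yield the same lanes as `(u64x2){x, y}`; whether the backend keeps the load folded is exactly the codegen question this issue is about.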