Issue | 136519
Summary | [X86] Use `vpinsrq` in building 2-element vector of 64-bit int loads
Labels | new issue
Assignees |
Reporter | dzaima
When building a two-element vector of 64-bit values, clang currently does separate loads and packs them together, e.g. this code:
```c
#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
return (u64x2){*a, *b};
}
__m128i intrinsics(uint64_t* a, uint64_t* b) {
__m128i lo = _mm_loadu_si64(a);
return _mm_insert_epi64(lo, *b, 1);
}
__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
__m128i lo = _mm_loadu_si64(a);
__m128i t = _mm_insert_epi64(lo, *b, 1);
return _mm_add_epi64(t, t);
}
```
via `-O3 -march=haswell` compiles to:
```asm
generic_int:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret
intrinsics:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret
intrinsics_int_domain:
vmovq xmm0, qword ptr [rsi]
vmovq xmm1, qword ptr [rdi]
vpunpcklqdq xmm0, xmm1, xmm0
vpaddq xmm0, xmm0, xmm0
ret
```
even though the load of `b` could be folded into the packing: via `vpinsrq` for the integer domain, or via `vmovhps` for the unspecified domain if preferring float is desired, i.e.:
```asm
vmovq xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1
```
Additionally, per uops.info data, on Ice Lake and later `vpinsrq` has higher throughput than `vmovhps`, and via some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either in any direction, so it could make sense to always use `vpinsrq` and never `vmovhps` (or at least on the applicable targets).
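As a side note, until this is fixed in the backend, the merge-load pattern can already be expressed directly from source: `_mm_loadh_pd` takes a memory operand and maps to `vmovhps`, so compilers emit the two-instruction sequence today. This is a minimal sketch (the function name `pack_loads` is mine, not from the report; the `uint64_t*`-to-`double*` casts technically skirt strict aliasing, which real code may want to avoid via `memcpy` or a `may_alias` type):

```c
#include <stdint.h>
#include <immintrin.h>

// Hypothetical workaround: load *a into the low half, then merge-load *b
// into the high half with _mm_loadh_pd (vmovhps m64 operand), avoiding a
// separate load + vmovlhps/vpunpcklqdq pack.
__m128i pack_loads(const uint64_t* a, const uint64_t* b) {
    __m128d lo   = _mm_load_sd((const double*)a);       // low 64 bits = *a
    __m128d both = _mm_loadh_pd(lo, (const double*)b);  // high 64 bits = *b
    return _mm_castpd_si128(both);                      // reinterpret as int
}
```

This stays in the float domain; there is no equivalent single intrinsic that guarantees the memory-operand `vpinsrq` form, since `_mm_insert_epi64` takes a register value.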
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs