https://bugs.llvm.org/show_bug.cgi?id=41512
Bug ID: 41512
Summary: Conversion from int to XMM is handled inefficiently on
SSE4
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedb...@nondot.org
Reporter: spr...@yandex-team.ru
CC: craig.top...@gmail.com, llvm-bugs@lists.llvm.org,
llvm-...@redking.me.uk, spatel+l...@rotateright.com
Created attachment 21786
--> https://bugs.llvm.org/attachment.cgi?id=21786&action=edit
Proposed fix
In an attempt to switch all our builds from SSSE3 to SSE4, we found that code
as simple as
const __m128i lo = _mm_cvtsi32_si128(d0[value]);
const __m128i hi = _mm_cvtsi32_si128(d0[value+1024]);
val = _mm_add_epi64(val, _mm_unpacklo_epi64(lo, hi));
or
const __m128i all = _mm_set_epi32(0, d0[value], 0, d0[value+1024]);
val = _mm_add_epi64(val, all);
performs worse, once inlined into a loop, when compiled with -msse4.1 than with
just -mssse3.
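For anyone who wants to reproduce this outside our code base, the two snippets
above can be wrapped roughly as follows. This is only a sketch: the table d0,
its element type and size, and the function names accumulate_unpack,
accumulate_set and sum_pairs are assumptions made for illustration and are not
taken from the real code. Every intrinsic used is SSE2, so the same source can
be built with either target flag and the generated loops compared.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 provides every intrinsic used below

// Hypothetical stand-in for the table `d0`; element type and size are assumed.
static uint32_t d0[2048];

// First form from the report: two scalar-to-vector moves joined by an unpack.
// Resulting 64-bit lanes: low = d0[value], high = d0[value + 1024].
inline __m128i accumulate_unpack(__m128i val, unsigned value) {
    const __m128i lo = _mm_cvtsi32_si128(static_cast<int>(d0[value]));
    const __m128i hi = _mm_cvtsi32_si128(static_cast<int>(d0[value + 1024]));
    return _mm_add_epi64(val, _mm_unpacklo_epi64(lo, hi));
}

// Second form from the report. _mm_set_epi32 lists elements from highest to
// lowest, so here the low 64-bit lane holds d0[value + 1024] and the high
// lane holds d0[value], i.e. the lanes are swapped relative to the first form.
inline __m128i accumulate_set(__m128i val, unsigned value) {
    const __m128i all = _mm_set_epi32(0, static_cast<int>(d0[value]),
                                      0, static_cast<int>(d0[value + 1024]));
    return _mm_add_epi64(val, all);
}

// Hypothetical hot loop: the slowdown shows up once the helper is inlined here.
inline void sum_pairs(const unsigned* idx, int n, uint64_t out[2]) {
    __m128i val = _mm_setzero_si128();
    for (int i = 0; i < n; ++i) {
        assert(idx[i] < 1024);  // keeps value + 1024 inside the assumed table
        val = accumulate_unpack(val, idx[i]);
    }
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), val);
}
```

Swapping accumulate_unpack for accumulate_set inside sum_pairs exercises the
second form; only the lane order of the result differs.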
The problem is that _mm_cvtsi32_si128() and _mm_set_epi32() are both modeled via
INSERT_VECTOR_ELT, and
%13 = insertelement <4 x i32> <i32 undef, i32 0, i32 undef, i32 0>, i32 %12, i32 0, !dbg !287
is lowered to a single movd instruction prior to SSE4, but to xor+pinsrd on SSE4.
https://gcc.godbolt.org/z/qY8nkO
* Notice that in the kernel function in the 2nd case there are a couple of
movd's, but when it is used in a loop the result is a pair of pinsrd from memory
into the same register.
This seems to me like poor instruction selection from both the performance and
the code size standpoints.
I suggest steering instruction selection for this idiomatic case of
INSERT_VECTOR_ELT to SCALAR_TO_VECTOR. This will directly lead to movd
emission.
Proposed change to lib/Target/X86/X86ISelLowering.cpp is attached.
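To make the "idiomatic case" concrete, here is a standalone model of the kind
of condition I have in mind. It is not the attached patch and uses no LLVM
API: Lane and can_use_scalar_move are hypothetical names, and the exact check
in the patch may differ. The assumption is that movd writes the scalar to
element 0 and zeroes the remaining elements, so it can stand in for an insert
at index 0 whenever every other element of the base vector is zero or undef.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one element of a constant base vector.
struct Lane {
    bool is_undef;   // true: the element is undef and may take any value
    int32_t value;   // meaningful only when is_undef is false
};

// Returns true when inserting a scalar at `index` into `base` could be
// expressed as a zero-extending scalar move (movd): the insert targets
// element 0 and every other element is either undef or zero.
inline bool can_use_scalar_move(const std::array<Lane, 4>& base, unsigned index) {
    assert(index < base.size());
    if (index != 0)
        return false;
    for (std::size_t i = 1; i < base.size(); ++i) {
        if (!base[i].is_undef && base[i].value != 0)
            return false;
    }
    return true;
}
```

For the base vector from the IR above, <undef, 0, undef, 0> with insert index
0, this model returns true.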