Issue |
134272
|
Summary |
[RISCV] `llvm::is_contained` is suboptimal compared to X86/AArch64
|
Labels |
backend:RISC-V,
llvm:SLPVectorizer,
missed-optimization,
llvm:transforms
|
Assignees |
|
Reporter |
wangpc-pp
|
I found this in https://github.com/llvm/llvm-project/pull/134057.
The code is https://godbolt.org/z/6Tbqac9jW and I paste it here:
```cpp
#include <initializer_list>
/// Returns true iff \p Element exists in \p Set. This overload takes \p Set as
/// an initializer list and is `constexpr`-friendly.
template <typename T, typename E>
constexpr bool is_contained(std::initializer_list<T> Set, const E &Element) {
// TODO: Use std::find when we switch to C++20.
for (const T &V : Set)
if (V == Element)
return true;
return false;
}
bool bar2(unsigned v){
return is_contained({
1, 3}, v);
}
bool bar4(unsigned v){
return is_contained({
1, 3, 5, 6}, v);
}
bool bar8(unsigned v){
return is_contained({
1, 3, 5, 6, 7, 8, 9, 10}, v);
}
bool bar16(unsigned v){
return is_contained({
1, 3, 5, 6, 7, 8, 9, 10,
5, 4, 8, 1, 3, 1, 2, 4}, v);
}
```
For RISC-V, it generates suboptimal instruction sequences, especially when the data size is small.
For example, when the data size is 2:
X86
```asm
bar2(unsigned int):
dec edi
test edi, -3
sete al
ret
```
AArch64
```asm
bar2(unsigned int):
cmp w0, #1
ccmp w0, #3, #4, ne
cset w0, eq
ret
```
The results are really simple.
But on RISC-V, it generates:
```asm
bar2(unsigned int):
addi sp, sp, -16
li a1, 1
li a3, 3
li a2, 4
sw a1, 8(sp)
sw a3, 12(sp)
addi a1, sp, 8
.LBB0_1:
lw a3, 0(a1)
beq a3, a0, .LBB0_3
mv a4, a2
addi a2, a2, -4
addi a1, a1, 4
bnez a4, .LBB0_1
.LBB0_3:
xor a0, a0, a3
seqz a0, a0
addi sp, sp, 16
ret
```
The count of instructions is enormous!
I compared the compiler log, there are two points that may cause this divergence:
1. The first point is when doing SLP. X86/AArch64 convert the array to vectors while RISC-V doesn't.
2. The second point is SROA. The SROA can't see the offset because of PHI instruction and then these `lifetime` intrinsics can't be removed:
```
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable vscale_range(2,1024)
define dso_local noundef zeroext i1 @_Z3barj(i32 noundef signext %v) local_unnamed_addr #0 {
entry:
%ref.tmp = alloca [2 x i32], align 4
call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp) #2
store i32 1, ptr %ref.tmp, align 4, !tbaa !9
%arrayinit.element = getelementptr inbounds nuw i8, ptr %ref.tmp, i64 4
store i32 3, ptr %arrayinit.element, align 4, !tbaa !9
br label %for.body.i
for.body.i: ; preds = %for.body.i, %entry
%__begin0.013.i.idx = phi i64 [ 0, %entry ], [ %__begin0.013.i.add, %for.body.i ]
%__begin0.013.i.ptr = getelementptr inbounds nuw i8, ptr %ref.tmp, i64 %__begin0.013.i.idx
%0 = load i32, ptr %__begin0.013.i.ptr, align 4, !tbaa !9
%cmp2.not.i = icmp eq i32 %0, %v
%__begin0.013.i.add = add nuw nsw i64 %__begin0.013.i.idx, 4
%cmp.not.not.i = icmp eq i64 %__begin0.013.i.add, 8
%or.cond = select i1 %cmp2.not.i, i1 true, i1 %cmp.not.not.i
br i1 %or.cond, label %_Z12is_containedIijEbSt16initializer_listIT_ERKT0_.exit, label %for.body.i
_Z12is_containedIijEbSt16initializer_listIT_ERKT0_.exit: ; preds = %for.body.i
call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp) #2
ret i1 %cmp2.not.i
}
SROA function: _Z3barj
SROA alloca: %ref.tmp = alloca [2 x i32], align 4
Rewriting FCA loads and stores...
Can't analyze slices for alloca: %ref.tmp = alloca [2 x i32], align 4
A pointer to this alloca escaped by:
%0 = load i32, ptr %__begin0.013.i.ptr, align 4, !tbaa !9
```
----
I think this code pattern is really common in C++ code, but currently the RISC-V compiler can't compile it to the best binary code. I don't know which part I should focus on:
1. Let SLP kick in earlier?
2. Fix the SROA via SCEV? (I don't even know if it is by-design...)
I will appreciate it if someone can give me some suggestions, or, fix this issue!
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs