Issue 134272
Summary [RISCV] `llvm::is_contained` is suboptimal compared to X86/AArch64
Labels backend:RISC-V, llvm:SLPVectorizer, missed-optimization, llvm:transforms
Assignees
Reporter wangpc-pp
I found this while working on https://github.com/llvm/llvm-project/pull/134057.
The reproducer is at https://godbolt.org/z/6Tbqac9jW, and I paste it here:
```cpp
#include <initializer_list>

/// Returns true iff \p Element exists in \p Set. This overload takes \p Set as
/// an initializer list and is `constexpr`-friendly.
template <typename T, typename E>
constexpr bool is_contained(std::initializer_list<T> Set, const E &Element) {
  // TODO: Use std::find when we switch to C++20.
  for (const T &V : Set)
    if (V == Element)
      return true;
  return false;
}

bool bar2(unsigned v){
    return is_contained({
        1, 3}, v);
}

bool bar4(unsigned v){
    return is_contained({
        1, 3, 5, 6}, v);
}

bool bar8(unsigned v){
    return is_contained({
        1, 3, 5, 6, 7, 8, 9, 10}, v);
}

bool bar16(unsigned v){
    return is_contained({
        1, 3, 5, 6, 7, 8, 9, 10,
        5, 4, 8, 1, 3, 1, 2, 4}, v);
}
```
On RISC-V, this generates suboptimal instruction sequences, especially when the set is small.
For example, when the set has 2 elements:
X86
```asm
bar2(unsigned int):
        dec     edi
        test    edi, -3
        sete    al
        ret
```
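To make the x86 sequence above easier to read, here is a minimal sketch of the arithmetic it relies on. The function names `direct` and `folded` are made up for illustration and do not come from LLVM. Subtracting 1 maps the set {1, 3} to {0, 2}, and masking with -3 (every bit except bit 1) leaves zero exactly for those two values:

```cpp
#include <cstdint>

// Direct form: what is_contained({1, 3}, v) computes.
constexpr bool direct(std::uint32_t v) { return v == 1 || v == 3; }

// Folded form mirroring `dec edi; test edi, -3; sete al`.
// -3 as a 32-bit mask is 0xFFFFFFFD, i.e. all bits set except bit 1.
constexpr bool folded(std::uint32_t v) {
  return ((v - 1u) & 0xFFFFFFFDu) == 0;
}

// Compile-time spot checks, including the wraparound case v == 0
// (0 - 1 wraps to 0xFFFFFFFF, which the mask does not clear).
static_assert(folded(1) && folded(3), "members are accepted");
static_assert(!folded(0) && !folded(2) && !folded(4), "non-members are rejected");
static_assert(folded(3) == direct(3) && folded(2) == direct(2), "forms agree");
```

The same fold only works when the set members differ in a single bit after a common offset is subtracted, so it is specific to small sets like this one.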
AArch64
```asm
bar2(unsigned int):
        cmp     w0, #1
        ccmp    w0, #3, #4, ne
        cset    w0, eq
        ret
```
Both results are branch-free: three instructions plus `ret`.
But on RISC-V, it generates:
```asm
bar2(unsigned int):
        addi    sp, sp, -16
        li      a1, 1
        li      a3, 3
        li      a2, 4
        sw      a1, 8(sp)
        sw      a3, 12(sp)
        addi    a1, sp, 8
.LBB0_1:
        lw      a3, 0(a1)
        beq     a3, a0, .LBB0_3
        mv      a4, a2
        addi    a2, a2, -4
        addi    a1, a1, 4
        bnez    a4, .LBB0_1
.LBB0_3:
        xor     a0, a0, a3
        seqz    a0, a0
        addi    sp, sp, 16
        ret
```
That is 17 instructions, including a stack-allocated array and a loop, for a two-element lookup!
I compared the compiler logs, and there are two points that may cause this divergence:
1. The first point is SLP. On X86/AArch64 the array is converted to vectors, while on RISC-V it is not.
2. The second point is SROA. SROA can't see the offset because of the PHI instruction, so the alloca stays and these `lifetime` intrinsics can't be removed:
```
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable vscale_range(2,1024)
define dso_local noundef zeroext i1 @_Z3barj(i32 noundef signext %v) local_unnamed_addr #0 {
entry:
  %ref.tmp = alloca [2 x i32], align 4
  call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp) #2
  store i32 1, ptr %ref.tmp, align 4, !tbaa !9
  %arrayinit.element = getelementptr inbounds nuw i8, ptr %ref.tmp, i64 4
  store i32 3, ptr %arrayinit.element, align 4, !tbaa !9
  br label %for.body.i

for.body.i:                                       ; preds = %for.body.i, %entry
  %__begin0.013.i.idx = phi i64 [ 0, %entry ], [ %__begin0.013.i.add, %for.body.i ]
  %__begin0.013.i.ptr = getelementptr inbounds nuw i8, ptr %ref.tmp, i64 %__begin0.013.i.idx
  %0 = load i32, ptr %__begin0.013.i.ptr, align 4, !tbaa !9
  %cmp2.not.i = icmp eq i32 %0, %v
  %__begin0.013.i.add = add nuw nsw i64 %__begin0.013.i.idx, 4
  %cmp.not.not.i = icmp eq i64 %__begin0.013.i.add, 8
  %or.cond = select i1 %cmp2.not.i, i1 true, i1 %cmp.not.not.i
  br i1 %or.cond, label %_Z12is_containedIijEbSt16initializer_listIT_ERKT0_.exit, label %for.body.i

_Z12is_containedIijEbSt16initializer_listIT_ERKT0_.exit: ; preds = %for.body.i
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp) #2
  ret i1 %cmp2.not.i
}
SROA function: _Z3barj
SROA alloca: %ref.tmp = alloca [2 x i32], align 4
  Rewriting FCA loads and stores...
Can't analyze slices for alloca:   %ref.tmp = alloca [2 x i32], align 4
  A pointer to this alloca escaped by:
    %0 = load i32, ptr %__begin0.013.i.ptr, align 4, !tbaa !9
```
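For comparison, here is a hand-unrolled sketch of the two-element case. `bar2_unrolled` is a hypothetical name used only for illustration. Every array access in it uses a constant index, so there is no PHI-based offset like the one in the IR above. I would expect SROA to split such an array into scalars, but I have not verified that against the pass output:

```cpp
// Hypothetical hand-unrolled equivalent of bar2 from the reproducer.
// Both loads use constant indices, unlike the loop form, where the
// offset into the array is carried by a PHI node.
constexpr bool bar2_unrolled(unsigned v) {
  const unsigned set[2] = {1, 3};
  return set[0] == v || set[1] == v;
}

// Compile-time spot checks of the expected membership results.
static_assert(bar2_unrolled(1) && bar2_unrolled(3), "members are accepted");
static_assert(!bar2_unrolled(0) && !bar2_unrolled(2), "non-members are rejected");
```

This only illustrates the shape of code the optimizer would need to reach. It is not a proposed change to `llvm::is_contained`.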
----
I think this pattern is really common in C++ code, but currently LLVM can't compile it to optimal code when targeting RISC-V. I don't know which part I should focus on:
1. Let SLP kick in earlier?
2. Fix the SROA via SCEV? (I don't even know if it is by-design...)
I would appreciate it if someone could give me some suggestions or fix this issue!