| Field | Value |
| --- | --- |
| Issue | 123728 |
| Summary | [AMDGPU][GISel] masking instructions (`and`) not necessary |
| Labels | |
| Assignees | |
| Reporter | qcolombet |

When comparing SDISel with GISel, I stumbled on a case where GISel keeps a bunch of `and` instructions that mask the values down to the `i8` type.
These masks are actually useless because the value is fed to a `truncstore i8` which does the masking itself.
SDISel removes all the redundant bit logic but GISel doesn't.
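
To illustrate why the masks are redundant, here is a small standalone C++ model. It is not LLVM code, and the function names `truncStoreI8` and `maskedTruncStoreI8` are made up for this sketch. It assumes the truncating store simply keeps the low 8 bits of a 32-bit value, which is what the description above relies on.

```cpp
#include <cstdint>

// Models a truncating i8 store: only the low 8 bits of the value reach memory.
constexpr std::uint8_t truncStoreI8(std::uint32_t value) {
  return static_cast<std::uint8_t>(value);
}

// Same store, preceded by the explicit 0xff mask that GISel leaves in place.
constexpr std::uint8_t maskedTruncStoreI8(std::uint32_t value) {
  return static_cast<std::uint8_t>(value & 0xffu);
}

// The mask never changes the stored byte, whatever the upper bits hold.
static_assert(truncStoreI8(0x12345678u) == maskedTruncStoreI8(0x12345678u),
              "mask is redundant before a truncating byte store");
static_assert(truncStoreI8(0xffffff80u) == maskedTruncStoreI8(0xffffff80u),
              "mask is redundant even when the upper bits are all set");
```

Since both functions agree on every input, the `and` can be dropped without changing what is written to memory.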

I haven't dug into the details, so I don't know exactly how SDISel implements this simplification (it is certainly done by some combines, but I don't know which ones).

Note that this likely affects all targets, so a generic combiner helper would probably be welcome.
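
As a rough sketch of the legality check such a helper might perform, the standalone C++ below models it on plain integers. This is an assumption about what the combine would need, not existing LLVM API: `isMaskRedundantForTruncStore` is a hypothetical name, and a real combine would also have to look through the intervening instructions and prove that the store is the only user of the masked value.

```cpp
#include <cstdint>

// Hypothetical check: `x & mask` can be replaced by `x` when the only user
// stores the low `storeBits` bits and the mask keeps all of those bits.
constexpr bool isMaskRedundantForTruncStore(std::uint64_t mask,
                                            unsigned storeBits) {
  // Be conservative for degenerate or unsupported widths.
  if (storeBits == 0 || storeBits > 64)
    return false;
  // Build a value with the low `storeBits` bits set, avoiding a shift by 64.
  const std::uint64_t lowBits =
      storeBits == 64 ? ~std::uint64_t{0}
                      : ((std::uint64_t{1} << storeBits) - 1);
  // The mask is redundant only if it clears none of the stored bits.
  return (mask & lowBits) == lowBits;
}
```

For the case in this report, the mask is `0xff` and the store width is 8 bits, so the check succeeds. A mask such as `0x7f` would clear a stored bit and must be kept.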

# To Reproduce #

Download the attached IR or copy/paste the LLVM IR input from the Note section below.
[repro.ll.txt](https://github.com/user-attachments/files/18489129/repro.ll.txt)

Then run:
```bash
llc -march=amdgcn -mcpu=gfx942  -mtriple amdgcn-amd-hmcsa -global-isel=<0|1>  repro.ll -o -
```

# Results #

GISel ends up with a bunch of bit manipulation operations, in particular `and`s, whereas SDISel doesn't.

With GISel:
```asm
	s_load_dwordx2 s[0:1], s[4:5], 0x24
	v_mov_b32_e32 v2, 8
	v_mov_b32_e32 v0, 0xff
	s_waitcnt lgkmcnt(0)
	s_lshr_b32 s2, s0, 16
	s_lshr_b32 s3, s1, 16
	v_cvt_f32_f16_e32 v1, s2
	v_cvt_f32_f16_e32 v3, s0
	v_cvt_f32_f16_e32 v4, s1
	v_cvt_f32_f16_e32 v5, s3
	v_cvt_i32_f32_e32 v1, v1
	v_cvt_i32_f32_e32 v3, v3
	v_cvt_i32_f32_e32 v4, v4
	v_cvt_i32_f32_e32 v5, v5
	v_and_b32_e32 v6, 0xff, v1 ; <-- the AND I'm talking about
	v_lshlrev_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
	v_and_or_b32 v0, v3, v0, v1
	v_and_b32_e32 v1, 0xff, v4 ; <-- here too
	v_and_b32_e32 v4, 0xff, v5 ; <-- here too
	v_lshlrev_b32_e32 v1, 16, v1
	v_lshlrev_b32_e32 v4, 24, v4
	v_or3_b32 v4, v0, v1, v4
	v_lshlrev_b16_e32 v0, 8, v6
	v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
	v_lshrrev_b16_e32 v3, 8, v0
	v_mov_b64_e32 v[0:1], 0
	global_store_byte v[0:1], v4, off
	v_mov_b64_e32 v[0:1], 1
	global_store_byte v[0:1], v3, off
	v_mov_b64_e32 v[0:1], 2
	v_lshrrev_b16_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
	global_store_byte_d16_hi v[0:1], v4, off
	v_mov_b64_e32 v[0:1], 3
	global_store_byte v[0:1], v2, off
```

With SDISel:
```asm
	s_load_dwordx2 s[0:1], s[4:5], 0x24
	v_mov_b64_e32 v[0:1], 2
	s_waitcnt lgkmcnt(0)
	v_cvt_i16_f16_e32 v3, s1
	s_lshr_b32 s3, s1, 16
	v_cvt_i16_f16_e32 v2, s0
	global_store_byte v[0:1], v3, off
	v_mov_b64_e32 v[0:1], 0
	s_lshr_b32 s2, s0, 16
	v_cvt_i16_f16_e32 v5, s3
	global_store_byte v[0:1], v2, off
	v_mov_b64_e32 v[0:1], 3
	v_cvt_i16_f16_e32 v4, s2
	global_store_byte v[0:1], v5, off
	v_mov_b64_e32 v[0:1], 1
	global_store_byte v[0:1], v4, off
```

# Note #

Input IR:
```llvm
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

define amdgpu_kernel void @foo(<4 x half> %i35) {
bb:
  %i90 = fptosi <4 x half> %i35 to <4 x i8>
  store <4 x i8> %i90, ptr addrspace(1) null, align 1
  ret void
}
```