Issue |
Summary |
[AMDGPU][GISel] masking instructions (`and`) not necessary
Labels |
Assignees |
Reporter |
When comparing SDISel with GISel I stumbled on a case where GISel keeps a bunch of `and` instructions that mask the values down to `i8`.
These masks are useless because the value is fed to a `truncstore i8`, which does the masking itself.
SDISel removes all the redundant bit logic but GISel doesn't.
I haven't dug into the details, so I don't know exactly how SDISel implements the simplification (it is certainly done by some combines, but I don't know which ones).
Note that this likely affects all targets, so a generic combiner helper would likely be welcome.
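To make the redundancy concrete, here is a minimal bit-level model of the claim. It is only an illustration, not how SDISel or GISel implement the combine, and the names `truncstore` and `mask_is_redundant` are hypothetical. It assumes a truncating store keeps only the low `store_bits` bits, so an `and` feeding it is a no-op whenever the mask has all of those low bits set.

```python
def truncstore(value: int, store_bits: int) -> int:
    """Model a truncating store: only the low `store_bits` bits reach memory."""
    return value & ((1 << store_bits) - 1)


def mask_is_redundant(mask: int, store_bits: int) -> bool:
    """An `and` feeding a truncating store changes nothing when the mask
    keeps every bit that the store itself keeps."""
    demanded = (1 << store_bits) - 1
    return (mask & demanded) == demanded


# Exhaustively confirm the claim for 16-bit inputs and an i8 truncating store.
assert mask_is_redundant(0xFF, 8)
assert all(truncstore(x & 0xFF, 8) == truncstore(x, 8) for x in range(1 << 16))

# A narrower mask is not redundant: it clears a bit the store would keep.
assert not mask_is_redundant(0x7F, 8)
assert truncstore(0x80 & 0x7F, 8) != truncstore(0x80, 8)
```

A generic helper would presumably perform the equivalent check on the store's memory type and the `and` constant, but the exact form is left open here.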
# To Reproduce #
Download the attached IR or copy/paste the LLVM IR input from the section below.
Then run:
llc -march=amdgcn -mcpu=gfx942 -mtriple amdgcn-amd-hmcsa -global-isel=<0|1> repro.ll -o -
# Results #
GISel ends up with a bunch of bit manipulation operations, in particular `and`s, whereas SDISel doesn't.
With GISel:
s_load_dwordx2 s[0:1], s[4:5], 0x24
v_mov_b32_e32 v2, 8
v_mov_b32_e32 v0, 0xff
s_waitcnt lgkmcnt(0)
s_lshr_b32 s2, s0, 16
s_lshr_b32 s3, s1, 16
v_cvt_f32_f16_e32 v1, s2
v_cvt_f32_f16_e32 v3, s0
v_cvt_f32_f16_e32 v4, s1
v_cvt_f32_f16_e32 v5, s3
v_cvt_i32_f32_e32 v1, v1
v_cvt_i32_f32_e32 v3, v3
v_cvt_i32_f32_e32 v4, v4
v_cvt_i32_f32_e32 v5, v5
v_and_b32_e32 v6, 0xff, v1 <-- the AND I'm talking about
v_lshlrev_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
v_and_or_b32 v0, v3, v0, v1
v_and_b32_e32 v1, 0xff, v4 <-- here too
v_and_b32_e32 v4, 0xff, v5 <-- here too
v_lshlrev_b32_e32 v1, 16, v1
v_lshlrev_b32_e32 v4, 24, v4
v_or3_b32 v4, v0, v1, v4
v_lshlrev_b16_e32 v0, 8, v6
v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
v_lshrrev_b16_e32 v3, 8, v0
v_mov_b64_e32 v[0:1], 0
global_store_byte v[0:1], v4, off
v_mov_b64_e32 v[0:1], 1
global_store_byte v[0:1], v3, off
v_mov_b64_e32 v[0:1], 2
v_lshrrev_b16_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
global_store_byte_d16_hi v[0:1], v4, off
v_mov_b64_e32 v[0:1], 3
global_store_byte v[0:1], v2, off
With SDISel:
s_load_dwordx2 s[0:1], s[4:5], 0x24
v_mov_b64_e32 v[0:1], 2
s_waitcnt lgkmcnt(0)
v_cvt_i16_f16_e32 v3, s1
s_lshr_b32 s3, s1, 16
v_cvt_i16_f16_e32 v2, s0
global_store_byte v[0:1], v3, off
v_mov_b64_e32 v[0:1], 0
s_lshr_b32 s2, s0, 16
v_cvt_i16_f16_e32 v5, s3
global_store_byte v[0:1], v2, off
v_mov_b64_e32 v[0:1], 3
v_cvt_i16_f16_e32 v4, s2
global_store_byte v[0:1], v5, off
v_mov_b64_e32 v[0:1], 1
global_store_byte v[0:1], v4, off
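Reading the two listings, the GISel output appears to mask each converted lane, pack the four lanes into a 32-bit value with shifts and ors, and then split that value back into bytes for the stores, whereas the SDISel output stores each lane directly. The sketch below is a rough model of that reading, not a description of the actual lowering; the function names and sample values are hypothetical. It only shows that both routes write the same bytes.

```python
def pack_bytes(lanes):
    """Pack lanes into one word, masking each lane to 8 bits first
    (the and / shift / or sequence)."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFF) << (8 * i)
    return word


def stores_via_packing(lanes):
    """Extract each byte back out of the packed word before storing it."""
    word = pack_bytes(lanes)
    return [(word >> (8 * i)) & 0xFF for i in range(len(lanes))]


def stores_direct(lanes):
    """Store each lane with a truncating byte store, with no packing step."""
    return [lane & 0xFF for lane in lanes]


# Arbitrary sample values, not taken from the report.
lanes = [-1, 0x1234, 127, -128]
assert stores_via_packing(lanes) == stores_direct(lanes)
```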
# Note #
Input IR:
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"
define amdgpu_kernel void @foo(<4 x half> %i35) {
%i90 = fptosi <4 x half> %i35 to <4 x i8>
store <4 x i8> %i90, ptr addrspace(1) null, align 1
ret void
}
llvm-bugs mailing list