On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <t...@stellard.net> wrote: > On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote: >> I've looked into how to recognize BFM and BFI and discovered that if >> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the >> pattern in the backend due to LLVM transformations. The reason it's >> impossible is that one particular simplification of the expanded IR >> can always be done and it always changes the IR in a way that BFI >> can't be recognized anymore. >> >> The ideal transformation from TGSI to ISA is (note that this is also >> how GLSL defines the opcode): >> >> TGSI_OPCODE_BFI(base, insert, offset, bits) >> = BFI(BFM(bits, offset), SHL(insert, offset), base) = >> s_lshl_b32 s1, s4, s6 >> s_bfm_b32 s0, s0, s6 >> v_mov_b32_e32 v0, s5 >> v_mov_b32_e32 v1, s1 >> v_bfi_b32 v0, s0, v1, v0 >> Ideally 3 instructions if all sources are vector registers. >> >> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations >> (BTW the result of BFM has 2 uses in BFI), LLVM applies this >> transformation: >> (X << S) & (Y << S) ----> (X & Y) << S >> Which breaks recognition of BFI and also the second use of BFM. >> Therefore this version calculates the same BFM expression twice. The >> first BFM is recognized by pattern matching, but the second BFM as >> well as BFI is unrecognizable due to the transformation. The result >> is: >> s_lshl_b32 s1, 1, s0 >> s_bfm_b32 s0, s0, s6 >> s_add_i32 s1, s1, -1 >> s_not_b32 s0, s0 >> s_and_b32 s1, s1, s4 >> s_and_b32 s0, s0, s5 >> s_lshl_b32 s1, s1, s6 >> s_or_b32 s0, s1, s0 >> >> There are 2 ways out of this: >> >> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in >> the future. >> >> 2) Try to recognize the expression tree seen by the backend. Changes >> in LLVM core can break it. More complicated shaders with more >> opportunities for transformations can break it too: >> >> def : Pat < >> (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)), >> -1)), i32:$b)), i32:$c)), >> (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), >> -1)), i32:$c)), -1)), i32:$d)))), >> (V_BFI_B32 (S_BFM_B32 $a, $c), >> (S_LSHL_B32 $b, $c), >> $d) >> >; > > I don't want to waste a lot of time discussing this, because it probably > doesn't matter too much in the long run. I'm fine with using > intrinsics, but I just wanted to point out a few things in case you or > someone else wants to get this working using LLVM IR. > > 1. Running the instruction combining pass should help with pattern > matching. This transforms common sequence into canonical forms which > make them easier to match in the backend. We should be running this > pass anyway as it has some good optimization. > > > diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c > b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c > index dce5b55..45c9eb8 100644 > --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c > +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c > @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct > radeon_llvm_context * ctx) > LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr); > > /* Add some optimization passes */ > + LLVMAddInstructionCombiningPass(gallivm->passmgr); > LLVMAddScalarReplAggregatesPass(gallivm->passmgr); > LLVMAddLICMPass(gallivm->passmgr); > LLVMAddAggressiveDCEPass(gallivm->passmgr);
I tried this a long time ago and it broke a few tests which used the kill intrinsic. I'm testing it right now and it increases the shader binary size quite a lot. It looks like some DAG combines don't work with it anymore and the generated code looks worse overall. I occasionally use llc for debugging, which doesn't seem to use it either? Anyway, it looks like there's a lot of work needed just to fix the worse code generation. And when Mesa starts to use it, llc should use it too. > > > > > 2. We have a number of patterns for BFI already in AMDGPUInstructions.td > If we are modeling BFI using LLVM IR, we should be using one of those > patterns. All of them are useless for TGSI_OPCODE_BFI. > > 3. The BFI, BFM, LSHL sequence does not need to be matched in a single > pattern. There should be one pattern for BFI, one for BFM, and one for > LSHL. This won't work for TGSI, which I tried to explain in detail. Matching the whole pattern is the only way. The big pattern can recognize a hidden duplicated expression in it which the CSE can't. Any strict subset of the pattern doesn't make any sense alone. > > 4. Using intrinsics means no constant folding will be done, which could > lead to worse code in some cases. This is not a big issue. The GLSL compiler understands these instrinsics and takes care of constant folding. Marek _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev