On 03/16/2015 10:06 AM, Matt Turner wrote:
> On Wed, Mar 11, 2015 at 1:44 PM, Ian Romanick <i...@freedesktop.org> wrote:
>> From: Ian Romanick <ian.d.roman...@intel.com>
>>
>> On platforms that do not natively generate 0u and ~0u for Boolean
>> results, b2f expressions that look like
>>
>>     f = b2f(expr cmp 0)
>>
>> will generate better code by pretending the expression is
>>
>>     f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>>
>> This is because the last instruction of "expr" can generate the
>> condition code for the "cmp 0".  This avoids having to do the "-(b & 1)"
>> trick to generate 0u or ~0u for the Boolean result.  This means code like
>>
>>     mov(16) g16<1>F 1F
>>     mul.ge.f0(16) null g6<8,8,1>F g14<8,8,1>F
>>     (+f0) sel(16) m6<1>F g16<8,8,1>F 0F
>>
>> will be generated instead of
>>
>>     mul(16) g2<1>F g12<8,8,1>F g4<8,8,1>F
>>     cmp.ge.f0(16) g2<1>D g4<8,8,1>F 0F
>
> Presumably this g4 should be g2?

Probably.  I was cutting out of a diff of shader-db results, and I must
have botched it.  Here's the diff from shaders/anholt/6.shader_test:

@@ -129,7 +129,7 @@
 )
 Native code for unnamed fragment shader 3
-SIMD8 shader: 77 instructions. 0 loops. Compacted 1232 to 832 bytes (32%)
+SIMD8 shader: 76 instructions. 0 loops. Compacted 1216 to 816 bytes (33%)
    START B0
 add(8) g9<1>UW g1.4<2,4,0>UW 0x10101010V { align1 };
 mov(8) m3<1>F 16F { align1 };
@@ -163,7 +163,7 @@
 add(8) g2<1>F g2<8,8,1>F g6<8,8,1>F { align1 compacted };
 send(8) 2 g6<1>F g2<8,8,1>F math rsq mlen 1 rlen 1 { align1 };
-mul(8) g2<1>F g5<8,8,1>F g6<8,8,1>F { align1 compacted };
+mul.ge.f0(8) g2<1>F g5<8,8,1>F g6<8,8,1>F { align1 compacted };
 mul(8) g5<1>F -g12<8,8,1>F -g12<8,8,1>F { align1 compacted };
 mul(8) g7<1>F -g16<8,8,1>F -g16<8,8,1>F { align1 compacted };
 mul(8) g8<1>F -g15<8,8,1>F -g15<8,8,1>F { align1 compacted };
@@ -194,14 +194,13 @@
 send(8) 2 g3<1>F g3<8,8,1>F math pow mlen 2 rlen 1 { align1 };
 mul(8) m6<1>F g11<8,8,1>F g8<8,8,1>F { align1 };
-cmp.ge.f0(8) g4<1>F g2<8,8,1>F 0F { align1 };
+mov(8) g4<1>F 1F { align1 };
 mov(8) m2<1>F g6<8,8,1>F { align1 };
 mov(8) m3<1>F g7<8,8,1>F { align1 };
 add(8) g9<1>F g2<8,8,1>F g3<8,8,1>F { align1 compacted };
-and(8) g8<1>D g4<8,8,1>D 1D { align1 };
+(+f0) sel(8) g8<1>F g4<8,8,1>F 0F { align1 };
 send(8) 2 g4<1>UW null sampler (1, 0, 3, 1) mlen 5 rlen 4 { align1 };
-and(8) g8<1>D -g8<8,8,1>D 0x3f800000UD { align1 };
 mul(8) g9<1>F g9<8,8,1>F g4<8,8,1>F { align1 compacted };
 mul(8) m3<1>F g8<8,8,1>F g9<8,8,1>F { align1 };
 mul(8) g9<1>F g2<8,8,1>F 0.7F { align1 };

I think I can adjust the commit message to:

"...This means code like

    mul.ge.f0(8) g2<1>F g5<8,8,1>F g6<8,8,1>F
    mov(8) g4<1>F 1F
    (+f0) sel(8) g8<1>F g4<8,8,1>F 0F

will be generated instead of

    mul(8) g2<1>F g5<8,8,1>F g6<8,8,1>F
    cmp.ge.f0(8) g4<1>F g2<8,8,1>F 0F
    and(8) g8<1>D g4<8,8,1>D 1D
    and(8) g8<1>D -g8<8,8,1>D 0x3f800000UD"

I'll update the comment in the code too.

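For anyone following along, the two and instructions in the old sequence are the "-(b & 1)" trick from the commit message: mask off the low bit of the comparison result, negate it to get 0u or ~0u, then AND with the bit pattern of 1.0f. Below is a minimal C++ sketch of just that arithmetic, assuming only the low bit of the comparison result is meaningful. The names b2f_bits and b2f are hypothetical and exist only for this illustration; they are not Mesa functions.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Mesa). Mirrors
//    and dst, src, 1
//    and dst, -dst, 0x3f800000
// where 0x3f800000 is the IEEE-754 single-precision encoding of 1.0f.
constexpr uint32_t b2f_bits(uint32_t raw_bool)
{
   // (raw_bool & 1u) is 0 or 1; negating it in unsigned arithmetic gives
   // 0u or ~0u, which then selects either no bits or all bits of 1.0f.
   return (0u - (raw_bool & 1u)) & 0x3f800000u;
}

// Reinterpret the resulting bits as a float without aliasing problems.
inline float b2f(uint32_t raw_bool)
{
   const uint32_t bits = b2f_bits(raw_bool);
   float f;
   std::memcpy(&f, &bits, sizeof f);
   return f;
}

// Only the low bit matters; whatever is in the upper bits is ignored.
static_assert(b2f_bits(0x00000001u) == 0x3f800000u, "true  -> bits of 1.0f");
static_assert(b2f_bits(0xfffffffeu) == 0x00000000u, "false -> bits of 0.0f");
```

The sel-based sequence gets the same 0.0f or 1.0f result from the flag register directly, which is why both and instructions drop out of the shader above.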
>>     and(16) g4<1>D g2<8,8,1>D 1D
>>     and(16) m6<1>D -g4<8,8,1>D 0x3f800000UD
>>
>> v2: When the comparison is either == 0.0 or != 0.0, using the knowledge
>> that the true (or false) case already results in zero would allow better
>> code generation by possibly avoiding a load-immediate instruction.
>>
>> v3: Apply the optimization even when neither comparator is zero.
>>
>> Shader-db results:
>>
>> GM45 (0x2A42):
>> total instructions in shared programs: 3551002 -> 3550829 (-0.00%)
>> instructions in affected programs:     33269 -> 33096 (-0.52%)
>> helped:                                121
>>
>> Iron Lake (0x0046):
>> total instructions in shared programs: 4993327 -> 4993146 (-0.00%)
>> instructions in affected programs:     34199 -> 34018 (-0.53%)
>> helped:                                129
>>
>> No change on other platforms.
>>
>> Signed-off-by: Ian Romanick <ian.d.roman...@intel.com>
>> Cc: Tapani Palli <tapani.pa...@intel.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.h           |   2 +
>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 101 +++++++++++++++++++++++++--
>>  2 files changed, 99 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>> index d9d5858..075e90c 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -307,6 +307,7 @@ public:
>>                   const fs_reg &a);
>>     void emit_minmax(enum brw_conditional_mod conditionalmod, const fs_reg &dst,
>>                      const fs_reg &src0, const fs_reg &src1);
>> +   bool try_emit_b2f_of_comparison(ir_expression *ir);
>>     bool try_emit_saturate(ir_expression *ir);
>>     bool try_emit_line(ir_expression *ir);
>>     bool try_emit_mad(ir_expression *ir);
>> @@ -317,6 +318,7 @@ public:
>>     bool opt_saturate_propagation();
>>     bool opt_cmod_propagation();
>>     void emit_bool_to_cond_code(ir_rvalue *condition);
>> +   void emit_bool_to_cond_code_of_reg(ir_expression *expr, fs_reg op[3]);
>>     void emit_if_gen6(ir_if *ir);
>>     void emit_unspill(bblock_t *block, fs_inst *inst, fs_reg reg,
>>                       uint32_t spill_offset,
>>                       int count);
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> index 3025a9d..3d79796 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> @@ -475,6 +475,87 @@ fs_visitor::try_emit_mad(ir_expression *ir)
>>     return true;
>>  }
>>
>> +bool
>> +fs_visitor::try_emit_b2f_of_comparison(ir_expression *ir)
>> +{
>> +   /* On platforms that do not natively generate 0u and ~0u for Boolean
>> +    * results, b2f expressions that look like
>> +    *
>> +    *     f = b2f(expr cmp 0)
>> +    *
>> +    * will generate better code by pretending the expression is
>> +    *
>> +    *     f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>> +    *
>> +    * This is because the last instruction of "expr" can generate the
>> +    * condition code for the "cmp 0".  This avoids having to do the "-(b & 1)"
>> +    * trick to generate 0u or ~0u for the Boolean result.  This means code like
>> +    *
>> +    *     mov(16) g16<1>F 1F
>> +    *     mul.ge.f0(16) null g6<8,8,1>F g14<8,8,1>F
>> +    *     (+f0) sel(16) m6<1>F g16<8,8,1>F 0F
>> +    *
>> +    * will be generated instead of
>> +    *
>> +    *     mul(16) g2<1>F g12<8,8,1>F g4<8,8,1>F
>> +    *     cmp.ge.f0(16) g2<1>D g4<8,8,1>F 0F
>> +    *     and(16) g4<1>D g2<8,8,1>D 1D
>> +    *     and(16) m6<1>D -g4<8,8,1>D 0x3f800000UD
>> +    *
>> +    * When the comparison is either == 0.0 or != 0.0 using the knowledge that
>> +    * the true (or false) case already results in zero would allow better code
>> +    * generation by possibly avoiding a load-immediate instruction.
>> +    */
>> +   ir_expression *cmp = ir->operands[0]->as_expression();
>> +   if (cmp == NULL)
>> +      return false;
>> +
>> +   if (cmp->operation == ir_binop_equal || cmp->operation == ir_binop_nequal) {
>> +      for (unsigned i = 0; i < 2; i++) {
>> +         ir_constant *c = cmp->operands[i]->as_constant();
>> +         if (c == NULL || !c->is_zero())
>> +            continue;
>> +
>> +         ir_expression *expr = cmp->operands[i ^ 1]->as_expression();
>> +         if (expr != NULL) {
>> +            fs_reg op[2];
>> +
>> +            for (unsigned j = 0; j < 2; j++) {
>> +               cmp->operands[j]->accept(this);
>> +               op[j] = this->result;
>> +
>> +               resolve_ud_negate(&op[j]);
>> +            }
>> +
>> +            emit_bool_to_cond_code_of_reg(cmp, op);
>> +
>> +            /* In this case we know when the condition is true, op[i ^ 1]
>> +             * contains zero.  Invert the predicate, use op[i ^ 1] as src0,
>> +             * and immediate 1.0f as src1.
>> +             */
>> +            this->result = vgrf(ir->type);
>> +            op[i ^ 1].type = BRW_REGISTER_TYPE_F;
>
> We just do op[1 - i] in tons of other places. No comment needed to explain
> 1-i.

It must be the old timer in me, but I'd swear that i^1 typically
generates fewer instructions than 1-i on x86.  I know it's not
definitive, but with i^1 that function is 1025 bytes (excluding padding
at the end) and with 1-i it's 1091 bytes (excluding padding at the end).

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
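
On the i ^ 1 versus 1 - i question, the two spellings pick the same "other operand" whenever i is 0 or 1, which is the only range the loop above uses. A small C++ sketch of that equivalence follows; other_by_xor and other_by_sub are hypothetical names used only here, and the sketch says nothing about which form a given compiler turns into smaller code.

```cpp
// Hypothetical helpers, named only for this illustration.
constexpr unsigned other_by_xor(unsigned i) { return i ^ 1u; }
constexpr unsigned other_by_sub(unsigned i) { return 1u - i; }

// For an index in {0, 1}, both forms select the opposite index.
static_assert(other_by_xor(0u) == 1u && other_by_xor(1u) == 0u,
              "xor toggles 0 <-> 1");
static_assert(other_by_sub(0u) == 1u && other_by_sub(1u) == 0u,
              "1 - i toggles 0 <-> 1");

// Outside {0, 1} they diverge: xor only flips the low bit, while the
// unsigned subtraction wraps around.
static_assert(other_by_xor(2u) == 3u, "2 ^ 1 == 3");
static_assert(other_by_sub(2u) == ~0u, "1u - 2u wraps to UINT_MAX");
```

So the choice between them in this function comes down to readability and generated code size, not behavior.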