On Thu, Oct 22, 2015 at 11:19 AM, Matt Turner <matts...@gmail.com> wrote: > On Mon, Oct 19, 2015 at 10:38 AM, Alejandro Piñeiro > <apinhe...@igalia.com> wrote: >> vec4_live_variables tracks now each flag channel independently, so >> vec4_dead_code_eliminate can update the writemask of null registers, >> based on which component are alive at the moment. This would allow >> vec4_cmod_propagation to optimize out several movs involving null >> registers. >> >> v2: added support to track each flag channel independently at vec4 >> live_variables, as v1 assumed that it was already doing it, as >> pointed by Francisco Jerez >> --- >> >> I was tempted to split this commit in two, one for tracking each >> channel independently at vec4_live_variables, and other for using it >> at vec4_dead_code_eliminate, but in the end I concluded that made more >> sense as one commit, as are changes tightly tied. >> >> FWIW, the shader-db numbers for the optimization patch ("i965/vec4: >> adding vec4_cmod_propagation optimization", 2/5) changed, being now >> the following: >> >> total instructions in shared programs: 6240413 -> 6235841 (-0.07%) >> instructions in affected programs: 401933 -> 397361 (-1.14%) >> total loops in shared programs: 1979 -> 1979 (0.00%) >> helped: 2265 >> HURT: 0 >> >> That are slightly worse that the numbers for the first version of the >> patch, before adding that extra check. I will use this numbers if this >> patch is approved. >> >> FWIW, without this change on live_variables and dead_code eliminate, >> the numbers on the optimization would be the following ones: >> >> total instructions in shared programs: 6240413 -> 6240253 (-0.00%) >> instructions in affected programs: 18965 -> 18805 (-0.84%) >> total loops in shared programs: 1979 -> 1979 (0.00%) >> helped: 160 >> HURT: >> >> >> src/mesa/drivers/dri/i965/brw_ir_vec4.h | 26 ++++++++++++++++++ >> .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 31 >> +++++++++++++++------- >> .../drivers/dri/i965/brw_vec4_live_variables.cpp | 14 ++++++---- >> 3 files changed, 57 insertions(+), 14 deletions(-) >> >> diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h >> b/src/mesa/drivers/dri/i965/brw_ir_vec4.h >> index 96dd633..934d7b1 100644 >> --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h >> +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h >> @@ -185,6 +185,32 @@ public: >> return predicate || opcode == VS_OPCODE_UNPACK_FLAGS_SIMD4X2; >> } >> >> + bool reads_flag(unsigned c) >> + { >> + if (!reads_flag()) >> + return false; >> + >> + switch(predicate) { > > Space between switch and ( > >> + case BRW_PREDICATE_NONE: >> + return false; >> + break; > > No need for break since it returns early (same applies below). > >> + case BRW_PREDICATE_ALIGN16_REPLICATE_X: >> + return (c == 0 ? true : false); > > Simply return c == 0. (same below) > >> + break; >> + case BRW_PREDICATE_ALIGN16_REPLICATE_Y: >> + return (c == 1 ? true : false); >> + break; >> + case BRW_PREDICATE_ALIGN16_REPLICATE_Z: >> + return (c == 2 ? true : false); >> + break; >> + case BRW_PREDICATE_ALIGN16_REPLICATE_W: >> + return (c == 3 ? true : false); >> + break; >> + default: >> + return true; >> + } >> + } >> + >> bool writes_flag() >> { >> return (conditional_mod && (opcode != BRW_OPCODE_SEL && >> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp >> b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp >> index 8fc7a36..d4216fd 100644 >> --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp >> +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp >> @@ -78,13 +78,19 @@ vec4_visitor::dead_code_eliminate() >> sizeof(BITSET_WORD)); >> >> foreach_inst_in_block_reverse(vec4_instruction, inst, block) { >> - if (inst->dst.file == GRF && !inst->has_side_effects()) { >> + if ((inst->dst.file == GRF && !inst->has_side_effects()) || >> + (inst->dst.is_null() && inst->writes_flag())){ >> bool result_live[4] = { false }; >> >> - for (unsigned i = 0; i < inst->regs_written; i++) { >> - for (int c = 0; c < 4; c++) >> - result_live[c] |= BITSET_TEST( >> - live, var_from_reg(alloc, offset(inst->dst, i), c)); >> + if (inst->dst.file == GRF) { >> + for (unsigned i = 0; i < inst->regs_written; i++) { >> + for (int c = 0; c < 4; c++) >> + result_live[c] |= BITSET_TEST( >> + live, var_from_reg(alloc, offset(inst->dst, i), c)); >> + } >> + } else { >> + for (unsigned c = 0; c < 4; c++) >> + result_live[c] = BITSET_TEST(flag_live, c); >> } >> >> /* If the instruction can't do writemasking, then it's all or > > Below this hunk is the code that changes the opcode to NOP when all > the channels are writemasked: > > for (int c = 0; c < 4; c++) { > if (!result_live[c] && inst->dst.writemask & (1 << c)) { > inst->dst.writemask &= ~(1 << c); > progress = true; > > if (inst->dst.writemask == 0) { > if (inst->writes_accumulator || inst->writes_flag()) { > inst->dst = dst_reg(retype(brw_null_reg(), inst->dst.type)); > } else { > inst->opcode = BRW_OPCODE_NOP; > continue; > } > } > } > } > > So we simply set the destination to null, when all channels are dead > but the instruction is still writing the accumulator or the flag. But > that's weird, because as we've learned, the writemask actually affects > which channels of the flag (and presumably the accumulator too) are > written. Also, setting dst to brw_null_reg() will *reset* the > writemask to .xyzw. > > So if you had an instruction with a null destination writing the flag > where only some channels of the flag are live: > > cmp.ge.f0 null.xyzw, src0, src1 > (+f0.x) if > > we would do the right thing, and be left with a writemask of only .x > on the CMP, but if the whole flag is dead... wouldn't we end up with > the same .xyzw writemask that we started with? > > Maybe try simply doing > > for (int c = 0; c < 4; c++) { > if (!result_live[c] && inst->dst.writemask & (1 << c)) { > inst->dst.writemask &= ~(1 << c); > progress = true; > } > } > > if (inst->dst.writemask == 0) { > inst->opcode = BRW_OPCODE_NOP; > continue; > } > > which as a side effect will fix the continue to restart the loop it > was intended to restart (that is, restart the main instruction > scanning loop, not the loop over the channels). > > If that works, then just squash this change into the patch.
Alejandro says this causes a bunch of piglit regressions. In that case, let's skip this. In that case, with the small bits of feedback addressed, this patch is Reviewed-by: Matt Turner <matts...@gmail.com> _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev