On Sun, Feb 08, 2015 at 02:48:02PM -0800, Matt Turner wrote: > On Sun, Feb 8, 2015 at 1:59 PM, Ben Widawsky > <benjamin.widaw...@intel.com> wrote: > > Certain platforms support the ability to sample from a texture, and write > > it out > > to the file RT - thus saving a costly send instructions (and before Topi's > > recent patch, excess MOVs), > > > > On BSW, the performance data isn't quite what I was hoping for. > > Statistically, > > it is an improvement. Several microbenchmarks improve between 0-2%, and > > Egypt > > I wouldn't be disappointed. That's pretty nice actually. Additionally, > I read some stuff that said this decreases energy consumption as well. >
Yeah - the disappointing part was it had absolutely no impact on the benchmark I was looking to improve. > > improves by similar amounts. No benchmarks regress with n=10. BDW and SKL > > also > > support this, testing would be welcome. > > SKL does, but I don't think BDW does. In fact, there's a couple of > assertions that say the same thing in this patch. :) > Oops. Yes, BDW does not support it. I'm glad I made this mistake though and that you caught it, because it made me find the parts in the spec again. I'll leave out the details because I am not sure what's safe to publish, but the first part is: "Note: The use of sampler messages with response length of zero is not recommended unless large polygons are being drawn, such as would be found in a GUI workload." > > The ministat with n=10 > > Egypt: > > 1.94611% +/- 1.35665% > > > > Egpyt Offscreen: > > 1.2429% +/- 0.834759% > > > > v2: Modify the algorithm. Instead of iterating in reverse through blocks and > > insts, since the last block/inst is the only thing which can benefit. > > Rebased on top of Ken's patching modifying is_last_send > > > > Cc: Kenneth Graunke <kenn...@whitecape.org> > > Cc: Jason Ekstrand <jason.ekstr...@intel.com> > > Signed-off-by: Ben Widawsky <b...@bwidawsk.net> > > --- > > src/mesa/drivers/dri/i965/brw_fs.cpp | 101 > > +++++++++++++++++++++ > > src/mesa/drivers/dri/i965/brw_fs.h | 3 + > > .../dri/i965/brw_fs_dead_code_eliminate.cpp | 2 +- > > src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 +++ > > 4 files changed, 117 insertions(+), 1 deletion(-) > > > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > > b/src/mesa/drivers/dri/i965/brw_fs.cpp > > index 78feee2..4b8fb4a 100644 > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > > @@ -2452,6 +2452,106 @@ fs_visitor::opt_algebraic() > > return progress; > > } > > > > +/* Set up the sampler message to also have EOT set */ > > +void > > +fs_visitor::combine_send_tex(struct bblock_t *block, > > + fs_inst *tex_inst, > > + fs_inst *fb_write) > > +{ > > + assert((tex_inst->offset & (0xff << 24)) == 0); > > + assert(!tex_inst->eot); /* We can't get here twice */ > > + > > + fs_inst *old_load_payload = (fs_inst *) tex_inst->prev; > > + /* It's unlikely yet possible the load_payload is further above */ > > + if (old_load_payload->opcode != SHADER_OPCODE_LOAD_PAYLOAD) { > > + perf_debug("Potential send collapse skipped: Where's LOAD_PAYLOAD?"); > > We've never used perf_debug() to note that individual optimization > passes didn't work. I'm not sure how I feel about it. I can/will scrap this. > > > + return; > > + } > > + > > + tex_inst->offset |= fb_write->target << 24; > > + tex_inst->eot = true; > > + fb_write->remove(block); > > + > > + /* The texture instruction with EOT requires a header. If one is already > > + * present, there is nothing to do. > > + */ > > + if (tex_inst->header_present) > > + return; > > + > > + /* If a header is not present, we need to create a new LOAD_PAYLOAD > > command > > + * with the same sources and a space saved for the header. Using a new > > + * destination register not only makes sure we have enough space, but > > it will > > + * make sure the dead code eliminator kills the old command. > > s/command/instruction/ > > > + */ > > + tex_inst->mlen++; > > + tex_inst->header_present = true; > > + > > + fs_reg send_header = vgrf(old_load_payload->sources + 1); > > + fs_reg *new_sources = > > + ralloc_array(mem_ctx, fs_reg, old_load_payload->sources + 1); > > + > > + new_sources[0] = fs_reg(); > > + for (int i = 0; i < old_load_payload->sources; i++) > > + new_sources[i+1] = old_load_payload->src[i]; > > + > > + /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, > > it > > + * requires a lot of information about the sources to appropriately > > figure > > + * out the number of registers needed to be used. Given this stage in > > our > > + * optimization, we may not have the appropriate GRFs required by > > + * LOAD_PAYLOAD at this point (copy propogation). Therefore, we need to > > typo: propagation > > I'm not sure what "w e may not have the appropriate GRFs ..." means? Here is the relevant part of the original IRC conversation: jekstrand [08:52:30] No, the problem is uniforms and immediates. jekstrand [08:52:58] They can't go in a LOAD_PAYLOAD directly because we don't know how many destination registers they take up. jekstrand [08:54:16] for LOAD_PAYLOAD to work, we need more information than a regular instruction. We need to know how many destination registers a given source takes up, we need to know whether it needs to use the second-half quarter control for the MOV that gets generated, etc. jekstrand [08:54:34] Using GRF sources more-or-less gives us this. Immediates don't. bwidawks [08:54:55] right - this is what confuses me though... the immediates seem to already be there. jekstrand [08:55:38] Right. The immediates can get there through copy-propagation and that's fine. However, they're not there when it's created. jekstrand [08:55:43] It's all a mess Do you have a preferred way to state this concisely? > > > + * manually emit the instruction. > > + * > > + * jekstrand: "The helper is too automagic for what you're doing." > > I'd leave this out. > > > + */ > > + fs_inst *new_load_payload = new(mem_ctx) > > fs_inst(SHADER_OPCODE_LOAD_PAYLOAD, > > + > > old_load_payload->exec_size, > > + send_header, > > + new_sources, > > + > > old_load_payload->sources + 1); > > + > > + new_load_payload->regs_written = old_load_payload->regs_written + 1; > > + > > + tex_inst->insert_before(block, new_load_payload); > > + tex_inst->src[0] = send_header; > > + > > Extra line here. > > > +} > > + > > +/* Try to collapse two sends into one. > > + * > > + * The first send should be sampling from a texture, the second send > > should be > > + * writing out the returned information to the RT. > > It's nice to give some information about hardware requirements. > Instead how about something like: > > CHV, SKL, and newer can mark a texturing SEND instruction with EOT to > have its results sent directly to the framebuffer, bypassing the EU. > > Recognize texturing results copied to the framebuffer write payload > and modify them to write to the framebuffer directly. > > > + */ > > +bool > > +fs_visitor::opt_sampler_eot() > > +{ > > + brw_wm_prog_key *key = (brw_wm_prog_key*) this->key; > > + > > + if (brw->gen < 9 && !brw->is_cherryview) > > + return false; > > + > > + if (key->nr_color_regions != 1) > > + return false; > > To confirm, there's not anything that actually prevents us from doing > this with the last write, is there? If not, let's put a FINISHME > comment here. > I am not actually sure if it is, or isn't possible - though I believe it is. I'll add this and mention that the finisher needs to verify it's possible first. > > + > > + fs_inst *fb_write = (fs_inst *) cfg->blocks[cfg->num_blocks - 1]->end(); > > + fs_inst *tex_inst = (fs_inst *) fb_write->prev; > > + > > + if (tex_inst->is_head_sentinel()) > > + return false; > > + > > + /* It will potentially be multiple iterations of optimizations before > > we get > > + * the condition we wish to optimize for. > > + */ > > + if (!tex_inst->is_tex()) > > + return false; > > + > > Can probably remove this if we move this optimization outside of the > loop (as mentioned below). > > > + assert(fb_write->eot); > > + assert(fb_write->opcode == FS_OPCODE_FB_WRITE); > > + > > + combine_send_tex(cfg->blocks[cfg->num_blocks - 1], tex_inst, fb_write); > > I'm not sure I see a lot of value in the separate combine_send_tex() > function. It's not that big and it's not reused. Inline it? > > > + > > + return true; > > +} > > + > > bool > > fs_visitor::opt_register_renaming() > > { > > @@ -3609,6 +3709,7 @@ fs_visitor::optimize() > > OPT(opt_peephole_predicated_break); > > OPT(opt_cmod_propagation); > > OPT(dead_code_eliminate); > > + OPT(opt_sampler_eot); > > Do you think we really need to do this in the optimization loop? > > I don't expect this to allow other optimization passes to make > additional progress, and we can obviously do it successfully only > once. I suspect we can do it after the optimization loop. > It's possible I didn't quite spot where you want me to put the optimization. I think that the way the code works right now, that will not work. The optimization is depending on DCE to kill off the only LOAD_PAYLOAD and it's corresponding MOVs. I agree that it is an optimization that can only occur once, and generally it doesn't belong in the progress loop though. > > OPT(opt_peephole_sel); > > OPT(dead_control_flow_eliminate, this); > > OPT(opt_register_renaming); > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.h > > b/src/mesa/drivers/dri/i965/brw_fs.h > > index b95e2c0..c9f5ab0 100644 > > --- a/src/mesa/drivers/dri/i965/brw_fs.h > > +++ b/src/mesa/drivers/dri/i965/brw_fs.h > > @@ -459,6 +459,9 @@ public: > > bool compute_to_mrf(); > > bool dead_code_eliminate(); > > bool remove_duplicate_mrf_writes(); > > + > > + void combine_send_tex(bblock_t *block, fs_inst *tex_inst, fs_inst > > *fb_write); > > + bool opt_sampler_eot(); > > bool virtual_grf_interferes(int a, int b); > > void schedule_instructions(instruction_scheduler_mode mode); > > void insert_gen4_send_dependency_workarounds(); > > diff --git a/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp > > b/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp > > index d66808b..e5e52c8 100644 > > --- a/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp > > +++ b/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp > > @@ -52,7 +52,7 @@ fs_visitor::dead_code_eliminate() > > sizeof(BITSET_WORD)); > > > > foreach_inst_in_block_reverse(fs_inst, inst, block) { > > - if (inst->dst.file == GRF && !inst->has_side_effects()) { > > + if (!inst->eot && inst->dst.file == GRF && > > !inst->has_side_effects()) { > > I think it'd probably make sense to just put inst->eot in has_side_effects(). > > > bool result_live = false; > > > > if (inst->regs_written == 1) { > > diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > > b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > > index 8cd36f8..9f2ebaf 100644 > > --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > > +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > > @@ -503,6 +503,7 @@ fs_generator::generate_tex(fs_inst *inst, struct > > brw_reg dst, struct brw_reg src > > int rlen = 4; > > uint32_t simd_mode; > > uint32_t return_format; > > + bool is_combined_send = inst->eot; > > > > switch (dst.type) { > > case BRW_REGISTER_TYPE_D: > > @@ -662,6 +663,11 @@ fs_generator::generate_tex(fs_inst *inst, struct > > brw_reg dst, struct brw_reg src > > dst = vec16(dst); > > } > > > > + if (is_combined_send) { > > + assert(brw->gen >= 9 || brw->is_cherryview); > > + rlen = 0; > > + } > > + > > assert(brw->gen < 7 || !inst->header_present || > > src.file == BRW_GENERAL_REGISTER_FILE); > > > > @@ -777,6 +783,12 @@ fs_generator::generate_tex(fs_inst *inst, struct > > brw_reg dst, struct brw_reg src > > * so has already done marking. > > */ > > } > > + > > + if (is_combined_send) { > > + brw_inst_set_eot(brw, brw_last_inst, true); > > + brw_set_dest(p, brw_last_inst, brw_null_reg()); > > I think you should set the destination of the SEND to the null > register in your optimization and remove this line. > > That would also mean you didn't need to modify dead code elimination > (since the SEND wouldn't have a GRF destination). > > > + brw_inst_set_opcode(brw, brw_last_inst, BRW_OPCODE_SENDC); > > + } > > } Everything I didn't comment on sounds good to me, and I'll give it a shot when I find some free time. Unfortunately, this kind of optimization I think does require real benchmarks as opposed to shader-db, since it won't be eliminating many instructions. Right now, benchmarking on BSW and SKL is both painfully slow :-( _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev