On Fri, Apr 10, 2015 at 12:52 PM, Ben Widawsky <benjamin.widaw...@intel.com> wrote: > Certain platforms support the ability to sample from a texture, and write it > out > to the file RT - thus saving a costly send instructions (note that this is a > potnential win if one wanted to backport to a tag that didn't have the patch > from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2), > > v2: Modify the algorithm. Instead of iterating in reverse through blocks and > insts, since the last block/inst is the only thing which can benefit. Rebased > on top of Ken's patching modifying is_last_send > > v3: Rebased over almost 2 months, and Incorporated feedback from Matt: > Some comment typo fixes and rewordings. > Whitespace > Move the optimization pass outside of the optimize loop > > v4: Some cosmetic changes requested from Ken. These changes ensured that the > optimization function always returned true when an optimization occurred, and > false when one did not. This behavior did not exist with the original patch. > As > a result, having the separate helper function which Matt did not like no > longer > made sense, and so now I believe everyone should be happy. > > Braswell data: > Benchmark (n=20) %diff > *OglBatch5 -1.4 > *OglBatch7 -1.79 > OglFillTexMulti 5.57 > OglFillTexSingle 1.16 > OglShMapPcf 0.05 > OglTexFilterAniso 3.01 > OglTexFilterTri 1.94 > > SKL data: > NONE COLLECTED > > No piglit regressions: > (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/) > > [*] I believe my measurements are incorrect for Batch5-7. If I add this new > optimization, but never emit the new instruction I see similar results. > > Cc: Matt Turner <matts...@gmail.com> > Cc: Kenneth Graunke <kenn...@whitecape.org> > Cc: Jason Ekstrand <ja...@jlekstrand.net> > Signed-off-by: Ben Widawsky <b...@bwidawsk.net> > --- > src/mesa/drivers/dri/i965/brw_fs.cpp | 92 > ++++++++++++++++++++++++++ > src/mesa/drivers/dri/i965/brw_fs.h | 3 + > src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 ++++ > 3 files changed, 107 insertions(+) > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > b/src/mesa/drivers/dri/i965/brw_fs.cpp > index 72000cf..d56d053 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > @@ -2557,6 +2557,96 @@ fs_visitor::opt_algebraic() > return progress; > } > > +/** > + * Optimize sample messages which are followed by the final RT write. > + * > + * CHV, and GEN9+ can mark a texturing SEND instruction with EOT to have its > + * results sent directly to the framebuffer, bypassing the EU. Recognize the > + * final texturing results copied to the framebuffer write payload and modify > + * them to write to the framebuffer directly. > + */ > +bool > +fs_visitor::opt_sampler_eot() > +{ > + brw_wm_prog_key *key = (brw_wm_prog_key*) this->key; > + > + if (brw->gen < 9 && !brw->is_cherryview) > + return false; > + > + /* FINISHME: It should be possible to implement this optimization when > there > + * are multiple drawbuffers. > + */ > + if (key->nr_color_regions != 1) > + return false; > + > + /* Look for a texturing instruction immediately before the final > FB_WRITE. */ > + fs_inst *fb_write = (fs_inst *) cfg->blocks[cfg->num_blocks - 1]->end(); > + assert(fb_write->eot); > + assert(fb_write->opcode == FS_OPCODE_FB_WRITE); > + > + if (unlikely(fb_write->is_head_sentinel())) > + return false;
This is impossible. The assertions above are already assuming (rightly so) that it's an actual instruction. Remove these two lines. > + > + fs_inst *tex_inst = (fs_inst *) fb_write->prev; > + > + /* There wasn't one; nothing to do. */ > + if (unlikely(tex_inst->is_head_sentinel()) || !tex_inst->is_tex()) > + return false; This looks good. > + > + /* If there's no header present, we need to munge the LOAD_PAYLOAD as > well. > + * It's very likely to be the previous instruction. > + */ > + fs_inst *load_payload = (fs_inst *) tex_inst->prev; > + if (load_payload->is_head_sentinel() || > + load_payload->opcode != SHADER_OPCODE_LOAD_PAYLOAD) > + return false; > + > + assert(!tex_inst->eot); /* We can't get here twice */ > + assert((tex_inst->offset & (0xff << 24)) == 0); > + > + tex_inst->offset |= fb_write->target << 24; > + tex_inst->eot = true; > + fb_write->remove(cfg->blocks[cfg->num_blocks - 1]); > + > + /* If a header is present, marking the eot is sufficient. Otherwise, we > need > + * to create a new LOAD_PAYLOAD command with the same sources and a space > + * saved for the header. Using a new destination register not only makes > sure > + * we have enough space, but it will make sure the dead code eliminator > kills > + * the instruction that this will replace. > + */ > + if (tex_inst->header_present) > + return true; > + > + fs_reg send_header = vgrf(load_payload->sources + 1); > + fs_reg *new_sources = > + ralloc_array(mem_ctx, fs_reg, load_payload->sources + 1); > + > + new_sources[0] = fs_reg(); > + for (int i = 0; i < load_payload->sources; i++) > + new_sources[i+1] = load_payload->src[i]; > + > + /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, it > + * requires a lot of information about the sources to appropriately figure > + * out the number of registers needed to be used. Given this stage in our > + * optimization, we may not have the appropriate GRFs required by > + * LOAD_PAYLOAD at this point (copy propagation). Therefore, we need to > + * manually emit the instruction. > + */ > + fs_inst *new_load_payload = new(mem_ctx) > fs_inst(SHADER_OPCODE_LOAD_PAYLOAD, > + load_payload->exec_size, > + send_header, > + new_sources, > + load_payload->sources + > 1); You don't have to change it if you don't want to, but I often try to indent this like fs_inst *new_load_payload = new(mem_ctx) fs_inst(SHADER_OPCODE_LOAD_PAYLOAD, ... aligning parameters that don't fit on the new(mem_ctx) line with the first parameter to fs_inst. > + > + new_load_payload->regs_written = load_payload->regs_written + 1; > + tex_inst->mlen++; > + tex_inst->header_present = true; > + tex_inst->insert_before(cfg->blocks[cfg->num_blocks - 1], > new_load_payload); > + tex_inst->src[0] = send_header; Set the tex_inst's destination to reg_null_ud or something here so that... > + > + return true; > +} > + > bool > fs_visitor::opt_register_renaming() > { > @@ -3763,6 +3853,8 @@ fs_visitor::optimize() > > pass_num = 0; > > + OPT(opt_sampler_eot); > + > if (OPT(lower_load_payload)) { > split_virtual_grfs(); > OPT(register_coalesce); > diff --git a/src/mesa/drivers/dri/i965/brw_fs.h > b/src/mesa/drivers/dri/i965/brw_fs.h > index 278a8ee..fcb4b63 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.h > +++ b/src/mesa/drivers/dri/i965/brw_fs.h > @@ -231,6 +231,9 @@ public: > bool compute_to_mrf(); > bool dead_code_eliminate(); > bool remove_duplicate_mrf_writes(); > + > + void combine_send_tex(bblock_t *block, fs_inst *tex_inst, fs_inst > *fb_write); > + bool opt_sampler_eot(); > bool virtual_grf_interferes(int a, int b); > void schedule_instructions(instruction_scheduler_mode mode); > void insert_gen4_send_dependency_workarounds(); > diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > index 40e51aa..1eb468f 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > @@ -517,6 +517,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg > dst, struct brw_reg src > int rlen = 4; > uint32_t simd_mode; > uint32_t return_format; > + bool is_combined_send = inst->eot; > > switch (dst.type) { > case BRW_REGISTER_TYPE_D: > @@ -676,6 +677,11 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg > dst, struct brw_reg src > dst = vec16(dst); > } > > + if (is_combined_send) { > + assert(brw->gen >= 9 || brw->is_cherryview); > + rlen = 0; > + } > + > assert(brw->gen < 7 || !inst->header_present || > src.file == BRW_GENERAL_REGISTER_FILE); > > @@ -781,6 +787,12 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg > dst, struct brw_reg src > * so has already done marking. > */ > } > + > + if (is_combined_send) { > + brw_inst_set_eot(brw, brw_last_inst, true); > + brw_set_dest(p, brw_last_inst, brw_null_reg()); ... you don't need to set brw_set_dest here. Also, I don't think you need to brw_inst_set_eot here. Since you're setting tex_inst->eot it'll be set by brw_set_message_descriptor(). With those things fixed, Reviewed-by: Matt Turner <matts...@gmail.com> Thanks Ben. This is a pretty cool optimization. > + brw_inst_set_opcode(brw, brw_last_inst, BRW_OPCODE_SENDC); > + } > } > > > -- > 2.3.5 > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev