On Fri, Nov 13, 2015 at 6:50 PM, Kenneth Graunke <kenn...@whitecape.org> wrote: > The geometry and tessellation control shader stages both read from > multiple URB entries (one per vertex). The thread payload contains > several URB handles which reference these separate memory segments. > > In GLSL, these inputs are represented as per-vertex arrays; the > outermost array index selects which vertex's inputs to read. This > array index does not necessarily need to be constant. > > To handle that, we need to use indirect addressing on GRFs to select > which of the thread payload registers has the appropriate URB handle. > (This is before we can even think about applying the pull model!) > > This patch introduces a new opcode which performs a MOV from a > source using VxH indirect addressing (which allows each of the 8 > SIMD channels to select distinct data.) > > Based on a patch by Jason Ekstrand. > > v2: Rename from INDIRECT_THREAD_PAYLOAD_MOV to MOV_INDIRECT; make it > a bit more generic. Use regs_read() instead of hacking up the > register allocator. (Suggested by Jason Ekstrand.) > > v3: Fix regs_read() to be more accurate for small unaligned regions. > Also rebase on Matt's work. > > Signed-off-by: Kenneth Graunke <kenn...@whitecape.org> > Reviewed-by: Abdiel Janulgue <abdiel.janul...@linux.intel.com> [v1] > > stash > --- > src/mesa/drivers/dri/i965/brw_defines.h | 10 ++++++++ > src/mesa/drivers/dri/i965/brw_fs.cpp | 28 +++++++++++++++++++++ > src/mesa/drivers/dri/i965/brw_fs.h | 5 ++++ > src/mesa/drivers/dri/i965/brw_fs_cse.cpp | 1 + > src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 34 > ++++++++++++++++++++++++++ > src/mesa/drivers/dri/i965/brw_shader.cpp | 2 ++ > 6 files changed, 80 insertions(+) > > Here's an updated version. I changed the description of source 2 to be > "the length of the region that could be accessed, in bytes" rather than > "the maximum value of the indirect offset" (aka the largest possible value > of source 1), as that's what I actually wanted - it's much easier to use. > > (Notably, the largest possible value of the indirect offset means that > it affects the maximum possible base of the read...but you still read > type_sz() bytes beyond that...which is ugly. Size of the region is > much easier to use, as base + size = end...and you don't read beyond > the end.) > > This is all getting into dire "but isn't it off by 1?" territory, so > here are some examples demonstrating how one might use this: > > "I want to read some .xyzw component of a vec4 stored in r7.4." > > src[0] = r7.4<v;w,h>F > src[1] = <indirect offset> > src[2] = 16 bytes > > Because subnr != 0, we calculate subnr * type_sz, and get: > > subnr * sizeof(float) = 4 * 4 = 16 bytes > > We add this to region_length to get a total size of 32 bytes. > DIV_ROUND_UP(32, REG_SIZE) = 32 / 32 = 1 register read. > > Correct, because we're only reading the last 4 channels of r7, > so our region is contained within 1 register. > > "I want to read up to 6 UD channels starting at r7.7." > > src[0] = r7.1<v;w,h>UD > src[1] = <indirect offset> > src[2] = 6 * sizeof(UD) = 24 bytes > > Because subnr != 0, we calculate subnr * type_sz, and get: > > subnr * sizeof(UD) = 7 * 4 = 28 bytes. > > We add this to region_length to get a total of 24 + 28 = 52 bytes. > DIV_ROUND_UP(52, REG_SIZE) = ceil(52 / 32) = 2 > > Correct, because we read part of r7 and part of r8. > > "I want to get the ICP handles for vertex 2 out of 6 vertices." > > Let's assume the first ICP handle starts at r3. > > The ICP handles for vertex N take up whole register, so we would pick > src[0] = r(first icp handle)<8,8,1>UD > src[1] = <7,6,5,4,3,2,1> * 4 + 2 * REG_SIZE > src[2] = 6 vertices * REG_SIZE = 192 bytes > > subnr == 0, so we don't add anything; region_length stays as 192. > DIV_ROUND_UP(192, 32) = 6. We potentially read r3, r4, r5, r6, r7, r8, > which contain the ICP handles for vertex 0, 1, 2, 3, 4, 5 - 6 vertices. > > So I think this will work.
Yeah, this all seems totally reasonable. Reviewed-by: Jason Ekstrand <jason.ekstr...@intel.com> Thanks for taking the time to get this right! --Jason > diff --git a/src/mesa/drivers/dri/i965/brw_defines.h > b/src/mesa/drivers/dri/i965/brw_defines.h > index 6484484..0b8de63 100644 > --- a/src/mesa/drivers/dri/i965/brw_defines.h > +++ b/src/mesa/drivers/dri/i965/brw_defines.h > @@ -1289,6 +1289,16 @@ enum opcode { > * Calculate the high 32-bits of a 32x32 multiply. > */ > SHADER_OPCODE_MULH, > + > + /** > + * A MOV that uses VxH indirect addressing. > + * > + * Source 0: A register to start from (HW_REG). > + * Source 1: An indirect offset (in bytes, UD GRF). > + * Source 2: The length of the region that could be accessed (in bytes, > + * UD immediate). > + */ > + SHADER_OPCODE_MOV_INDIRECT, > }; > > enum brw_urb_write_flags { > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > b/src/mesa/drivers/dri/i965/brw_fs.cpp > index 80b8c8e..84b5920 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > @@ -840,6 +840,34 @@ fs_inst::regs_read(int arg) const > case SHADER_OPCODE_BARRIER: > return 1; > > + case SHADER_OPCODE_MOV_INDIRECT: > + if (arg == 0) { > + assert(src[2].file == IMM); > + unsigned region_length = src[2].ud; > + > + if (src[0].file == FIXED_GRF) { > + /* If the start of the region is not register aligned, then > + * there's some portion of the register that's technically > + * unread at the beginning. > + * > + * However, the register allocator works in terms of whole > + * registers, and does not use subnr. It assumes that the > + * read starts at the beginning of the register, and extends > + * regs_read() whole registers beyond that. > + * > + * To compensate, we extend the region length to include this > + * unread portion at the beginning. > + */ > + if (src[0].subnr) > + region_length += src[0].subnr * type_sz(src[0].type); > + > + return DIV_ROUND_UP(region_length, REG_SIZE); > + } else { > + assert(!"Invalid register file"); > + } > + } > + break; > + > default: > if (is_tex() && arg == 0 && src[0].file == VGRF) > return mlen; > diff --git a/src/mesa/drivers/dri/i965/brw_fs.h > b/src/mesa/drivers/dri/i965/brw_fs.h > index f40e58b..cbfc07f 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.h > +++ b/src/mesa/drivers/dri/i965/brw_fs.h > @@ -527,6 +527,11 @@ private: > struct brw_reg offset, > struct brw_reg value); > > + void generate_mov_indirect(fs_inst *inst, > + struct brw_reg dst, > + struct brw_reg reg, > + struct brw_reg indirect_byte_offset); > + > bool patch_discard_jumps_to_fb_writes(); > > const struct brw_compiler *compiler; > diff --git a/src/mesa/drivers/dri/i965/brw_fs_cse.cpp > b/src/mesa/drivers/dri/i965/brw_fs_cse.cpp > index 8c67caf..3c40fcd 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs_cse.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs_cse.cpp > @@ -78,6 +78,7 @@ is_expression(const fs_visitor *v, const fs_inst *const > inst) > case FS_OPCODE_LINTERP: > case SHADER_OPCODE_FIND_LIVE_CHANNEL: > case SHADER_OPCODE_BROADCAST: > + case SHADER_OPCODE_MOV_INDIRECT: > return true; > case SHADER_OPCODE_RCP: > case SHADER_OPCODE_RSQ: > diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > index 139cda3..e5a286a 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp > @@ -372,6 +372,36 @@ fs_generator::generate_fb_write(fs_inst *inst, struct > brw_reg payload) > } > > void > +fs_generator::generate_mov_indirect(fs_inst *inst, > + struct brw_reg dst, > + struct brw_reg reg, > + struct brw_reg indirect_byte_offset) > +{ > + assert(indirect_byte_offset.type == BRW_REGISTER_TYPE_UD); > + assert(indirect_byte_offset.file == BRW_GENERAL_REGISTER_FILE); > + > + unsigned imm_byte_offset = reg.nr * REG_SIZE + reg.subnr; > + > + /* We use VxH indirect addressing, clobbering a0.0 through a0.7. */ > + struct brw_reg addr = vec8(brw_address_reg(0)); > + > + /* The destination stride of an instruction (in bytes) must be greater > + * than or equal to the size of the rest of the instruction. Since the > + * address register is of type UW, we can't use a D-type instruction. > + * In order to get around this, re re-type to UW and use a stride. > + */ > + indirect_byte_offset = > + retype(spread(indirect_byte_offset, 2), BRW_REGISTER_TYPE_UW); > + > + /* Prior to Broadwell, there are only 8 address registers. */ > + assert(inst->exec_size == 8 || devinfo->gen >= 8); > + > + brw_MOV(p, addr, indirect_byte_offset); > + brw_inst_set_mask_control(devinfo, brw_last_inst, BRW_MASK_DISABLE); > + brw_MOV(p, dst, retype(brw_VxH_indirect(0, imm_byte_offset), dst.type)); > +} > + > +void > fs_generator::generate_urb_read(fs_inst *inst, > struct brw_reg dst, > struct brw_reg header) > @@ -2079,6 +2109,10 @@ fs_generator::generate_code(const cfg_t *cfg, int > dispatch_width) > fill_count++; > break; > > + case SHADER_OPCODE_MOV_INDIRECT: > + generate_mov_indirect(inst, dst, src[0], src[1]); > + break; > + > case SHADER_OPCODE_URB_READ_SIMD8: > case SHADER_OPCODE_URB_READ_SIMD8_PER_SLOT: > generate_urb_read(inst, dst, src[0]); > diff --git a/src/mesa/drivers/dri/i965/brw_shader.cpp > b/src/mesa/drivers/dri/i965/brw_shader.cpp > index a438e18..50b288b 100644 > --- a/src/mesa/drivers/dri/i965/brw_shader.cpp > +++ b/src/mesa/drivers/dri/i965/brw_shader.cpp > @@ -554,6 +554,8 @@ brw_instruction_name(enum opcode op) > return "barrier"; > case SHADER_OPCODE_MULH: > return "mulh"; > + case SHADER_OPCODE_MOV_INDIRECT: > + return "mov_indirect"; > } > > unreachable("not reached"); > -- > 2.6.2 > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev