On 06/02/2015 10:54 AM, Kenneth Graunke wrote:
> On Monday, June 01, 2015 03:14:26 PM Abdiel Janulgue wrote:
>> This patch implements the binding table enable command which is also
>> used to allocate a binding table pool where where hardware-generated
>> binding table entries are flushed into. Each binding table offset in
>> the binding table pool is unique per each shader stage that are
>> enabled within a batch.
>>
>> Also insert the required brw_tracked_state objects to enable
>> hw-generated binding tables in normal render path.
>>
>> v2: Clarify start of binding table pool offsets. (Topi)
>>
>> Signed-off-by: Abdiel Janulgue <abdiel.janul...@linux.intel.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_binding_tables.c | 101 +++++++++++++++++++++++++
>>  src/mesa/drivers/dri/i965/brw_context.c        |   4 +
>>  src/mesa/drivers/dri/i965/brw_context.h        |   6 ++
>>  src/mesa/drivers/dri/i965/brw_state.h          |   6 ++
>>  src/mesa/drivers/dri/i965/brw_state_upload.c   |   4 +
>>  src/mesa/drivers/dri/i965/intel_batchbuffer.c  |   4 +
>>  6 files changed, 125 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_binding_tables.c b/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> index 98ff0dd..7554818 100644
>> --- a/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> +++ b/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> @@ -45,6 +45,23 @@
>>  #include "intel_batchbuffer.h"
>>
>>  /**
>> + * We are required to start at this offset for binding table pointer state when
>> + * HW-generated binding table is enabled otherwise the GPU will hung. Note that
>
> typo: "will hang" (not hung)
>
>> + * the binding table offsets are now relative to the binding tabe pool base
>
> typo: "table pool base" (not tabe pool)
>
>> + * address instead of from the state batch.
>> + *
>> + * From the Bspec 3DSTATE_BINDING_TABLE_POINTERS_{PS/VS/GS/DS/HS} > Pointer to
>> + * PS Binding Table section lists the format as:
>> + *
>> + *    "SurfaceStateOffset[16:6]BINDING_TABLE_STATE*256    When
>> + *     HW-generated binding table is enabled"
>> + *
>> + * When HW-generated binding tables are enabled, Surface State Offsets are
>> + * 16-bit entries.
>> + */
>> +static const uint32_t hw_bt_start_offset = 256 * sizeof(uint16_t);
>
> I'm confused.  I don't see how this citation is relevant.  Binding
> tables are always 256 entries, regardless of the resource streamer.
>
> The text you cited is just saying that entries need to be 64 byte
> aligned when using hardware binding tables, rather than the 32 byte
> alignment that's sufficient when using software binding tables.
>
> It does appear that we gain an extra bit - bit 16 used to MBZ, but
> now it's valid.
>
> You enforce the 64-byte alignment restriction in patch 5, via this line:
>
>    brw->hw_bt_pool.next_offset += ALIGN(prog_data->binding_table.size_bytes,
>                                         64);
>
> At any rate, the text doesn't say *anything* about having to add
> something to the starting address.  Offset 0 seems perfectly valid.
>
> Is this really necessary?
>
I'm the one who is being funny here. After looking harder and doing some
archaeological digging through my previous RS enabling efforts, I came to
the conclusion that you are right. The reason the hardware *seems* to
enforce this arbitrary offset is that I skipped the "disable RS on state
base address update" workaround. Now that I've reintroduced it, the
hardware works completely fine even from offset zero. I'll update the
code in v2.

>> +
>> +/**
>>  * Upload a shader stage's binding table as indirect state.
>>  *
>>  * This copies brw_stage_state::surf_offset[] into the indirect state section
>> @@ -170,6 +187,90 @@ const struct brw_tracked_state brw_gs_binding_table = {
>>     .emit = brw_gs_upload_binding_table,
>>  };
>>
>> +/**
>> + * Hardware-generated binding tables for the resource streamer
>> + */
>> +void
>> +gen7_disable_hw_binding_tables(struct brw_context *brw)
>> +{
>> +   int pkt_len = brw->gen >= 8 ? 4 : 3;
>> +
>> +   BEGIN_BATCH(3);
>> +   OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
>> +   OUT_BATCH(brw->is_haswell ? HSW_HW_BINDING_TABLE_RESERVED : 0);
>> +   OUT_BATCH(0);
>> +   ADVANCE_BATCH();
>> +
>> +   /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
>> +    * Tables > Programming note
>> +
>> +    * "When switching between HW and SW binding table generation, SW must
>> +    *  issue a state cache invalidate."
>> +    */
>
> Please cite the public documentation where possible.
>
> /* From the Haswell PRM, Volume 7: 3D Media GPGPU,
>  * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
>  *
>  * "When switching between HW and SW binding table generation, SW must
>  *  issue a state cache invalidate."
>  */
>
>> +   brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);
>
> Is it worth throwing out hw_bt_pool.bo?
>
Probably not. I was hoping to keep that bo re-usable so we don't have to
re-allocate it after every submit.
As you mentioned below, I guess we don't have to worry about mucking up
the offsets since only the GPU writes to the pool.

>> +}
>> +
>> +void
>> +gen7_enable_hw_binding_tables(struct brw_context *brw)
>> +{
>> +   if (!brw->has_resource_streamer) {
>> +      gen7_disable_hw_binding_tables(brw);
>> +      return;
>
> kind of funny having a function called gen7_enable_hw_binding_tables
> that disables hardware binding tables, but...hey :)
>
>> +   }
>> +
>> +   if (!brw->hw_bt_pool.bo) {
>> +      /* We use a single re-usable buffer object for the lifetime of the
>> +       * context and size it to maximum allowed binding tables that can be
>> +       * programmed per batch:
>> +       *
>> +       * BSpec, 3D Pipeline > Resource Streamer > Hardware Binding Tables:
>> +       * "A maximum of 16,383 Binding tables are allowed in any batch buffer"
>> +       */
>
> /* From the Haswell PRM, Volume 7: 3D Media GPGPU,
>  * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
>  *
>  * "A maximum of 16,383 Binding Tables are allowed in any batch buffer."
>  */
>
>> +      static const int max_size = 16383 * 4;
>
> While I believe this does enforce the restriction, it's a bit strange.
>
> My reading of the restriction is that you can only have 16,383 *binding
> tables*, each of which could have 1-256 entries.  So the maximum
> possible value would be 16383 * 256 * 4.  But, you'd have to count the
> number of binding tables you emit...which seems like a pain.
>
> This enforces that no more than 16,383 binding table *entries* can be
> put in the buffer, which seems OK.  It also means that a batch buffer
> can't refer to more than that many...
>
> That said, I think you've got a nasty bug here: there is no code to
> detect when you've filled up the buffer (unless I've missed something?).
> Presumably, the batch buffer usually fills up, triggering an implicit
> intel_batchbuffer_flush that calls gen7_reset_rs_pool_offsets() and
> saves you from buffer overflow.
>
> However, if the batch buffer were larger, or we repeatedly drew using
> shaders that have large binding tables and relatively few other state
> changes, I imagine it could fill up and access out of bounds memory,
> crashing the GPU.  Best to guard against this explicitly.
>
>> +      brw->hw_bt_pool.bo = drm_intel_bo_alloc(brw->bufmgr, "hw_bt",
>> +                                              max_size, 64);
>> +      brw->hw_bt_pool.next_offset = hw_bt_start_offset;
>> +   }
>> +
>> +   int pkt_len = brw->gen >= 8 ? 4 : 3;
>> +   uint32_t dw1 = BRW_HW_BINDING_TABLE_ENABLE;
>> +   if (brw->is_haswell)
>> +      dw1 |= SET_FIELD(GEN7_MOCS_L3, GEN7_HW_BT_MOCS) |
>> +             HSW_HW_BINDING_TABLE_RESERVED;
>
> We should set MOCS on Broadwell too.  (Note that it has a different bit
> location...)
>
>> +
>> +   BEGIN_BATCH(3);
>> +   OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
>> +   if (brw->gen >= 8)
>> +      OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>> +   else
>> +      OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>> +   OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
>> +             brw->hw_bt_pool.bo->size);
>
> This is wrong for Broadwell.  On Broadwell, you simply program the
> buffer size, not a relocation to the maximum address in the buffer.
>
> i.e. it should be
>
> if (brw->gen >= 8) {
>    OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>    OUT_BATCH(brw->hw_bt_pool.bo->size);
> } else {
>    OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>    OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
>              brw->hw_bt_pool.bo->size);
> }
>
>> +   ADVANCE_BATCH();
>> +
>> +   /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
>> +    * Tables > Programming note
>> +
>> +    * "When switching between HW and SW binding table generation, SW must
>> +    *  issue a state cache invalidate."
>> +    */
>> +   brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);
>> +}
>> +
>> +void
>> +gen7_reset_rs_pool_offsets(struct brw_context *brw)
>
> Let's call this hw_bt_pool.  The resource streamer is related, but this
> is about hardware binding tables.
>
>> +{
>> +   brw->hw_bt_pool.next_offset = hw_bt_start_offset;
>> +}
>
> I was concerned about trashing data here - if we flush the batch and
> reset the offsets, but don't allocate a new buffer, the next batch will
> happily scribble over our old data.  But I suppose this is OK: because
> the GPU writes the data (via 3DSTATE_BINDING_TABLE_POINTERS_XS?), not
> the CPU, it should be pipelined, so the old data will no longer be in
> use.  So it should be safe.

Yep, the CPU does not write to the binding table pool at all.

>
>> +
>> +const struct brw_tracked_state gen7_hw_binding_tables = {
>> +   .dirty = {
>> +      .mesa = 0,
>> +      .brw = BRW_NEW_BATCH,
>> +   },
>> +   .emit = gen7_enable_hw_binding_tables
>> +};
>> +
>>  /** @} */
>>
>>  /**
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
>> index 274a237..e17046a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -958,6 +958,10 @@ intelDestroyContext(__DRIcontext * driContextPriv)
>>     if (brw->wm.base.scratch_bo)
>>        drm_intel_bo_unreference(brw->wm.base.scratch_bo);
>>
>> +   gen7_reset_rs_pool_offsets(brw);
>
> Not sure if this is worth doing.  Either way I guess.
>
>> +   drm_intel_bo_unreference(brw->hw_bt_pool.bo);
>> +   brw->hw_bt_pool.bo = NULL;
>> +
>>     drm_intel_gem_context_destroy(brw->hw_ctx);
>>
>>     if (ctx->swrast_context) {
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
>> index 3f8e59d..94127b6 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -1404,6 +1404,12 @@ struct brw_context
>>        struct brw_cs_prog_data *prog_data;
>>     } cs;
>>
>> +   /* RS hardware binding table */
>> +   struct {
>> +      drm_intel_bo *bo;
>> +      uint32_t next_offset;
>> +   } hw_bt_pool;
>> +
>>     struct {
>>        uint32_t state_offset;
>>        uint32_t blend_state_offset;
>> diff --git a/src/mesa/drivers/dri/i965/brw_state.h b/src/mesa/drivers/dri/i965/brw_state.h
>> index 987672f..622ce3f 100644
>> --- a/src/mesa/drivers/dri/i965/brw_state.h
>> +++ b/src/mesa/drivers/dri/i965/brw_state.h
>> @@ -132,6 +132,7 @@ extern const struct brw_tracked_state gen7_sol_state;
>>  extern const struct brw_tracked_state gen7_urb;
>>  extern const struct brw_tracked_state gen7_vs_state;
>>  extern const struct brw_tracked_state gen7_wm_state;
>> +extern const struct brw_tracked_state gen7_hw_binding_tables;
>>  extern const struct brw_tracked_state haswell_cut_index;
>>  extern const struct brw_tracked_state gen8_blend_state;
>>  extern const struct brw_tracked_state gen8_disable_stages;
>> @@ -372,6 +373,11 @@ gen7_upload_constant_state(struct brw_context *brw,
>>                             const struct brw_stage_state *stage_state,
>>                             bool active, unsigned opcode);
>>
>> +void gen7_rs_control(struct brw_context *brw, int enable);
>> +void gen7_enable_hw_binding_tables(struct brw_context *brw);
>> +void gen7_disable_hw_binding_tables(struct brw_context *brw);
>> +void gen7_reset_rs_pool_offsets(struct brw_context *brw);
>> +
>>  #ifdef __cplusplus
>>  }
>>  #endif
>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c
>> index 84b0861..a2a63f3 100644
>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
>> @@ -191,6 +191,8 @@ static const struct brw_tracked_state *gen7_render_atoms[] =
>>     &gen6_color_calc_state, /* must do before cc unit */
>>     &gen6_depth_stencil_state, /* must do before cc unit */
>>
>> +   &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Haswell */
>> +
>>     &gen6_vs_push_constants, /* Before vs_state */
>>     &gen6_gs_push_constants, /* Before gs_state */
>>     &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
>> @@ -267,6 +269,8 @@ static const struct brw_tracked_state *gen8_render_atoms[] =
>>     &gen8_blend_state,
>>     &gen6_color_calc_state,
>>
>> +   &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Broadwell */
>> +
>>     &gen6_vs_push_constants, /* Before vs_state */
>>     &gen6_gs_push_constants, /* Before gs_state */
>>     &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
>> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> index a2a3a95..caeb31b 100644
>> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> @@ -32,6 +32,7 @@
>>  #include "intel_buffers.h"
>>  #include "intel_fbo.h"
>>  #include "brw_context.h"
>> +#include "brw_state.h"
>>
>>  #include <xf86drm.h>
>>  #include <i915_drm.h>
>> @@ -379,6 +380,9 @@ _intel_batchbuffer_flush(struct brw_context *brw,
>>        drm_intel_bo_wait_rendering(brw->batch.bo);
>>     }
>>
>> +   if (brw->gen >= 7)
>> +      gen7_reset_rs_pool_offsets(brw);
>> +
>
> Checking brw->use_resource_streamer would make more sense, since
> Ivybridge is Gen7 and it doesn't need this.  Or, just calling it
> universally since it's 1 line of harmless code on other platforms.
>
I'll put this in.

>>     /* Start a new batch buffer.
>>      */
>>     brw_new_batch(brw);

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev