When I tested this, I saw an intermittent BSW gpu hang. I haven't been able to confirm that it is due to the host-mem-barrier test.
-Mark Francisco Jerez <curroje...@riseup.net> writes: > The hardware documentation relating to the UAV HW-assisted coherency > mechanism and UAV access enable bits is scarce and sometimes > contradictory, and there's quite some guesswork behind this commit, so > let me summarize the background first: HSW and later hardware have > infrastructure to support a stricter form of data coherency between > shader invocations from separate primitives. The mechanism is > controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and > _PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on > the 3DPRIMITIVE command. > > Regardless of whether "UAV Coherency Required" is set, the hardware > fixed-function units will increment a per-stage semaphore for each > request received if "Accesses UAV" is set for the same or any lower > stage. An implicit DC flush is emitted by the lowermost stage with > "Accesses UAV" set once it's done processing the request, this also > happens regardless of the value of "UAV Coherency Required". The > completion of the DC flush will cause the same stage and all previous > ones to decrement the semaphore, marking the UAV accesses for the > primitive as coherent with L3. > > The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline > stall before any threads are dispatched for the first FF stage with > "Accesses UAV" set until the semaphore is cleared for the same stage. > Effectively this guarantees that UAV memory accesses performed by > previous primitives from any stage will be strictly ordered (and > thanks to the implicit DC flush visible in memory) with UAV accesses > from the following primitives. > > None of this is required by the usual image, atomic counter and SSBO > GL APIs which have very relaxed cross-primitive coherency and ordering > requirements, so we don't actually ever set the "UAV Coherency > Required" bit -- Ordering with respect to shader invocations from > previous stages on the same primitive where there is a data dependency > is of course already guaranteed as the spec requires, regardless of > this mechanism being enabled. We do set the "Accesses UAV" bits > though since my commit ac7664e493655e290783c23a0412b9c70936da50 (which > this patch partially reverts), mainly because of comments like the > following from the BDW PRM: > >> 3DSTATE_GS >>[...] >> 12 Accesses UAV >> Format: Enable >> This field must be set when GS has a UAV access. > > There are similar comments in the documentation for the other > 3DSTATE_*S commands. The "must" part is misleading and unjustified > AFAIK. Most of the "Accesses UAV" bits don't seem to have any side > effects other than the implicit DC flushes and the related > book-keeping in anticipation for a subsequent primitive with "UAV > Coherency Required" set, so in most cases they are unnecessary and may > incur a performance penalty. There is an exception though. On Gen8+ > the PS_EXTRA UAV access bit influences the calculation of the PS > UAV-only and ThreadDispatchEnable signals which on previous > generations were set explicitly by the driver, so we cannot always > avoid enabling it on the PS stage. > > The primary motivation for this change is that in fact the hardware > coherency mechanism is buggy and will cause a rather non-deterministic > hang on Gen8 when VS is the only stage with "Accesses UAV" set and the > processing of a request terminates immediately after the implicit DC > flush is sent for a previous primitive with no additional vertices > being emitted for the second primitive, what will cause the hardware > to skip sending a second DC flush and cause the VS to stall > indefinitely waiting for a response from the DC (BDWGFX HSD 1912017). > This hardware bug can be reproduced on current master with the > spec@arb_shader_image_load_store@host-mem-barrier@Indirect/RaW piglit > subtest (if you have the patience to run it a few dozen times). > > The proposed workaround is to insert CS STALLs speculatively between > 3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage > only. Because this would affect one of the hottest paths in the > driver and likely decrease performance even further due to the > unnecessary serialization, and because we don't actually need the > implicit DC flushes, it seems better to just disable them. > --- > src/mesa/drivers/dri/i965/gen7_gs_state.c | 4 +--- > src/mesa/drivers/dri/i965/gen7_vs_state.c | 4 +--- > src/mesa/drivers/dri/i965/gen7_wm_state.c | 12 ++++++++---- > src/mesa/drivers/dri/i965/gen8_gs_state.c | 4 +--- > src/mesa/drivers/dri/i965/gen8_ps_state.c | 32 > ++++++++++++++++++++++++++++--- > src/mesa/drivers/dri/i965/gen8_vs_state.c | 4 +--- > 6 files changed, 41 insertions(+), 19 deletions(-) > > diff --git a/src/mesa/drivers/dri/i965/gen7_gs_state.c > b/src/mesa/drivers/dri/i965/gen7_gs_state.c > index 497ecec..8d6d3fe 100644 > --- a/src/mesa/drivers/dri/i965/gen7_gs_state.c > +++ b/src/mesa/drivers/dri/i965/gen7_gs_state.c > @@ -59,9 +59,7 @@ upload_gs_state(struct brw_context *brw) > OUT_BATCH(((ALIGN(stage_state->sampler_count, 4)/4) << > GEN6_GS_SAMPLER_COUNT_SHIFT) | > ((brw->gs.prog_data->base.base.binding_table.size_bytes / 4) > << > - GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | > - (brw->is_haswell && prog_data->base.nr_image_params ? > - HSW_GS_UAV_ACCESS_ENABLE : 0)); > + GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); > > if (brw->gs.prog_data->base.base.total_scratch) { > OUT_RELOC(stage_state->scratch_bo, > diff --git a/src/mesa/drivers/dri/i965/gen7_vs_state.c > b/src/mesa/drivers/dri/i965/gen7_vs_state.c > index b7e4858..a18dc69 100644 > --- a/src/mesa/drivers/dri/i965/gen7_vs_state.c > +++ b/src/mesa/drivers/dri/i965/gen7_vs_state.c > @@ -126,9 +126,7 @@ upload_vs_state(struct brw_context *brw) > ((ALIGN(stage_state->sampler_count, 4)/4) << > GEN6_VS_SAMPLER_COUNT_SHIFT) | > ((brw->vs.prog_data->base.base.binding_table.size_bytes / 4) << > - GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | > - (brw->is_haswell && prog_data->base.nr_image_params ? > - HSW_VS_UAV_ACCESS_ENABLE : 0)); > + GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); > > if (prog_data->base.total_scratch) { > OUT_RELOC(stage_state->scratch_bo, > diff --git a/src/mesa/drivers/dri/i965/gen7_wm_state.c > b/src/mesa/drivers/dri/i965/gen7_wm_state.c > index fd6dab5..06d5e65 100644 > --- a/src/mesa/drivers/dri/i965/gen7_wm_state.c > +++ b/src/mesa/drivers/dri/i965/gen7_wm_state.c > @@ -113,7 +113,14 @@ upload_wm_state(struct brw_context *brw) > else if (prog_data->base.nr_image_params) > dw1 |= GEN7_WM_EARLY_DS_CONTROL_PSEXEC; > > - /* _NEW_BUFFERS | _NEW_COLOR */ > + /* The "UAV access enable" bits are unnecessary on HSW because they only > + * seem to have an effect on the HW-assisted coherency mechanism which we > + * don't need, and the rasterization-related UAV_ONLY flag and the > + * DISPATCH_ENABLE bit can be set independently from it. > + * C.f. gen8_upload_ps_extra(). > + * > + * BRW_NEW_FRAGMENT_PROGRAM | BRW_NEW_FS_PROG_DATA | _NEW_BUFFERS | > _NEW_COLOR > + */ > if (brw->is_haswell && > !(brw_color_buffer_write_enabled(brw) || writes_depth) && > prog_data->base.nr_image_params) > @@ -221,9 +228,6 @@ gen7_upload_ps_state(struct brw_context *brw, > _mesa_get_min_invocations_per_fragment(ctx, fp, false); > assert(min_inv_per_frag >= 1); > > - if (brw->is_haswell && prog_data->base.nr_image_params) > - dw4 |= HSW_PS_UAV_ACCESS_ENABLE; > - > if (prog_data->prog_offset_16 || prog_data->no_8) { > dw4 |= GEN7_PS_16_DISPATCH_ENABLE; > if (!prog_data->no_8 && min_inv_per_frag == 1) { > diff --git a/src/mesa/drivers/dri/i965/gen8_gs_state.c > b/src/mesa/drivers/dri/i965/gen8_gs_state.c > index 81bd3b2..26a02d3 100644 > --- a/src/mesa/drivers/dri/i965/gen8_gs_state.c > +++ b/src/mesa/drivers/dri/i965/gen8_gs_state.c > @@ -52,9 +52,7 @@ gen8_upload_gs_state(struct brw_context *brw) > ((ALIGN(stage_state->sampler_count, 4)/4) << > GEN6_GS_SAMPLER_COUNT_SHIFT) | > ((prog_data->base.binding_table.size_bytes / 4) << > - GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | > - (prog_data->base.nr_image_params ? > - HSW_GS_UAV_ACCESS_ENABLE : 0)); > + GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); > > if (brw->gs.prog_data->base.base.total_scratch) { > OUT_RELOC64(stage_state->scratch_bo, > diff --git a/src/mesa/drivers/dri/i965/gen8_ps_state.c > b/src/mesa/drivers/dri/i965/gen8_ps_state.c > index ae18f0f..a6c9ab3 100644 > --- a/src/mesa/drivers/dri/i965/gen8_ps_state.c > +++ b/src/mesa/drivers/dri/i965/gen8_ps_state.c > @@ -25,6 +25,7 @@ > #include "program/program.h" > #include "brw_state.h" > #include "brw_defines.h" > +#include "brw_wm.h" > #include "intel_batchbuffer.h" > > void > @@ -61,8 +62,33 @@ gen8_upload_ps_extra(struct brw_context *brw, > if (brw->gen >= 9 && prog_data->pulls_bary) > dw1 |= GEN9_PSX_SHADER_PULLS_BARY; > > - if (_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) || > - prog_data->base.nr_image_params) > + /* The stricter cross-primitive coherency guarantees that the hardware > + * gives us with the "Accesses UAV" bit set for at least one shader stage > + * and the "UAV coherency required" bit set on the 3DPRIMITIVE command are > + * redundant within the current image, atomic counter and SSBO GL APIs, > + * which all have very loose ordering and coherency requirements and > + * generally rely on the application to insert explicit barriers when a > + * shader invocation is expected to see the memory writes performed by the > + * invocations of some previous primitive. Regardless of the value of > "UAV > + * coherency required", the "Accesses UAV" bits will implicitly cause an > in > + * most cases useless DC flush when the lowermost stage with the bit set > + * finishes execution. > + * > + * It would be nice to disable it, but in some cases we can't because on > + * Gen8+ it also has an influence on rasterization via the PS UAV-only > + * signal (which could be set independently from the coherency mechanism > in > + * the 3DSTATE_WM command on Gen7), and because in some cases it will > + * determine whether the hardware skips execution of the fragment shader > or > + * not via the ThreadDispatchEnable signal. However if we know that > + * GEN8_PS_BLEND_HAS_WRITEABLE_RT is going to be set and > + * GEN8_PSX_PIXEL_SHADER_NO_RT_WRITE is not set it shouldn't make any > + * difference so we may just disable it here. > + * > + * BRW_NEW_FS_PROG_DATA | BRW_NEW_FRAGMENT_PROGRAM | _NEW_BUFFERS | > _NEW_COLOR > + */ > + if ((_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) || > + prog_data->base.nr_image_params) && > + !brw_color_buffer_write_enabled(brw)) > dw1 |= GEN8_PSX_SHADER_HAS_UAV; > > BEGIN_BATCH(2); > @@ -87,7 +113,7 @@ upload_ps_extra(struct brw_context *brw) > > const struct brw_tracked_state gen8_ps_extra = { > .dirty = { > - .mesa = 0, > + .mesa = _NEW_BUFFERS | _NEW_COLOR, > .brw = BRW_NEW_CONTEXT | > BRW_NEW_FRAGMENT_PROGRAM | > BRW_NEW_FS_PROG_DATA | > diff --git a/src/mesa/drivers/dri/i965/gen8_vs_state.c > b/src/mesa/drivers/dri/i965/gen8_vs_state.c > index 8b5048b..28f5add 100644 > --- a/src/mesa/drivers/dri/i965/gen8_vs_state.c > +++ b/src/mesa/drivers/dri/i965/gen8_vs_state.c > @@ -53,9 +53,7 @@ upload_vs_state(struct brw_context *brw) > ((ALIGN(stage_state->sampler_count, 4) / 4) << > GEN6_VS_SAMPLER_COUNT_SHIFT) | > ((prog_data->base.binding_table.size_bytes / 4) << > - GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | > - (prog_data->base.nr_image_params ? > - HSW_VS_UAV_ACCESS_ENABLE : 0)); > + GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); > > if (prog_data->base.total_scratch) { > OUT_RELOC64(stage_state->scratch_bo, > -- > 2.4.6 > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev