On 02/02/2013 10:28 AM, Paul Berry wrote:
On 2 February 2013 03:44, Kenneth Graunke <kenn...@whitecape.org> wrote:

    The old calculation was off by one in certain cases.  The documentation
    contains a precise formula for how to calculate it, so just use that.


Wow, thank you for putting in the time and effort to track this down,
Ken.  I spent a fair amount of time this morning instrumenting the code
to understand exactly what has changed, so that I could reassure myself
that this is a true bug fix and not just a behavioural change that
coincidentally makes the heisenbug disappear.  I believe your fix is
correct, and I believe I understand why the bug happened (I even have a
speculation about how the hardware works that would explain why this fix
is necessary).

Can you see if my understanding of the bug makes sense to you?  It's a
little different from your "off by one in certain cases" explanation,
and I want to make sure that's not due to a misunderstanding on my part.

Yeah, the "it's sometimes off by one" explanation was more a "I've been messing around with this formula and trimming apitraces for hours and it's 4am and the formulas give different answers because this one apparently works". You're right, I've definitely seen it off by 0, 1, or 2.

My understanding of the bug as of this morning:

Prior to this patch, we thought that the only restrictions on
3DSTATE_SF's URB read length were (a) it needs to be large enough to
read all the VUE data that the SF needs, and (b) it can't be so large
that it tries to read VUE data that doesn't exist.  Since the VUE map
already tells us how much VUE data exists, we didn't bother worrying
about restriction (a); we just did the easy thing and programmed the
read length to satisfy restriction (b).

However, we didn't notice this erratum in the hardware docs: "[errata]
Corruption/Hang possible if length programmed larger than recommended".
Judging by the context surrounding this erratum, it's pretty clear that
it means "URB read length must be exactly the size necessary to read all
the VUE data that the SF needs, and no larger".  Which means that we
can't program the read length based on restriction (b)--we have to
program it based on restriction (a).  Which is what your patch does.

So, while it's technically correct that the old calculation could be off
by one in certain cases, it doesn't really capture the essence of the
problem, because the old calculation was simply the wrong calculation.
In some cases it was correct, and in other cases it was off by one or
sometimes more.  (In fact, in the test case Jordan isolated for bug
56920, I believe it was off by 2).  The real explanation here is that
the URB read size needs to precisely match the amount of data that the
SF consumes; it doesn't work to simply base it on the size of the VUE.
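
To make this concrete, here's a rough standalone sketch (not driver code;
the helper names and the example VUE size are made up purely for
illustration) of the two calculations, in the units the hardware uses
(one read-length unit covers two VUE slots):

    /* Illustration only -- compiles with any C compiler. */
    #include <stdio.h>

    /* Restriction (b): size the read to cover everything the VS wrote
     * (the old calculation, with read_offset = 1 skipping the first two
     * VUE slots).
     */
    static unsigned read_length_from_vue_size(unsigned num_slots,
                                              unsigned read_offset)
    {
       unsigned length = (num_slots + 1) / 2 - read_offset;
       return length == 0 ? 1 : length;  /* a length of 0 is undefined */
    }

    /* Restriction (a): size the read to cover only the maximum source
     * attribute the SF actually uses (the new calculation):
     * read_length = ceiling((max_source_attr + 1) / 2).
     */
    static unsigned read_length_from_max_source_attr(unsigned max_source_attr)
    {
       return (max_source_attr + 2) / 2;  /* integer ceiling of (x + 1) / 2 */
    }

    int main(void)
    {
       /* Hypothetical VUE with 8 slots, where the SF only sources the first
        * two attributes after the read offset (so max_source_attr = 1).
        */
       printf("old read length: %u\n", read_length_from_vue_size(8, 1));     /* 3 */
       printf("new read length: %u\n", read_length_from_max_source_attr(1)); /* 1 */
       return 0;
    }

With made-up numbers like these the two formulas differ by 2; with other
VUE layouts they differ by 1 or not at all, which matches the range of
discrepancies we've seen.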

This makes a ton of sense to me.

Speculation about why it's necessary for the URB read size to precisely
match the amount of data that the SF consumes: maybe the only thing
synchronizing the SF's URB reads with the rest of the work it does is a
data dependency.  In other words, maybe the only thing stopping the SF
from skipping ahead to triangle B before triangle A's URB read completes
is that it has to wait for the data from that URB read before it can
finish processing triangle A.  If we make the URB read size too big, we
break that synchronization mechanism by causing the SF to receive extra
data that it's not smart enough to wait around for.  As a result, if we
run a program where the FS ignores the VS's last varying slot(s), and
follow it immediately by a program where the FS uses those slot(s), then
there's a danger that one of the VUE reads from the old program will
still be in flight when the new program starts, and it'll get
misinterpreted as varying data for the first triangle of the new
program.  This would explain pretty much everything we've seen about
this bug: why it only happens when the VS outputs extra varyings that
the FS doesn't use, why bug 56920 is so sensitive to the ordering of the
varyings, why bug 60172 bisects to a patch that changes varying order,
why the bug manifests as corrupted data in a single vertex of the first
triangle rendered by the second program, and finally, why doing a
glFinish() between the two programs masks the bug--because it gives
enough time for the extra URB read to complete before the second program
starts.

This sounds believable enough. And maybe they fixed some of that synchronization in Ivybridge, which is why it works there. Or perhaps we've just been lucky. I doubt we'll ever really know.

Does this sound right to you?  Does it jibe with what you saw in bug
60172?  If so, it would be nice to expand on the commit message a bit.  I
don't think we need to include my speculation above, but at a minimum it
would be nice to quote the erratum, and explain that as a consequence of
it, we have to base the URB read length on the maximum VUE slot that is
used by the SF rather than the maximum VUE slot that exists.

I cut and pasted your explanation into the commit message, since it's a very clear description of what's going on. Thanks, Paul!

One other minor comment below.


    Fixes random corruption in Steam's Big Picture Mode, random corruption
    in PlaneShift (since the varying reordering code landed), and Piglit
    test ________.

    NOTE: This is a candidate for all stable branches.

    (The code needs to get cleaned up - the final result is quite ugly - but
      I wanted to put it on the list for the people working on these bugs.)

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56920
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=60172
    Signed-off-by: Kenneth Graunke <kenn...@whitecape.org>
    ---
      src/mesa/drivers/dri/i965/gen6_sf_state.c | 52 +++++++++++++++++++++----------
      1 file changed, 35 insertions(+), 17 deletions(-)

    I don't expect this to go upstream as is; it should get cleaned up
    first.

    We also should do the same thing for Ivybridge.  Haven't tried that yet.

    diff --git a/src/mesa/drivers/dri/i965/gen6_sf_state.c b/src/mesa/drivers/dri/i965/gen6_sf_state.c
    index c1bc252..83ff79c 100644
    --- a/src/mesa/drivers/dri/i965/gen6_sf_state.c
    +++ b/src/mesa/drivers/dri/i965/gen6_sf_state.c
    @@ -113,7 +113,6 @@ upload_sf_state(struct brw_context *brw)
      {
         struct intel_context *intel = &brw->intel;
         struct gl_context *ctx = &intel->ctx;
    -   uint32_t urb_entry_read_length;
         /* BRW_NEW_FRAGMENT_PROGRAM */
         uint32_t num_outputs =
          _mesa_bitcount_64(brw->fragment_program->Base.InputsRead);
         /* _NEW_LIGHT */
    @@ -125,26 +124,15 @@ upload_sf_state(struct brw_context *brw)
         bool multisampled_fbo = ctx->DrawBuffer->Visual.samples > 1;

         int attr = 0, input_index = 0;
    +   /* Skip over the first two entries in the VUE map (position and point size),
    +    * as they're passed through anyway and reading them is just overhead.
    +    */
         int urb_entry_read_offset = 1;
         float point_size;
         uint16_t attr_overrides[FRAG_ATTRIB_MAX];
         uint32_t point_sprite_origin;

    -   /* CACHE_NEW_VS_PROG */
    -   urb_entry_read_length = ((brw->vs.prog_data->vue_map.num_slots + 1) / 2 -
    -                           urb_entry_read_offset);
    -   if (urb_entry_read_length == 0) {
    -      /* Setting the URB entry read length to 0 causes undefined behavior, so
    -       * if we have no URB data to read, set it to 1.
    -       */
    -      urb_entry_read_length = 1;
    -   }
    -
    -   dw1 =
    -      GEN6_SF_SWIZZLE_ENABLE |
    -      num_outputs << GEN6_SF_NUM_OUTPUTS_SHIFT |
    -      urb_entry_read_length << GEN6_SF_URB_ENTRY_READ_LENGTH_SHIFT |
    -      urb_entry_read_offset << GEN6_SF_URB_ENTRY_READ_OFFSET_SHIFT;
    +   dw1 = GEN6_SF_SWIZZLE_ENABLE | num_outputs << GEN6_SF_NUM_OUTPUTS_SHIFT;

         dw2 = GEN6_SF_STATISTICS_ENABLE |
               GEN6_SF_VIEWPORT_TRANSFORM_ENABLE;
    @@ -280,6 +268,7 @@ upload_sf_state(struct brw_context *brw)
       /* Create the mapping from the FS inputs we produce to the VS outputs
        * they source from.
        */
    +   uint32_t max_source_attr = 0;
         for (; attr < FRAG_ATTRIB_MAX; attr++) {
            enum glsl_interp_qualifier interp_qualifier =
               brw->fragment_program->InterpQualifier[attr];
    @@ -312,12 +301,41 @@ upload_sf_state(struct brw_context *brw)
            assert(input_index < 16 || attr == input_index);

            /* CACHE_NEW_VS_PROG | _NEW_LIGHT | _NEW_PROGRAM */
    -      attr_overrides[input_index++] =
    +      attr_overrides[input_index] =
               get_attr_override(&brw->vs.prog_data->vue_map,
                                urb_entry_read_offset, attr,
                                 ctx->VertexProgram._TwoSideEnabled);
    +      bool swizzling = (attr_overrides[input_index] & 0xc0) != 0;
    +      uint32_t source_attr =
    +         (attr_overrides[input_index] & 0x1f) + (swizzling ? 1 : 0);
    +
    +      if (source_attr > max_source_attr)
    +         max_source_attr = source_attr;
    +
    +      ++input_index;


Possible clean-up idea: pass the address of max_source_attr to
get_attr_override(), and let get_attr_override() update it.  That avoids
having to take the bitfield back apart here, since get_attr_override()
already knows what the source_attr is and understands the effect of
swizzling.  Also it will make it easier to share this bug fix between
gen6 and gen7.

Yeah, I wanted to do something like that (but again, 4am). I took your suggestion, and it's actually something I wouldn't be embarrassed to upstream now. Thanks!
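
For the archives, the shape of the change is roughly the toy sketch below:
the helper takes a pointer to max_source_attr and updates it itself, using
the same 0xc0/0x1f bitfield layout as in the hunk above. (This is a
standalone toy, not the real get_attr_override(); the override values in
main() are invented.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-in for the suggested interface: the helper already knows the
     * source attribute and whether swizzling adds one, so it can track the
     * maximum itself instead of the caller re-parsing the bitfield.
     */
    static uint16_t
    get_attr_override_sketch(uint16_t override, uint32_t *max_source_attr)
    {
       bool swizzling = (override & 0xc0) != 0;
       uint32_t source_attr = (override & 0x1f) + (swizzling ? 1 : 0);

       if (source_attr > *max_source_attr)
          *max_source_attr = source_attr;

       return override;
    }

    int main(void)
    {
       uint32_t max_source_attr = 0;
       uint16_t overrides[] = { 0x02, 0xc3, 0x01 };  /* made-up override values */

       for (unsigned i = 0; i < 3; i++)
          get_attr_override_sketch(overrides[i], &max_source_attr);

       printf("max_source_attr = %u\n", max_source_attr);  /* 4 = (0x03 & 0x1f) + 1 */
       return 0;
    }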

         }

    +   /* From the Sandy Bridge PRM, Volume 2, Part 1, documentation for
    +    * 3DSTATE_SF DWord 1 bits 15:11, "Vertex URB Entry Read Length":
    +    *
    +    * "This field should be set to the minimum length required to read the
    +    *  maximum source attribute.  The maximum source attribute is indicated
    +    *  by the maximum value of the enabled Attribute # Source Attribute if
    +    *  Attribute Swizzle Enable is set, Number of Output Attributes-1 if
    +    *  enable is not set.
    +    *  read_length = ceiling((max_source_attr + 1) / 2)
    +    *
    +    *  [errata] Corruption/Hang possible if length programmed larger than
    +    *  recommended"
    +    *
    +    * We use Attribute Swizzle Enable, so max_source_attr is bits 4:0 of the
    +    * last element of attr_overrides set above.  Using ALIGN(x, 2) guarantees
    +    * that the division will produce ceiling() rather than truncation.
    +    */
    +   uint32_t urb_entry_read_length = ALIGN(max_source_attr + 1, 2) / 2;
    +   dw1 |= urb_entry_read_length << GEN6_SF_URB_ENTRY_READ_LENGTH_SHIFT |
    +          urb_entry_read_offset << GEN6_SF_URB_ENTRY_READ_OFFSET_SHIFT;
    +
         for (; input_index < FRAG_ATTRIB_MAX; input_index++)
            attr_overrides[input_index] = 0;

    --
    1.8.1.2

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
