fs: Import image memory offset calculation code.

Jason Ekstrand Fri, 24 Jul 2015 07:53:09 -0700

On Fri, Jul 24, 2015 at 7:39 AM, Francisco Jerez <curroje...@riseup.net> wrote:
> Jason Ekstrand <ja...@jlekstrand.net> writes:
>
>> On Jul 24, 2015 4:00 AM, "Francisco Jerez" <curroje...@riseup.net> wrote:
>>>
>>> Jason Ekstrand <ja...@jlekstrand.net> writes:
>>>
>>> > Ok, I've looked through this again and convinced myself that it's
>>> > *mostly* correct.  I am a bit skeptical of the address swizzling for
>>> > Y-major tiling.
>>> >
>>> > I've also included some comments that I'd like to see added (assuming
>>> > they're correct).  Sometimes it's easier to write helpful comments if
>>> > the person who is confused does the writing. :-)
>>> >
>>>
>>> Thanks!  This will save me quite some guesswork, I'll add them to the
>>> patch.
>>>
>>> > Also, I'd just like to say that I'm terribly impressed after reading
>>> > through this.  You managed to handle an amazing number of cases
>>> > without so much as a single if statement or predicated instruction.
>>> > My hat's off to you sir.
>>> >
>>> Hah, I feel flattered. :)
>>>
>>> > Modulo Y-tiling and comments,
>>> >
>>> > Reviewed-by: Jason Ekstrand <jason.ekstr...@intel.com>
>>> >
>>> > On Tue, Jul 21, 2015 at 9:38 AM, Francisco Jerez <curroje...@riseup.net>
>> wrote:
>>> >> Define a function to calculate the memory address of the image
>>> >> location given by a vector of coordinates.  This is required in cases
>>> >> where we need to fall back to untyped surface access, which take a raw
>>> >> memory offset and know nothing about surface coordinates, type
>>> >> conversion or memory tiling and swizzling.  They are still useful
>>> >> because typed surface reads don't support any 64 or 128-bit formats on
>>> >> IVB, and they don't support any 128-bit formats on HSW and BDW.
>>> >>
>>> >> The tiling algorithm is implemented based on a number of parameters
>>> >> which are passed in as uniforms and determine whether the surface
>>> >> layout is X-tiled, Y-tiled or untiled.  This allows binding surfaces
>>> >> of different tiling layouts to the pipeline without recompiling the
>>> >> program.
>>> >>
>>> >> v2: Drop VEC4 suport.
>>> >> v3: Rebase.
>>> >> ---
>>> >>  .../drivers/dri/i965/brw_fs_surface_builder.cpp    | 108
>> +++++++++++++++++++++
>>> >>  1 file changed, 108 insertions(+)
>>> >>
>>> >> diff --git a/src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp
>>> >> index 5ee04de..0c879db 100644
>>> >> --- a/src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp
>>> >> +++ b/src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp
>>> >> @@ -215,4 +215,112 @@ namespace {
>>> >>           return BRW_PREDICATE_NORMAL;
>>> >>        }
>>> >>     }
>>> >> +
>>> >> +   namespace image_coordinates {
>>> >> +      /**
>>> >> +       * Calculate the offset in memory of the texel given by \p
>> coord.
>>> >> +       *
>>> >> +       * This is meant to be used with untyped surface messages to
>> access a
>>> >> +       * tiled surface, what involves taking into account the tiling
>> and
>>> >> +       * swizzling modes of the surface manually so it will hopefully
>> not
>>> >> +       * happen very often.
>>> >> +       */
>>> >> +      fs_reg
>>> >> +      emit_address_calculation(const fs_builder &bld, const fs_reg
>> &image,
>>> >> +                               const fs_reg &coord, unsigned dims)
>>> >> +      {
>>> >> +         const brw_device_info *devinfo = bld.shader->devinfo;
>>> >> +         const fs_reg off = offset(image, bld,
>> BRW_IMAGE_PARAM_OFFSET_OFFSET);
>>> >> +         const fs_reg stride = offset(image, bld,
>> BRW_IMAGE_PARAM_STRIDE_OFFSET);
>>> >> +         const fs_reg tile = offset(image, bld,
>> BRW_IMAGE_PARAM_TILING_OFFSET);
>>> >> +         const fs_reg swz = offset(image, bld,
>> BRW_IMAGE_PARAM_SWIZZLING_OFFSET);
>>> >> +         const fs_reg addr = bld.vgrf(BRW_REGISTER_TYPE_UD, 2);
>>> >> +         const fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_UD, 2);
>>> >> +         const fs_reg minor = bld.vgrf(BRW_REGISTER_TYPE_UD, 2);
>>> >> +         const fs_reg major = bld.vgrf(BRW_REGISTER_TYPE_UD, 2);
>>> >> +         const fs_reg dst = bld.vgrf(BRW_REGISTER_TYPE_UD);
>>> >> +
>>> >> +         /* Shift the coordinates by the fixed surface offset. */
>>> >> +         for (unsigned c = 0; c < 2; ++c)
>>> >> +            bld.ADD(offset(addr, bld, c), offset(off, bld, c),
>>> >> +                    (c < dims ?
>>> >> +                     offset(retype(coord, BRW_REGISTER_TYPE_UD), bld,
>> c) :
>>> >> +                     fs_reg(0)));
>>> >> +
>>> >> +         if (dims > 2) {
>>> >> +            /* Decompose z into a major (tmp.y) and a minor (tmp.x)
>>> >> +             * index.
>>> >
>>> > Please add a comment:
>>> > /*
>>> > The layout of 3-D textures in memory is sort-of like a tiling format.
>>> > At each miplevel, the slices are arranged in rows of 2^level slices
>>> > per row.  The slice row is stored in tmp.y and the slice within the
>>> > row is stored in tmp.x.  The layout of 2-D array textures is much
>>> > simpler.  Each slice of the texture is storred as a 2-D texture with
>>> > all its miplevels and then one qpitch later, the next slice is stored
>>> > with its miplevels.  This code handles both by passing in the miplevel
>>> > as tile[2] for 3-D textures and 0 in tile[2] for 2-D array textures.
>>> > */
>>>
>>> Note that depending on the ARYSPC_LOD0 surface state flag array textures
>>> may also be interleaved in the opposite way -- All slices for LOD0, then
>>> all slices for LOD1, and so on.  We probably don't hit this case
>>> currently because MS images are not exposed but it's handled in the same
>>> way by setting stride[3] to the spacing between slices.  I'll add that
>>> to your comment.
>>>
>>> >> +             */
>>> >> +            bld.BFE(offset(tmp, bld, 0), offset(tile, bld, 2),
>> fs_reg(0),
>>> >> +                    offset(retype(coord, BRW_REGISTER_TYPE_UD), bld,
>> 2));
>>> >> +            bld.SHR(offset(tmp, bld, 1),
>>> >> +                    offset(retype(coord, BRW_REGISTER_TYPE_UD), bld,
>> 2),
>>> >> +                    offset(tile, bld, 2));
>>> >> +
>>> >> +            /* Take into account the horizontal (tmp.x) and vertical
>> (tmp.y)
>>> >> +             * slice offset.
>>> >> +             */
>>> >> +            for (unsigned c = 0; c < 2; ++c) {
>>> >> +               bld.MUL(offset(tmp, bld, c),
>>> >> +                       offset(stride, bld, 2 + c), offset(tmp, bld,
>> c));
>>> >> +               bld.ADD(offset(addr, bld, c),
>>> >> +                       offset(addr, bld, c), offset(tmp, bld, c));
>>> >> +            }
>>> >> +         }
>>> >> +
>>> >> +         if (dims > 1) {
>>> >
>>> > Please add as a comment:
>>> > /*
>>> > Calculate the major/minor x and y indices.  In order to accommodate
>>> > both X and Y tiling, the Y-major tiling format is treated as being a
>>> > bunch of narrow X-tiles placed next to each other.   This means that
>>> > the tile width for Y-tiling is actually the width of one sub-column of
>>> > the Y-major tile where each 4K tile has 8 512B sub-columns.
>>> >
>>> > The major Y value is the row of tiles in which the pixel lives.  The
>>> > major X value is the tile sub-column in which the pixel lives; for X
>>> > tiling, this is the same as the tile column, for Y tiling, each tile
>>> > has 8 sub-columns.  The minor X and Y indices are the position within
>>> > the sub-column.
>>> > */
>>> >> +            for (unsigned c = 0; c < 2; ++c) {
>>> >> +               /* Calculate the minor x and y indices. */
>>> >> +               bld.BFE(offset(minor, bld, c), offset(tile, bld, c),
>>> >> +                       fs_reg(0), offset(addr, bld, c));
>>> >> +
>>> >> +               /* Calculate the major x and y indices. */
>>> >> +               bld.SHR(offset(major, bld, c),
>>> >> +                       offset(addr, bld, c), offset(tile, bld, c));
>>> >> +            }
>>> >> +
>>> >> +            /* Calculate the texel index from the start of the tile
>> row and
>>> >> +             * the vertical coordinate of the row.
>>> >> +             * Equivalent to:
>>> >> +             *   tmp.x = (major.x << tile.y << tile.x) +
>>> >> +             *           (minor.y << tile.x) + minor.x
>>> >> +             *   tmp.y = major.y << tile.y
>>> >> +             */
>>> >> +            bld.SHL(tmp, major, offset(tile, bld, 1));
>>> >> +            bld.ADD(tmp, tmp, offset(minor, bld, 1));
>>> >> +            bld.SHL(tmp, tmp, offset(tile, bld, 0));
>>> >> +            bld.ADD(tmp, tmp, minor);
>>> >> +            bld.SHL(offset(tmp, bld, 1),
>>> >> +                    offset(major, bld, 1), offset(tile, bld, 1));
>>> >> +
>>> >> +            /* Add it to the start of the tile row. */
>>> >> +            bld.MUL(offset(tmp, bld, 1),
>>> >> +                    offset(tmp, bld, 1), offset(stride, bld, 1));
>>> >> +            bld.ADD(tmp, tmp, offset(tmp, bld, 1));
>>> >> +
>>> >> +            /* Multiply by the Bpp value. */
>>> >> +            bld.MUL(dst, tmp, stride);
>>> >> +
>>> >> +            if (devinfo->gen < 8 && !devinfo->is_baytrail) {
>>> >
>>> > Talking to Ken a bit about this. It seems as if this is actually the
>>> > check we want.  It's going to be a nasty bug if someone has a shader
>>> > cache on a HSW machine, runs some compute shaders, pulls out a stick
>>> > of ram (going from dual to single-channel), and starts it up again.
>>> > Probably best to compile the shader the same regardless of how much
>>> > ram you have.
>>> >
>>> Hmm, I guess we could also invalidate the cache if the flag changes?
>>> (or in case anything else changes from the brw_device_info structure)
>>>
>>> >> +               /* Take into account the two dynamically specified
>> shifts. */
>>> >> +               for (unsigned c = 0; c < 2; ++c)
>>> >> +                  bld.SHR(offset(tmp, bld, c), dst, offset(swz, bld,
>> c));
>>> >> +
>>> >> +               /* XOR tmp.x and tmp.y with bit 6 of the memory
>> address. */
>>> >> +               bld.XOR(tmp, tmp, offset(tmp, bld, 1));
>>> >> +               bld.AND(tmp, tmp, fs_reg(1 << 6));
>>> >> +               bld.XOR(dst, dst, tmp);
>>> >
>>> > I don't think this is correct for Y-tiled surfaces.  In Y-tiling, bit
>>> > 6 of the address is given by bit6 XOR bit9.  You have one extra XOR
>>> > which means you just get just bit9.
>>> >
>>> The extra XOR is made a no-op for the Y-tiled case in the same way that
>>> they are both made a no-op in the no-swizzling case.  I'll add another
>>> comment clarifying that.
>>
>> I don't think that actually works.  In the zero-shift case, the first XOR
>> produces an all-zero value, you mask to bit 6 (still all-zero) and XOR with
>> the original value and get the original back.  In the Y-tiled case, you
>> compute a ^ (a >> 3) and mask off bit 6.  This gives you bit6 ^ bit9 in the
>> sixth bit and zero everywhere else.  When you then XOR that with the
>> original, the two copies of bit6 cancel out and you are left with bit9 only.
>>
> Not exactly, in either case the result of the swizzling calculation
> is:
>   addr' = addr ^ ((1 << 6) & ((addr >> swz.x) ^ (addr >> swz.y)))
>
> If the surface is Y-tiled, swz.x = 3 and swz.y = 31, which gives:
>   addr' = addr ^ ((1 << 6) & ((addr[9] ^ addr[37]) << 6)) =
>           addr ^ ((addr[9] ^ 0) << 6) = addr ^ (addr[9] << 6)
>
> I guess I should add another comment here about how the XOR instructions
> are NOP-ed out when they're not needed. :)


Right.  I'd missed the fact that the default swizzle was not zero.
Also, it's 0xff, so it's 255, not 31.  I'm not sure what addr >> 255
does.  Maybe it only takes 5 bits and it's actually 31, but we should
probably use 31 rather than 255.
--Jason

>>> >> +            }
>>> >> +
>>> >> +         } else {
>>> >> +            /* Multiply by the Bpp/stride value. */
>>> >
>>> > I'm still confused as to how addr can have 2 things while dims == 1.
>>> > Does dims specify the per-slice dimension or total dimension?  That
>>> > should probably be documented at the top of the function.
>>> >
>>> dims specifies the dimension of the coord argument.  addr may still have
>>> a non-zero y coordinate because it's the sum of coord and the
>>> two-dimensional offset value specified by surface set-up.  1D texture
>>> slices and levels are laid out in a 2D layout just like the rest, the
>>> offset value specifies where in the 2D miptree the requested 1D image
>>> starts.  You may wonder why we pass a 2D offset to the shader instead of
>>> just fixing up the surface base memory address to point at the right
>>> slice at offset (0, 0), the answer is that in the general case (i.e. not
>>> necessarily 1D) it wouldn't work, because a level may start mid-tile and
>>> the result of shifting a tiled surface by less than the tile size is in
>>> general not a well-formed tiled surface -- In the 1D case it could
>>> probably be done though because I deliberately disabled tiling for those
>>> (36a17f0f9913), but it would require special-casing them in the surface
>>> state set-up code and another manual calculation of the base memory
>>> address of the right slice-level of the miptree.
>>>
>>> >> +            bld.MUL(offset(addr, bld, 1),
>>> >> +                    offset(addr, bld, 1), offset(stride, bld, 1));
>>> >> +            bld.ADD(addr, addr, offset(addr, bld, 1));
>>> >> +            bld.MUL(dst, addr, stride);
>>> >> +         }
>>> >> +
>>> >> +         return dst;
>>> >> +      }
>>> >> +   }
>>> >>  }
>>> >> --
>>> >> 2.4.3
>>> >>
>>> >> _______________________________________________
>>> >> mesa-dev mailing list
>>> >> mesa-dev@lists.freedesktop.org
>>> >> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 07/20] i965/fs: Import image memory offset calculation code.

Reply via email to