On 09/11/2013 10:00 PM, Chia-I Wu wrote:
> From: Chia-I Wu <o...@lunarg.com>
>
> Replicate the gradient of the top-left pixel to the other three pixels in the
> subspan, as how DDY is implemented.  Before, different gradients were used for
> pixels in the top row and pixels in the bottom row.
>
> This change results in a less accurate approximation.  However, it improves
> the performance of Xonotic with Ultra settings by 24.3879% +/- 0.832202% (at
> 95.0% confidence) on Haswell.  No noticeable image quality difference
> observed.
>
> No piglit gpu.tests regressions.
>
> I failed to come up with an explanation for the performance difference.  The
> change does not make a difference on Ivy Bridge either.  If anyone has the
> insight, please kindly enlighten me.  Performance differences may also be
> observed on other games that call textureGrad and dFdx.

After all the experiments and discussions with the hardware guys, let's go
ahead and do this.  We should do a couple of things, however.

1. Disable the optimization if the application explicitly sets
   GL_FRAGMENT_SHADER_DERIVATIVE_HINT to GL_NICEST.

2. Add a driconf option, as suggested by Chris, to disable the optimization.

3. Use the same DDX / DDY calculation on all platforms.

4. Update the commit message and the comment in the code with the explanation
   of the optimization (the HSW sample_d instruction does some optimizations
   if the same LOD is used for all pixels, etc.).

> Signed-off-by: Chia-I Wu <o...@lunarg.com>
> ---
>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> index bfb3d33..c0d24a0 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> @@ -564,16 +564,25 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>  void
>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct brw_reg src)
>  {
> +   /* approximate with ((ss0.tr - ss0.tl)x4 (ss1.tr - ss1.tl)x4) on Haswell,
> +    * which gives much better performance when the result is used with
> +    * sample_d
> +    */
> +   unsigned vstride = (brw->is_haswell) ? BRW_VERTICAL_STRIDE_4 :
> +                                          BRW_VERTICAL_STRIDE_2;
> +   unsigned width = (brw->is_haswell) ? BRW_WIDTH_4 :
> +                                        BRW_WIDTH_2;
> +
>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>                                   BRW_REGISTER_TYPE_F,
> -                                 BRW_VERTICAL_STRIDE_2,
> -                                 BRW_WIDTH_2,
> +                                 vstride,
> +                                 width,
>                                   BRW_HORIZONTAL_STRIDE_0,
>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>                                   BRW_REGISTER_TYPE_F,
> -                                 BRW_VERTICAL_STRIDE_2,
> -                                 BRW_WIDTH_2,
> +                                 vstride,
> +                                 width,
>                                   BRW_HORIZONTAL_STRIDE_0,
>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>     brw_ADD(p, dst, src0, negate(src1));
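
For anyone following along, the effect of switching the source regions from
<2;2,0> to <4;4,0> can be modelled outside the driver.  The sketch below is
not Mesa code: Reg8, read_region() and ddx_model() are hypothetical names,
plain integers stand in for the BRW_VERTICAL_STRIDE_* / BRW_WIDTH_* enums,
and it assumes a SIMD8 register holding two subspans, each laid out as
tl, tr, bl, br, which is what the comment in the patch suggests.

```cpp
#include <array>

// Hypothetical model of one SIMD8 float register: two 2x2 subspans,
// each stored as tl, tr, bl, br (an assumption, not taken from Mesa).
using Reg8 = std::array<float, 8>;

// Element read by channel `ch` for a region <vstride; width, hstride>
// that starts at element `subnr` of the register.
static float read_region(const Reg8 &reg, unsigned subnr, unsigned vstride,
                         unsigned width, unsigned hstride, unsigned ch)
{
   const unsigned row = ch / width;
   const unsigned col = ch % width;
   return reg[subnr + row * vstride + col * hstride];
}

// Models brw_ADD(p, dst, src0, negate(src1)) with src0 starting at
// element 1 and src1 at element 0, as in generate_ddx().
// coarse == false: <2;2,0>, one gradient per row of each subspan.
// coarse == true:  <4;4,0>, the top-row gradient replicated to all
//                  four pixels of each subspan.
static Reg8 ddx_model(const Reg8 &src, bool coarse)
{
   const unsigned vstride = coarse ? 4 : 2;
   const unsigned width = coarse ? 4 : 2;
   Reg8 dst{};
   for (unsigned ch = 0; ch < 8; ch++) {
      dst[ch] = read_region(src, 1, vstride, width, 0, ch) -
                read_region(src, 0, vstride, width, 0, ch);
   }
   return dst;
}
```

With src = {0, 1, 10, 13, 100, 105, 200, 207}, the <2;2,0> variant yields a
separate gradient for the top and bottom row of each subspan, while the
<4;4,0> variant yields a single value per subspan, so every pixel in a
subspan feeds the same gradient to sample_d.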