On Mon, Sep 16, 2013 at 3:50 AM, Mark Mueller <markkmuel...@gmail.com> wrote: > > > > On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype...@gmail.com> wrote: >> >> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote: >>> >>> From: Chia-I Wu <o...@lunarg.com> >>> >>> Consider only the top-left and top-right pixels to approximate DDX in a >>> 2x2 >>> subspan, unless the application or the user requests a more accurate >>> approximation. This results in a less accurate approximation. However, >>> it >>> improves the performance of Xonotic with Ultra settings by 24.3879% +/- >>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality >>> difference observed. >>> >>> No piglit gpu.tests regressions (tested with v1) >>> >>> I failed to come up with an explanation for the performance difference, >>> as the >>> change does not affect Ivy Bridge. If anyone has the insight, please >>> kindly >>> enlighten me. Performance differences may also be observed on other >>> games >>> that call textureGrad and dFdx. >>> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option. >>> Update >>> comments. >> >> >> I'm not entirely comfortable making a change that has a known negative >> impact on computational accuracy (even one that leads to such an impressive >> performance improvement) when we don't have any theories as to why the >> performance improvement happens, or why the improvement doesn't apply to Ivy >> Bridge. In my experience, making changes to the codebase without >> understanding why they improve things almost always leads to improvements >> that are brittle, since it's likely that the true source of the improvement >> is a coincidence that will be wiped out by some future change (or won't be >> relevant to client programs other than this particular benchmark). Having a >> theory as to why the performance improvement happens would help us be >> confident that we're applying the right fix under the right circumstances. > There's another angle to approach this and that is to develop a simple test > case that will show the different results across a range of computational > accuracy and run the test on proprietary drivers for the same hardware to > determine what settings they are using. Yes, I have a little test. On Windows, rendering with texture2D() or textureGrad() does not have a noticeable impact on the performance. But I am not sure how to change dFdx() accuracy from the shaders. On Linux, I did that by modifying the driver.
> >> >> >> For example, here's one theory as to why we might be seeing an >> improvement: perhaps Haswell's sample_d processing is smart enough to >> realize that when all the gradient values within a sub-span are the same, >> that means that all of the sampling for the sub-span will come from the same >> LOD, and that allows it to short-cut some expensive step in the LOD >> calculation. Perhaps the same improvement isn't seen on Ivy Bridge because >> Ivy Bridge's sample_d processing logic is less sophisticated, so it's unable >> to perform the optimization. If this is the case, then conditioning the >> optimization on brw->is_haswell (as you've done) makes sense. >> >> Another possible explanation for the Haswell vs Ivy Bridge difference is >> that perhaps Ivy Bridge, being a lower-performing chip, has other >> bottlenecks that make the optimization irrelevant for this particular >> benchmark, but potentially still useful for other benchmarks. For instance, >> maybe when this benchmark executes on Ivy Bridge, the texture that's being >> sampled from is located in sufficiently distant memory that optimizing the >> sample_d's memory accesses makes no difference, since the bottleneck is the >> speed with which the texture can be read into cache, rather than the speed >> of operation of sample_d. If this explanation is correct, then it might be >> worth applying the optimization to both Ivy Bridge and Haswell (and perhaps >> Sandy Bridge as well), since it might conceivably benefit those other chips >> when running applications that place less cache pressure on the chip. > > > This scenario is where I'd place my bets, especially given that the numbers > are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail as > is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra and > ultimate levels at medium and high resolutions, the results were all > essentially the same at comparable resolutions and quality levels. Isn't Bay Trail based on Ivy Bridge? > > I don't see any justification to tie this change to just Haswell hardware. > There's all sorts of reasons why doing that sounds like a big mistake. In > fact, another _explanation_ to add to your list is maybe there's another > is_haswell test elsewhere in the driver that is responsible for the > performance anomaly. I had a look at all 'if (brw->is_haswell)' paths but nothing special came out to me. It could as well be some place missing a 'if (brw->is_haswell)'. > >> >> Another possibile explanation is that Haswell has a bug in its sample_d >> logic which causes it to be slow under some conditions, and this >> lower-accuracy DDX computation happens to work around it. If that's the >> case, we might want to consider not using sample_d at all on Haswell, and >> instead calculating the LOD in the shader and using sample_l instead. If >> this is the correct explanation, then that might let us have faster >> performance without sacrificing DDX accuracy. >> >> A final possible explanation for the performance improvement is that >> perhaps for some reason sample_d performs more optimally when the DDX and >> DDY computations have similar accuracies to each other. Before your patch, >> our computation of DDX was more accurate than DDY; your patch decreases the >> accuracy of DDX to match DDY. If this explanation is correct, then a better >> solution would probably be to improve the accuracy of DDY to make it >> comparable to DDX, rather than the other way around. >> >> Before we land this patch, can we do some experiments to try to figure out >> which of these explanations (if any) is correct? >> >>> >>> >>> Signed-off-by: Chia-I Wu <o...@lunarg.com> >>> --- >>> src/mesa/drivers/dri/i965/brw_context.c | 1 + >>> src/mesa/drivers/dri/i965/brw_context.h | 1 + >>> src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 >>> ++++++++++++++++++++++++------- >>> src/mesa/drivers/dri/i965/intel_screen.c | 4 ++++ >>> 4 files changed, 38 insertions(+), 8 deletions(-) >>> >>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c >>> b/src/mesa/drivers/dri/i965/brw_context.c >>> index 4fcc9fb..1cdfb9d 100644 >>> --- a/src/mesa/drivers/dri/i965/brw_context.c >>> +++ b/src/mesa/drivers/dri/i965/brw_context.c >>> @@ -470,6 +470,7 @@ brwCreateContext(int api, >>> brw_draw_init( brw ); >>> >>> brw->precompile = driQueryOptionb(&brw->optionCache, >>> "shader_precompile"); >>> + brw->accurate_derivative = driQueryOptionb(&brw->optionCache, >>> "accurate_derivative"); >>> >>> ctx->Const.ContextFlags = 0; >>> if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0) >>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h >>> b/src/mesa/drivers/dri/i965/brw_context.h >>> index c566bba..8bfc54a 100644 >>> --- a/src/mesa/drivers/dri/i965/brw_context.h >>> +++ b/src/mesa/drivers/dri/i965/brw_context.h >>> @@ -964,6 +964,7 @@ struct brw_context >>> bool always_flush_cache; >>> bool disable_throttling; >>> bool precompile; >>> + bool accurate_derivative; >>> >>> driOptionCache optionCache; >>> /** @} */ >>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>> index bfb3d33..69aeab1 100644 >>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct >>> brw_reg dst, struct brw_reg src >>> * >>> * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br >>> * >>> - * and we're trying to produce: >>> + * Ideally, we want to produce: >>> * >>> * DDX DDY >>> * dst: (ss0.tr - ss0.tl) (ss0.tl - ss0.bl) >>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct >>> brw_reg dst, struct brw_reg src >>> * >>> * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same >>> result >>> * for each pair, and vertstride = 2 jumps us 2 elements after >>> processing a >>> - * pair. But for DDY, it's harder, as we want to produce the pairs >>> swizzled >>> - * between each other. We could probably do it like ddx and swizzle the >>> right >>> - * order later, but bail for now and just produce >>> + * pair. But the ideal approximation of DDX may impose a huge >>> performance >>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr - >>> + * ss1.tl)x4) unless the app or the user requests otherwise. >>> + * >>> + * For DDY, it's harder, as we want to produce the pairs swizzled >>> between each >>> + * other. We could probably do it like ddx and swizzle the right order >>> later, >>> + * but bail for now and just produce >>> * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4) >>> */ >>> void >>> fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct >>> brw_reg src) >>> { >>> + unsigned vstride, width; >>> + >>> + /* Produce accurate result only when requested. We emit only one >>> + * instruction for either case, but the problem is the result may >>> affect >>> + * how fast sample_d executes. >>> + * >>> + * Since the performance difference is only observed on Haswell, >>> ignore the >>> + * hints on other GENs for now. >>> + */ >>> + if (!brw->is_haswell || >>> + brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST || >>> + brw->accurate_derivative) { >>> + vstride = BRW_VERTICAL_STRIDE_2; >>> + width = BRW_WIDTH_2; >>> + } >>> + else { >>> + vstride = BRW_VERTICAL_STRIDE_4; >>> + width = BRW_WIDTH_4; >>> + } >>> + >>> struct brw_reg src0 = brw_reg(src.file, src.nr, 1, >>> BRW_REGISTER_TYPE_F, >>> - BRW_VERTICAL_STRIDE_2, >>> - BRW_WIDTH_2, >>> + vstride, >>> + width, >>> BRW_HORIZONTAL_STRIDE_0, >>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW); >>> struct brw_reg src1 = brw_reg(src.file, src.nr, 0, >>> BRW_REGISTER_TYPE_F, >>> - BRW_VERTICAL_STRIDE_2, >>> - BRW_WIDTH_2, >>> + vstride, >>> + width, >>> BRW_HORIZONTAL_STRIDE_0, >>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW); >>> brw_ADD(p, dst, src0, negate(src1)); >>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c >>> b/src/mesa/drivers/dri/i965/intel_screen.c >>> index eb6515e..ee08ffd 100644 >>> --- a/src/mesa/drivers/dri/i965/intel_screen.c >>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c >>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] = >>> DRI_CONF_SECTION_END >>> DRI_CONF_SECTION_QUALITY >>> DRI_CONF_FORCE_S3TC_ENABLE("false") >>> + >>> + DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false") >>> + DRI_CONF_DESC(en, "Perform more accurate derivative >>> calculation") >>> + DRI_CONF_OPT_END >>> DRI_CONF_SECTION_END >>> DRI_CONF_SECTION_DEBUG >>> DRI_CONF_NO_RAST("false") >>> -- >>> 1.8.3.1 >>> >>> _______________________________________________ >>> mesa-dev mailing list >>> mesa-dev@lists.freedesktop.org >>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev >> >> >> >> _______________________________________________ >> mesa-dev mailing list >> mesa-dev@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/mesa-dev >> > -- o...@lunarg.com _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev