On Mon, Sep 16, 2013 at 4:12 PM, Chia-I Wu <olva...@gmail.com> wrote: > On Mon, Sep 16, 2013 at 3:50 AM, Mark Mueller <markkmuel...@gmail.com> wrote: >> >> >> >> On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype...@gmail.com> wrote: >>> >>> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote: >>>> >>>> From: Chia-I Wu <o...@lunarg.com> >>>> >>>> Consider only the top-left and top-right pixels to approximate DDX in a >>>> 2x2 >>>> subspan, unless the application or the user requests a more accurate >>>> approximation. This results in a less accurate approximation. However, >>>> it >>>> improves the performance of Xonotic with Ultra settings by 24.3879% +/- >>>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality >>>> difference observed. >>>> >>>> No piglit gpu.tests regressions (tested with v1) >>>> >>>> I failed to come up with an explanation for the performance difference, >>>> as the >>>> change does not affect Ivy Bridge. If anyone has the insight, please >>>> kindly >>>> enlighten me. Performance differences may also be observed on other >>>> games >>>> that call textureGrad and dFdx. >>>> >>>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option. >>>> Update >>>> comments. >>> >>> >>> I'm not entirely comfortable making a change that has a known negative >>> impact on computational accuracy (even one that leads to such an impressive >>> performance improvement) when we don't have any theories as to why the >>> performance improvement happens, or why the improvement doesn't apply to Ivy >>> Bridge. In my experience, making changes to the codebase without >>> understanding why they improve things almost always leads to improvements >>> that are brittle, since it's likely that the true source of the improvement >>> is a coincidence that will be wiped out by some future change (or won't be >>> relevant to client programs other than this particular benchmark). Having a >>> theory as to why the performance improvement happens would help us be >>> confident that we're applying the right fix under the right circumstances. >> There's another angle to approach this and that is to develop a simple test >> case that will show the different results across a range of computational >> accuracy and run the test on proprietary drivers for the same hardware to >> determine what settings they are using. > Yes, I have a little test. On Windows, rendering with texture2D() or > textureGrad() does not have a noticeable impact on the performance. > But I am not sure how to change dFdx() accuracy from the shaders. On > Linux, I did that by modifying the driver. > >> >>> >>> >>> For example, here's one theory as to why we might be seeing an >>> improvement: perhaps Haswell's sample_d processing is smart enough to >>> realize that when all the gradient values within a sub-span are the same, >>> that means that all of the sampling for the sub-span will come from the same >>> LOD, and that allows it to short-cut some expensive step in the LOD >>> calculation. Perhaps the same improvement isn't seen on Ivy Bridge because >>> Ivy Bridge's sample_d processing logic is less sophisticated, so it's unable >>> to perform the optimization. If this is the case, then conditioning the >>> optimization on brw->is_haswell (as you've done) makes sense. >>> >>> Another possible explanation for the Haswell vs Ivy Bridge difference is >>> that perhaps Ivy Bridge, being a lower-performing chip, has other >>> bottlenecks that make the optimization irrelevant for this particular >>> benchmark, but potentially still useful for other benchmarks. For instance, >>> maybe when this benchmark executes on Ivy Bridge, the texture that's being >>> sampled from is located in sufficiently distant memory that optimizing the >>> sample_d's memory accesses makes no difference, since the bottleneck is the >>> speed with which the texture can be read into cache, rather than the speed >>> of operation of sample_d. If this explanation is correct, then it might be >>> worth applying the optimization to both Ivy Bridge and Haswell (and perhaps >>> Sandy Bridge as well), since it might conceivably benefit those other chips >>> when running applications that place less cache pressure on the chip. >> >> >> This scenario is where I'd place my bets, especially given that the numbers >> are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail as >> is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra and >> ultimate levels at medium and high resolutions, the results were all >> essentially the same at comparable resolutions and quality levels. > Isn't Bay Trail based on Ivy Bridge? For Bay Trail, this might help you
http://lists.freedesktop.org/archives/mesa-dev/2013-September/044288.html if you are interested. >> >> I don't see any justification to tie this change to just Haswell hardware. >> There's all sorts of reasons why doing that sounds like a big mistake. In >> fact, another _explanation_ to add to your list is maybe there's another >> is_haswell test elsewhere in the driver that is responsible for the >> performance anomaly. > I had a look at all 'if (brw->is_haswell)' paths but nothing special > came out to me. It could as well be some place missing a 'if > (brw->is_haswell)'. > >> >>> >>> Another possibile explanation is that Haswell has a bug in its sample_d >>> logic which causes it to be slow under some conditions, and this >>> lower-accuracy DDX computation happens to work around it. If that's the >>> case, we might want to consider not using sample_d at all on Haswell, and >>> instead calculating the LOD in the shader and using sample_l instead. If >>> this is the correct explanation, then that might let us have faster >>> performance without sacrificing DDX accuracy. >>> >>> A final possible explanation for the performance improvement is that >>> perhaps for some reason sample_d performs more optimally when the DDX and >>> DDY computations have similar accuracies to each other. Before your patch, >>> our computation of DDX was more accurate than DDY; your patch decreases the >>> accuracy of DDX to match DDY. If this explanation is correct, then a better >>> solution would probably be to improve the accuracy of DDY to make it >>> comparable to DDX, rather than the other way around. >>> >>> Before we land this patch, can we do some experiments to try to figure out >>> which of these explanations (if any) is correct? >>> >>>> >>>> >>>> Signed-off-by: Chia-I Wu <o...@lunarg.com> >>>> --- >>>> src/mesa/drivers/dri/i965/brw_context.c | 1 + >>>> src/mesa/drivers/dri/i965/brw_context.h | 1 + >>>> src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 >>>> ++++++++++++++++++++++++------- >>>> src/mesa/drivers/dri/i965/intel_screen.c | 4 ++++ >>>> 4 files changed, 38 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c >>>> b/src/mesa/drivers/dri/i965/brw_context.c >>>> index 4fcc9fb..1cdfb9d 100644 >>>> --- a/src/mesa/drivers/dri/i965/brw_context.c >>>> +++ b/src/mesa/drivers/dri/i965/brw_context.c >>>> @@ -470,6 +470,7 @@ brwCreateContext(int api, >>>> brw_draw_init( brw ); >>>> >>>> brw->precompile = driQueryOptionb(&brw->optionCache, >>>> "shader_precompile"); >>>> + brw->accurate_derivative = driQueryOptionb(&brw->optionCache, >>>> "accurate_derivative"); >>>> >>>> ctx->Const.ContextFlags = 0; >>>> if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0) >>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h >>>> b/src/mesa/drivers/dri/i965/brw_context.h >>>> index c566bba..8bfc54a 100644 >>>> --- a/src/mesa/drivers/dri/i965/brw_context.h >>>> +++ b/src/mesa/drivers/dri/i965/brw_context.h >>>> @@ -964,6 +964,7 @@ struct brw_context >>>> bool always_flush_cache; >>>> bool disable_throttling; >>>> bool precompile; >>>> + bool accurate_derivative; >>>> >>>> driOptionCache optionCache; >>>> /** @} */ >>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>>> index bfb3d33..69aeab1 100644 >>>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp >>>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct >>>> brw_reg dst, struct brw_reg src >>>> * >>>> * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br >>>> * >>>> - * and we're trying to produce: >>>> + * Ideally, we want to produce: >>>> * >>>> * DDX DDY >>>> * dst: (ss0.tr - ss0.tl) (ss0.tl - ss0.bl) >>>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct >>>> brw_reg dst, struct brw_reg src >>>> * >>>> * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same >>>> result >>>> * for each pair, and vertstride = 2 jumps us 2 elements after >>>> processing a >>>> - * pair. But for DDY, it's harder, as we want to produce the pairs >>>> swizzled >>>> - * between each other. We could probably do it like ddx and swizzle the >>>> right >>>> - * order later, but bail for now and just produce >>>> + * pair. But the ideal approximation of DDX may impose a huge >>>> performance >>>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr - >>>> + * ss1.tl)x4) unless the app or the user requests otherwise. >>>> + * >>>> + * For DDY, it's harder, as we want to produce the pairs swizzled >>>> between each >>>> + * other. We could probably do it like ddx and swizzle the right order >>>> later, >>>> + * but bail for now and just produce >>>> * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4) >>>> */ >>>> void >>>> fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct >>>> brw_reg src) >>>> { >>>> + unsigned vstride, width; >>>> + >>>> + /* Produce accurate result only when requested. We emit only one >>>> + * instruction for either case, but the problem is the result may >>>> affect >>>> + * how fast sample_d executes. >>>> + * >>>> + * Since the performance difference is only observed on Haswell, >>>> ignore the >>>> + * hints on other GENs for now. >>>> + */ >>>> + if (!brw->is_haswell || >>>> + brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST || >>>> + brw->accurate_derivative) { >>>> + vstride = BRW_VERTICAL_STRIDE_2; >>>> + width = BRW_WIDTH_2; >>>> + } >>>> + else { >>>> + vstride = BRW_VERTICAL_STRIDE_4; >>>> + width = BRW_WIDTH_4; >>>> + } >>>> + >>>> struct brw_reg src0 = brw_reg(src.file, src.nr, 1, >>>> BRW_REGISTER_TYPE_F, >>>> - BRW_VERTICAL_STRIDE_2, >>>> - BRW_WIDTH_2, >>>> + vstride, >>>> + width, >>>> BRW_HORIZONTAL_STRIDE_0, >>>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW); >>>> struct brw_reg src1 = brw_reg(src.file, src.nr, 0, >>>> BRW_REGISTER_TYPE_F, >>>> - BRW_VERTICAL_STRIDE_2, >>>> - BRW_WIDTH_2, >>>> + vstride, >>>> + width, >>>> BRW_HORIZONTAL_STRIDE_0, >>>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW); >>>> brw_ADD(p, dst, src0, negate(src1)); >>>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c >>>> b/src/mesa/drivers/dri/i965/intel_screen.c >>>> index eb6515e..ee08ffd 100644 >>>> --- a/src/mesa/drivers/dri/i965/intel_screen.c >>>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c >>>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] = >>>> DRI_CONF_SECTION_END >>>> DRI_CONF_SECTION_QUALITY >>>> DRI_CONF_FORCE_S3TC_ENABLE("false") >>>> + >>>> + DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false") >>>> + DRI_CONF_DESC(en, "Perform more accurate derivative >>>> calculation") >>>> + DRI_CONF_OPT_END >>>> DRI_CONF_SECTION_END >>>> DRI_CONF_SECTION_DEBUG >>>> DRI_CONF_NO_RAST("false") >>>> -- >>>> 1.8.3.1 >>>> >>>> _______________________________________________ >>>> mesa-dev mailing list >>>> mesa-dev@lists.freedesktop.org >>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev >>> >>> >>> >>> _______________________________________________ >>> mesa-dev mailing list >>> mesa-dev@lists.freedesktop.org >>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev >>> >> > > > > -- > o...@lunarg.com -- o...@lunarg.com _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev