On Mon, Sep 23, 2013 at 12:09 PM, Chia-I Wu <olva...@gmail.com> wrote:
> On Fri, Sep 20, 2013 at 10:50 PM, Paul Berry <stereotype...@gmail.com> wrote:
>> On 17 September 2013 19:54, Chia-I Wu <olva...@gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olva...@gmail.com> wrote:
>>> > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype...@gmail.com> wrote:
>>> >> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote:
>>> >>>
>>> >>> From: Chia-I Wu <o...@lunarg.com>
>>> >>>
>>> >>> Consider only the top-left and top-right pixels to approximate DDX in
>>> >>> a 2x2 subspan, unless the application or the user requests a more
>>> >>> accurate approximation. This results in a less accurate approximation.
>>> >>> However, it improves the performance of Xonotic with Ultra settings by
>>> >>> 24.3879% +/- 0.832202% (at 95.0% confidence) on Haswell. No noticeable
>>> >>> image quality difference observed.
>>> >>>
>>> >>> No piglit gpu.tests regressions (tested with v1)
>>> >>>
>>> >>> I failed to come up with an explanation for the performance
>>> >>> difference, as the change does not affect Ivy Bridge. If anyone has
>>> >>> the insight, please kindly enlighten me. Performance differences may
>>> >>> also be observed on other games that call textureGrad and dFdx.
>>> >>>
>>> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>> >>> Update comments.
>>> >>
>>> >> I'm not entirely comfortable making a change that has a known negative
>>> >> impact on computational accuracy (even one that leads to such an
>>> >> impressive performance improvement) when we don't have any theories as
>>> >> to why the performance improvement happens, or why the improvement
>>> >> doesn't apply to Ivy Bridge. In my experience, making changes to the
>>> >> codebase without understanding why they improve things almost always
>>> >> leads to improvements that are brittle, since it's likely that the true
>>> >> source of the improvement is a coincidence that will be wiped out by
>>> >> some future change (or won't be relevant to client programs other than
>>> >> this particular benchmark). Having a theory as to why the performance
>>> >> improvement happens would help us be confident that we're applying the
>>> >> right fix under the right circumstances.
>>> > That is how I feel as I've mentioned. I am really glad to have the
>>> > discussion. I have done some experiments actually. It is just that
>>> > those experiments only tell me what theories are likely to be wrong.
>>> > They could not tell me if a theory is right.
>>> Do the experiments make sense to you? What other experiments do you
>>> want to see conducted?
>>>
>>> It could be hard to get direct proof without knowing the internal
>>> workings.
>>
>> Sorry for the slow reply. We had some internal discussions with the
>> hardware architects about this, and it appears that the first theory is
>> correct: Haswell has an optimization in its sample_d processing which
>> allows it to assume that all pixels in a 2x2 subspan will resolve to the
>> same LOD, provided that all the gradients in the 2x2 subspan are
>> sufficiently similar to each other. There's a register called SAMPLER_MODE
>> which determines how similar the gradients have to be in order to trigger
>> the optimization.
>> It can be set to values between 0 and 0x1f, where 0 (the default) means
>> "only trigger the optimization if the gradients are exactly equal" and
>> 0x1f means "trigger the optimization as frequently as possible". Obviously
>> triggering the optimization more often reduces the quality of the rendered
>> output slightly, because it forces all pixels within a 2x2 subspan to
>> sample from the same LOD.
>>
>> We believe that setting this register to 0x1f should produce an equivalent
>> speed-up to your patch, without sacrificing the quality of d/dx when it is
>> used for other (non-sample_d) purposes. This approach would have the
>> additional advantage that the benefit would apply to any shader that uses
>> the sample_d message, regardless of whether or not that shader uses d/dx
>> and d/dy to compute its gradients.
>>
>> Would you mind trying this register to see if it produces an equivalent
>> performance benefit in both your micro-benchmark and Xonotic with Ultra
>> settings? The register is located at address 07028h in register space MMIO
>> 0/2/0. When setting it, the upper 16 bits are a write mask, so to set the
>> register to 0 you would store 0x001f0000, and to set it to 0x1f you would
>> store 0x001f001f.
>>
>> Since the SAMPLER_MODE setting allows us to trade off quality vs
>> performance, we're also interested to know whether a value less than 0x1f
>> is sufficient to produce the performance improvement in Xonotic--it would
>> be nice if we could find a "sweet spot" for this setting that produces the
>> performance improvement we need without sacrificing too much quality.
> Great finding! I will see if setting the register helps.
Changing the register does not work very well. Attached is a patch to the
drm-intel-nightly branch of the kernel that exposes SAMPLER_MODE in debugfs,
if you want to play with it.
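For anyone who wants to poke the register directly rather than through the
debugfs patch, here is a minimal, hypothetical sketch of the masked write
described above (this is not the attached patch; it assumes the device 0/2/0
MMIO aperture has already been mapped to `mmio` elsewhere):

    /* Hypothetical sketch, not the attached debugfs patch. */
    #include <stdint.h>

    #define SAMPLER_MODE       0x7028  /* register offset in the MMIO space */
    #define SAMPLER_MODE_MASK  0x1f    /* valid values: 0 .. 0x1f */

    static void set_sampler_mode(volatile uint32_t *mmio, uint32_t value)
    {
        /* The upper 16 bits are the write mask: enable bits 4:0, then put
         * the new value in the low bits.  Writing 0 stores 0x001f0000,
         * writing 0x1f stores 0x001f001f, matching the description above.
         */
        uint32_t v = (SAMPLER_MODE_MASK << 16) | (value & SAMPLER_MODE_MASK);
        mmio[SAMPLER_MODE / 4] = v;
    }

For example, set_sampler_mode(mmio, 0x17) corresponds to the value used for
the screenshots below.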
Xonotic first. The frame rate peaks at ~100fps when SAMPLER_MODE is set to
0x17. This is the same number the DDX change can reach. Setting SAMPLER_MODE
to a smaller value results in a lower frame rate. Setting it to a higher
value also results in a lower frame rate, and there are easily noticeable
artifacts (likely because more texels are sampled from LOD 0).

With the modified piglit arb_shader_texture_lod-texgrad test, you can see
the artifacts when SAMPLER_MODE is 0x12, and they get worse with larger
values. I took a screenshot with the register set to 0x17. I also took a
screenshot with the DDX change applied, to see the interaction between them.

> Since my goal was to figure out why sample_d is slower than sample and
> the DDX change worked, that made me wonder if sample assumes all
> pixels in a subspan resolve to the same LOD. To understand that
> better, I modified the piglit arb_shader_texture_lod-texgrad test to
> scale and rotate the triangle around the X-axis. This was to make the
> top and bottom rows have a better chance of having different gradients.
>
> I then ran the test three times with
>
> a. unmodified driver (texture2D-texture2DGradARB.png)
> b. patched driver (texture2D-texture2DGradARB-coarse-granularity.png)
> c. both triangles rendered with texture2D (texture2D-texture2D.png)
>
> It is hard to tell the differences from the snapshots. But when you
> run ImageMagick compare on
>
> - c. and a. (before.png)
> - c. and b. (after.png)
>
> you can see that before the DDX change, the rendered triangles have
> different colors every other row. After the DDX change, there is no
> difference. With this observation, I believe sample executes at 2x2
> granularity. Not that texture2D and texture2DGrad have to behave
> exactly the same, but it is good for them to be consistent.
>
>> Finally, do you have any ability to see whether the Windows driver sets
>> this register, and if so what it sets it to? That would provide some nice
>> confirmation that we aren't barking up the wrong tree here.
>>
>> As a follow-up task, I'm planning to write a patch that improves the
>> quality of our d/dy calculation to be comparable to d/dx. Based on our
>> current understanding of what's going on, I suspect that my patch may have
>> a slight effect on the SAMPLER_MODE sweet spot. I'll try to get that patch
>> out today, and I'll Cc you so that you can try it out.
>>
>> Thanks so much for finding this, Chia-I, and thanks for your patience as
>> we've been sorting through trying to find the true explanation for the
>> performance improvement.
>>
> [snipped]
>
> --
> o...@lunarg.com

--
o...@lunarg.com
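To make the fine vs. coarse DDX discussion above concrete, here is an
illustrative sketch (not the actual i965 backend code) of the two
approximations over a single 2x2 subspan:

    /* Illustration only -- a 2x2 subspan laid out as p[0]=top-left,
     * p[1]=top-right, p[2]=bottom-left, p[3]=bottom-right.
     */
    static void ddx_fine(const float p[4], float ddx[4])
    {
        /* each row gets its own horizontal difference */
        ddx[0] = ddx[1] = p[1] - p[0];
        ddx[2] = ddx[3] = p[3] - p[2];
    }

    static void ddx_coarse(const float p[4], float ddx[4])
    {
        /* only the top-left and top-right pixels are considered; all four
         * pixels in the subspan share the same approximation */
        ddx[0] = ddx[1] = ddx[2] = ddx[3] = p[1] - p[0];
    }

With the coarse variant all four pixels in a subspan feed identical gradients
to sample_d, which is presumably what lets Haswell's same-LOD optimization
trigger even with the default SAMPLER_MODE of 0; an application can still ask
for the per-row form with glHint(GL_FRAGMENT_SHADER_DERIVATIVE_HINT,
GL_NICEST).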
0001-drm-i915-expose-register-SAMPLE_MODE-in-debugfs.patch
Description: Binary data
<<attachment: SAMPLER_MODE_0x17.png>>
<<attachment: SAMPLER_MODE_0x17-coarse-granularity.png>>