On 09/20/2013 09:50 AM, Paul Berry wrote:
> On 17 September 2013 19:54, Chia-I Wu <olva...@gmail.com> wrote:
>> Hi Paul,
>>
>> On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olva...@gmail.com> wrote:
>> On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype...@gmail.com> wrote:
>>> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote:
>>>> From: Chia-I Wu <o...@lunarg.com>
>>>>
>>>> Consider only the top-left and top-right pixels to approximate DDX
>>>> in a 2x2 subspan, unless the application or the user requests a more
>>>> accurate approximation. This results in a less accurate
>>>> approximation, but it improves the performance of Xonotic with Ultra
>>>> settings by 24.3879% +/- 0.832202% (at 95.0% confidence) on Haswell.
>>>> No noticeable image quality difference was observed.
>>>>
>>>> No piglit gpu.tests regressions (tested with v1).
>>>>
>>>> I have failed to come up with an explanation for the performance
>>>> difference, as the change does not affect Ivy Bridge. If anyone has
>>>> insight, please kindly enlighten me. Performance differences may
>>>> also be observed in other games that call textureGrad and dFdx.
>>>>
>>>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>>>     Update comments.
>>>
>>> I'm not entirely comfortable making a change that has a known
>>> negative impact on computational accuracy (even one that leads to
>>> such an impressive performance improvement) when we don't have any
>>> theories as to why the performance improvement happens, or why the
>>> improvement doesn't apply to Ivy Bridge. In my experience, making
>>> changes to the codebase without understanding why they improve
>>> things almost always leads to improvements that are brittle, since
>>> it's likely that the true source of the improvement is a coincidence
>>> that will be wiped out by some future change (or won't be relevant
>>> to client programs other than this particular benchmark). Having a
>>> theory as to why the performance improvement happens would help us
>>> be confident that we're applying the right fix under the right
>>> circumstances.
>>
>> That is how I feel too, as I've mentioned, and I am really glad to
>> have this discussion. I have actually done some experiments, but they
>> only tell me which theories are likely to be wrong; they cannot tell
>> me whether a theory is right.
>>
>> Do the experiments make sense to you? What other experiments would
>> you like to see conducted?
>>
>> It could be hard to get direct proof without knowing the hardware's
>> internal workings.
>
> Sorry for the slow reply. We had some internal discussions with the
> hardware architects about this, and it appears that the first theory
> is correct: Haswell has an optimization in its sample_d processing
> which allows it to assume that all pixels in a 2x2 subspan will
> resolve to the same LOD, provided that all the gradients in the 2x2
> subspan are sufficiently similar to each other. There's a register
> called SAMPLER_MODE which determines how similar the gradients have
> to be in order to trigger the optimization. It can be set to values
> between 0 and 0x1f, where 0 (the default) means "only trigger the
> optimization if the gradients are exactly equal" and 0x1f means
> "trigger the optimization as frequently as possible".
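That condition would also explain why the patch helps even with
SAMPLER_MODE left at its default of 0: the coarse approximation makes all
four d/dx gradients in a subspan exactly equal by construction (and d/dy,
which as Paul notes below is currently lower quality, is already uniform
across the subspan). For illustration, here is a minimal sketch of the two
d/dx variants in plain C; the names are mine, not Mesa's, and the usual
subspan pixel order of top-left, top-right, bottom-left, bottom-right is
assumed:

    /* Per-row ("fine") d/dx: each row of the 2x2 subspan gets its own
     * horizontal difference.  v[] holds one variable's values for the
     * pixels 0 = top-left, 1 = top-right, 2 = bottom-left,
     * 3 = bottom-right. */
    static void dfdx_fine(const float v[4], float ddx[4])
    {
        ddx[0] = ddx[1] = v[1] - v[0];   /* top row    */
        ddx[2] = ddx[3] = v[3] - v[2];   /* bottom row */
    }

    /* The patch's "coarse" d/dx: the top row's difference is reused for
     * all four pixels, so the gradients within a subspan are always
     * exactly equal. */
    static void dfdx_coarse(const float v[4], float ddx[4])
    {
        ddx[0] = ddx[1] = ddx[2] = ddx[3] = v[1] - v[0];
    }

On the API side, an application that wants the per-row variant back can
request it with glHint(GL_FRAGMENT_SHADER_DERIVATIVE_HINT, GL_NICEST),
which is what the v2 patch honors.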
> Obviously, triggering the optimization more often reduces the quality
> of the rendered output slightly, because it forces all pixels within a
> 2x2 subspan to sample from the same LOD.
>
> We believe that setting this register to 0x1f should produce a speed-up
> equivalent to your patch, without sacrificing the quality of d/dx when
> it is used for other (non-sample_d) purposes. This approach would have
> the additional advantage that the benefit would apply to any shader
> that uses the sample_d message, regardless of whether or not that
> shader uses d/dx and d/dy to compute its gradients.
>
> Would you mind trying this register to see if it produces an equivalent
> performance benefit in both your micro-benchmark and Xonotic with Ultra
> settings? The register is located at address 07028h in register space
> MMIO 0/2/0. When setting it, the upper 16 bits are a write mask, so to
> set the register to 0 you would store 0x001f0000, and to set it to 0x1f
> you would store 0x001f001f.
>
> Since the SAMPLER_MODE setting allows us to trade off quality vs.
> performance, we're also interested to know whether a value less than
> 0x1f is sufficient to produce the performance improvement in Xonotic;
> it would be nice if we could find a "sweet spot" for this setting that
> produces the performance improvement we need without sacrificing too
> much quality.
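If it helps while experimenting, the masked-write encoding described above
fits in a one-line helper. A sketch with made-up names; the offset, field
width, and mask layout are taken from Paul's description:

    #include <stdint.h>

    #define SAMPLER_MODE_REG   0x7028   /* in MMIO space 0/2/0 */
    #define SAMPLER_MODE_MASK  0x1f     /* legal values: 0 .. 0x1f */

    /* The upper 16 bits select which low bits to update; the low bits
     * carry the new value. */
    static uint32_t sampler_mode_write_value(uint32_t mode)
    {
        return (SAMPLER_MODE_MASK << 16) | (mode & SAMPLER_MODE_MASK);
    }

    /* sampler_mode_write_value(0)    == 0x001f0000
     * sampler_mode_write_value(0x1f) == 0x001f001f */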
How about we just add a driconf option to adjust it? Then gamers can make
their own choice. For applications where we know it makes a big
difference, we can ship a non-zero default in the system-wide driconf. (A
sketch of what the user-side entry might look like is at the end of this
mail.)

> Finally, do you have any ability to see whether the Windows driver sets
> this register, and if so, what it sets it to? That would provide some
> nice confirmation that we aren't barking up the wrong tree here.
>
> As a follow-up task, I'm planning to write a patch that improves the
> quality of our d/dy calculation to be comparable to d/dx. Based on our
> current understanding of what's going on, I suspect that my patch may
> have a slight effect on the SAMPLER_MODE sweet spot. I'll try to get
> that patch out today, and I'll Cc you so that you can try it out.
>
> Thanks so much for finding this, Chia-I, and thanks for your patience
> as we've been sorting through trying to find the true explanation for
> the performance improvement.
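The user-side override would then be an ordinary drirc entry. A sketch,
assuming a hypothetical option named "sampler_mode" (value 31 == 0x1f);
the executable name is a guess I haven't checked:

    <driconf>
        <device screen="0" driver="i965">
            <application name="Xonotic" executable="xonotic-glx">
                <option name="sampler_mode" value="31" />
            </application>
        </device>
    </driconf>

A system-wide default would go in /etc/drirc the same way, scoped to the
applications where the speed-up is known to matter.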