On Mon, Nov 2, 2015 at 7:08 PM, Roland Scheidegger <srol...@vmware.com> wrote: > Am 02.11.2015 um 14:46 schrieb Oded Gabbay: >> On Thu, Oct 29, 2015 at 9:37 PM, Roland Scheidegger <srol...@vmware.com> >> wrote: >>> Am 29.10.2015 um 19:44 schrieb Oded Gabbay: >>>> On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <srol...@vmware.com> >>>> wrote: >>>>> Am 29.10.2015 um 12:27 schrieb Oded Gabbay: >>>>>> Hi Roland, Jose >>>>>> >>>>>> I wanted to bring a problem I found to your attention, and discuss >>>>>> about it and ways to solve it. >>>>>> >>>>>> I'm working on regressions of piglit gpu.py between x86-64 and >>>>>> ppc64le, when running with llvmpipe. >>>>>> >>>>>> One of the regressions manifests itself in 2 tests, >>>>>> clip-distance-bulk-copy and clip-distance-itemized-copy >>>>> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which >>>>> fails the same (and it's easier to debug...). >>>>> >>>> Yes, you are correct of course. It was also fixed by my change. >>>> >>>>> >>>>>> >>>>>> What happens in ppc64le is that *some* of the interpolated per-vertex >>>>>> values (clip-distance values) that constitute the input into the >>>>>> fragment shader, are not calculated according to the required accuracy >>>>>> of the test (which is an error/distance of less than 1e-6) >>>>>> >>>>>> It took a while, but I believe I found the reason. In general, when >>>>>> running on ppc64le, the code path that is taken at >>>>>> lp_build_interp_soa_init() is doing bld->simple_interp = FALSE, which >>>>>> means using coeffs_init() + attribs_update(). That's in contrast with >>>>>> x86-64, where bld->simple_interp = TRUE, which means the code will use >>>>>> coeffs_init_simple() + attribs_update_simple() >>>>> Note this isn't really about x86-64 per se. If you don't have AVX on a >>>>> x86-64 box, the other path will be taken too (and the test fail as well). >>>>> >>>> makes sense. >>>> >>>>>> >>>>>> In specific, the problem lies in how the coeffs are calculated inside >>>>>> coeffs_init. To simply put it, they are only *actually* calculated >>>>>> correctly for the first quad. The coeffs are later *updated* for the >>>>>> other quads, using dadx/dady and dadq. >>>>>> >>>>>> ------ >>>>>> side-note: >>>>>> I believe you even documented it in coeffs_init(): >>>>>> /* >>>>>> * a = a0 + (x * dadx + y * dady) >>>>>> * a0aos is the attrib value at top left corner of stamp >>>>>> */ >>>>>> ------ >>>>>> >>>>>> The root cause for that, I believe, is because coeffs_init does *not* >>>>>> take into account the different x/y offsets for the other quads. In >>>>>> contrast, on x86-64, the code uses a function called calc_offset(), >>>>>> which does take the different x/y offsets into account. >>>>>> >>>>>> Please look at this definition (from lp_bld_interp.c) : >>>>>> >>>>>> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2, >>>>>> 3, 0, 1, 0, 1, 2, 3, 2, 3}; >>>>>> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1, >>>>>> 1, 2, 2, 3, 3, 2, 2, 3, 3}; >>>>>> >>>>>> Now, look at how coeffs_init() uses them: >>>>>> >>>>>> for (i = 0; i < coeff_bld->type.length; i++) { >>>>>> LLVMValueRef nr = lp_build_const_int32(gallivm, i); >>>>>> LLVMValueRef pixxf = lp_build_const_float(gallivm, >>>>>> quad_offset_x[i]); >>>>>> LLVMValueRef pixyf = lp_build_const_float(gallivm, >>>>>> quad_offset_y[i]); >>>>>> pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, ""); >>>>>> pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, ""); >>>>>> } >>>>>> >>>>>> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take >>>>>> the first four values from the above arrays. The code *never* >>>>>> calculates the coeffs based on the other offsets. >>>>>> >>>>>> Admittedly, that's *almost* always works and I doubt you will see a >>>>>> difference in display. However, in the tests I mentioned above, it >>>>>> creates, for some pixels, a bigger error than what is allowed. >>>>> Well, I'm not sure what error actually is allowed... GL of course is >>>>> vague (the test just seems to be built so the error allowed actually >>>>> passes on real hw). >>>>> >>>>> I don't think there's an actual bug with the code. What it does in >>>>> coeffs_init, is simply calculate the dadx + dady values from one pixel >>>>> to the next (within a quad). >>>>> Then, in attrib_update, it simply uses these values (which are constant >>>>> for all quads, as the position of the pixels within a quad doesn't >>>>> change) to calculate the actual dadq value based on the actual pixel >>>>> offsets, and add that to the (also done in coeffs_init) starting value >>>>> of each quad. >>>>> >>>>> So, the code is correct. However, it still gives different results >>>>> compared to the other method, and the reason is just float math. This >>>>> method does two interpolations (one for the start of the quad, the other >>>>> per pixel), whereas the other method does just one. And this is why the >>>>> results are different, because the results just aren't exact. >>>>> >>>> Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a >>>> method of calculation that produces less accurate results then the >>>> other method (the simple path). >>>> >>>>> >>>>>> >>>>>> The first thing that came into my mind was to change, in >>>>>> lp_build_interp_soa_init(): >>>>>> >>>>>> "if (coeff_type.length > 4)" >>>>>> to this: >>>>>> "if (coeff_type.length >= 4)" >>>>>> >>>>>> When I did that, I saw it fixes the above tests and didn't cause any >>>>>> regressions on ppc64le and/or on x86-64. >>>>>> >>>>>> However, I wanted to ask you guys if you remember why you allowed only >>>>>> vectors with size above 4 to use the "simple" route ? >>>>>> Was there any performance hit ? >>>>> Some history, initially there actually only was the two-stage path (well >>>>> actually it was slightly different as back then the pixel shader wasn't >>>>> even executed in a loop but the idea was the same). The one-stage path >>>>> was actually done when extending to 256bit vectors (or maybe when >>>>> introducing the fs loop, I forgot...) because it was easier to code. >>>>> Albeit both paths were fixed to work both with 128bit and 256bit >>>>> vectors. And indeed, the reason why one is used over the other is just >>>>> performance now (the two-stage approach has a higher initial cost but >>>>> lower per-loop cost, so it's faster when the loop is run four times, but >>>>> not when it's only run two times). >>>>> I was actually just thinking about nuking the two-stage approach >>>>> yesterday as it's quite annoying the results are different depending on >>>>> the vector size. But IIRC the performance difference was quite >>>>> measurable sometimes. >>>>> >>>> That's what I thought. >>>> I'm willing to test the performance on H/W if you can point me to a >>>> relevant test. >>>> By running glxgears, I can see the FPS drops by about 5 FPS (from 605 >>>> to 600), when using the simple method. >>>> Also, it seemed to me, but I didn't measure it, that running piglit >>>> gpu.py was slower by some margin. >>> glxgears is a bit _too_ simple... >>> Something simple which uses "traditional" mul+add shader (2 interpolated >>> attributes, one of them for texture coords) or doing something similarly >>> simple might be a good test. gloss demo, or openarena - desktop >>> compositors would of course be nice, but I don't think they come with >>> benchmark mode... >>> >>> Roland >>> >> >> Hi Roland, >> I run a couple of tests on ppc64le machine and my haswell laptop with >> LP_NATIVE_VECTOR_WIDTH=128: >> >> - glxspheres, on ppc64le >> - original code: 4.892745317 frames/sec 5.460303857 Mpixels/sec >> - with my change: 4.932083873 frames/sec 5.504205571 Mpixels/sec >> - difference is 0.8% *for* the change >> >> - glxspheres, on x86-64 >> - original code: 20.16418809 frames/sec 22.50323395 Mpixels/sec >> - with my change: 20.31328989 frames/sec 22.66963152 Mpixels/sec >> - difference is 0.74% *for* the change >> >> - glmark2, on ppc64le: >> - original code: score of 58 >> - with my change: score of 57 >> >> - glmark2, on x86-64: >> - original code: score of 175 >> - with my change: score of 167 >> - difference of 4.5% *against* the change >> >> - phoronix gui-toolkits (don't know how relevant it is), only on ppc64le >> - very similar results, some tests better with original code, some are >> worse >> >> Unfortunately, couldn't manage to run OpenArena on my machines with >> latest mesa and llvmpipe. Kept getting an error of: "Your system >> currently is not capable of hardware accelerated 3D. Therefore >> openarena cannot run." > Sounds like it couldn't load the driver - IIRC it will refuse falling > back to indirect rendering. Perhaps using the wrong binary (32bit vs. > 64bit) without having both loadable 32bit and 64bit drivers? > So airlied helped me and I was able to run OpenArena (problem was with /usr/share/opengl-games-utils/opengl-game-functions.sh)
Here are the results: ppc64le: - original code: 3398 frames 1719.0 seconds 2.0 fps 255.0/505.9/2773.0/0.0 ms - with my change: 3398 frames 1690.4 seconds 2.0 fps 241.0/497.5/2563.0/0.2 ms - 29 seconds faster with single stage, which is about 2% faster x86-64: - original code: 3398 frames 239.6 seconds 14.2 fps 38.0/70.5/719.0/14.6 ms - with my change: 3398 frames 244.4 seconds 13.9 fps 38.0/71.9/697.0/14.3 ms - 0.3 fps slower with single stage (about 2%) Bottom line, nothing too exciting. > But in any case, IIRC the 4.5% deficit you got with glmark2 is about > roughly what I got when I did some measurements ages ago with some other > benchmarks. Not a complete dealbreaker but still quite significant. OTOH > we don't really care all that much about 4-wide vectors on x86-64 > anymore. Interesting though it doesn't really make a difference on ppc64 > (last I checked I believe there were actually significant differences in > spilling registers overall with the different interpolation methods with > x86-64 and stuff like that could change from backend to backend as well > as different llvm versions) > >> >> What do you think ? >> Are those benchmarks comprehensive enough ? >> It seems there is almost no impact on ppc64le, while on x86-64 there >> is a minor impact. >> If you have something specific I can run, I will be happy to do it. > > > Sounds like it might be worth to just always enable the "simple" path, > just to get rid of the different rendering results. Albeit at this point > I'd defininitely like to keep the old code around. > > Roland OK, so I'll send a patch. Thanks, Oded > >> >> Oded >> >>> >>>> >>>>>> >>>>>> I think that to fix coeffs_init()+attribs_update() is to basically >>>>>> re-design them. Otherwise, I would have tried to fix them first. >>>>> >>>>> Well, even the one-stage approach is basically crap. The reason is that >>>>> the start value is always based on the top left corner of the screen >>>>> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle >>>>> far away from the screen origin, but with steep gradients, the errors >>>>> will be huge (but there's no piglit tests which would hit this). >>>>> Modern hw has abandoned this way of interpolation by using barycentric >>>>> interpolation. But it's quite complex to do fast (the hw integrates this >>>>> into rasterization, since the barycentric weights are the same as the >>>>> edge functions). >>>>> Albeit it would be possible to improve this particular problem without >>>>> using barycentric interpolation (could for instance use one corner of >>>>> the triangle as start value instead of 0,0 but adds complexity as well). >>>>> But the coeffs_init/update functions as such are unfixable as it's just >>>>> float math precision issues (well you could try using fma instead of >>>>> mul+add, use double precision, ...) but the fundamental issue that it's >>>>> just less precise doing two-step interpolation instead of one-step >>>>> remains. >>>>> >>>>> >>>>>> >>>>>> In any case, I would love to hear your feedback and to discuss this >>>>>> further. >>>>>> >>>>> >>>>> Do you see any specific problem outside the 3 failing piglit tests? >>>>> Because as far as I can tell, the error allowed in the tests was really >>>>> just chosen so it works on ordinary hw >>>>> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error >>>>> than the other two). >>>>> So, I was reluctant to change it just because of these piglits, but this >>>>> isn't set in stone... >>>>> >>>>> Roland >>>>> >>>> No, I couldn't see any other problem that relates to this, as I >>>> understood the error was so small (the results of the interpolation >>>> were off by about 5e-7 then x86-64). >>>> I also run some OpenGL demo that uses gl_ClipDistance and I didn't see >>>> any visual problem. >>>> >>>> However, I would hate to keep the situation as is, meaning the test >>>> passes on x86-64 and fails on ppc64le. >>>> If we agree that there is no actual bug in the code, then another >>>> possibility is to change the failure criteria. Because we know the >>>> maximum deviation between the two methods is 5e-7, we can add 5e-7 (or >>>> 0.5e-6) to the criteria of the three above mentioned tests. That would >>>> make them pass even with the two-stages approach. >>>> >>>> What do you think ? >>>> >>>> Oded >>>> >>> > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev