Am 29.10.2015 um 19:44 schrieb Oded Gabbay: > On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <srol...@vmware.com> > wrote: >> Am 29.10.2015 um 12:27 schrieb Oded Gabbay: >>> Hi Roland, Jose >>> >>> I wanted to bring a problem I found to your attention, and discuss >>> about it and ways to solve it. >>> >>> I'm working on regressions of piglit gpu.py between x86-64 and >>> ppc64le, when running with llvmpipe. >>> >>> One of the regressions manifests itself in 2 tests, >>> clip-distance-bulk-copy and clip-distance-itemized-copy >> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which >> fails the same (and it's easier to debug...). >> > Yes, you are correct of course. It was also fixed by my change. > >> >>> >>> What happens in ppc64le is that *some* of the interpolated per-vertex >>> values (clip-distance values) that constitute the input into the >>> fragment shader, are not calculated according to the required accuracy >>> of the test (which is an error/distance of less than 1e-6) >>> >>> It took a while, but I believe I found the reason. In general, when >>> running on ppc64le, the code path that is taken at >>> lp_build_interp_soa_init() is doing bld->simple_interp = FALSE, which >>> means using coeffs_init() + attribs_update(). That's in contrast with >>> x86-64, where bld->simple_interp = TRUE, which means the code will use >>> coeffs_init_simple() + attribs_update_simple() >> Note this isn't really about x86-64 per se. If you don't have AVX on a >> x86-64 box, the other path will be taken too (and the test fail as well). >> > makes sense. > >>> >>> In specific, the problem lies in how the coeffs are calculated inside >>> coeffs_init. To simply put it, they are only *actually* calculated >>> correctly for the first quad. The coeffs are later *updated* for the >>> other quads, using dadx/dady and dadq. >>> >>> ------ >>> side-note: >>> I believe you even documented it in coeffs_init(): >>> /* >>> * a = a0 + (x * dadx + y * dady) >>> * a0aos is the attrib value at top left corner of stamp >>> */ >>> ------ >>> >>> The root cause for that, I believe, is because coeffs_init does *not* >>> take into account the different x/y offsets for the other quads. In >>> contrast, on x86-64, the code uses a function called calc_offset(), >>> which does take the different x/y offsets into account. >>> >>> Please look at this definition (from lp_bld_interp.c) : >>> >>> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2, >>> 3, 0, 1, 0, 1, 2, 3, 2, 3}; >>> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1, >>> 1, 2, 2, 3, 3, 2, 2, 3, 3}; >>> >>> Now, look at how coeffs_init() uses them: >>> >>> for (i = 0; i < coeff_bld->type.length; i++) { >>> LLVMValueRef nr = lp_build_const_int32(gallivm, i); >>> LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]); >>> LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]); >>> pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, ""); >>> pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, ""); >>> } >>> >>> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take >>> the first four values from the above arrays. The code *never* >>> calculates the coeffs based on the other offsets. >>> >>> Admittedly, that's *almost* always works and I doubt you will see a >>> difference in display. However, in the tests I mentioned above, it >>> creates, for some pixels, a bigger error than what is allowed. >> Well, I'm not sure what error actually is allowed... GL of course is >> vague (the test just seems to be built so the error allowed actually >> passes on real hw). >> >> I don't think there's an actual bug with the code. What it does in >> coeffs_init, is simply calculate the dadx + dady values from one pixel >> to the next (within a quad). >> Then, in attrib_update, it simply uses these values (which are constant >> for all quads, as the position of the pixels within a quad doesn't >> change) to calculate the actual dadq value based on the actual pixel >> offsets, and add that to the (also done in coeffs_init) starting value >> of each quad. >> >> So, the code is correct. However, it still gives different results >> compared to the other method, and the reason is just float math. This >> method does two interpolations (one for the start of the quad, the other >> per pixel), whereas the other method does just one. And this is why the >> results are different, because the results just aren't exact. >> > Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a > method of calculation that produces less accurate results then the > other method (the simple path). > >> >>> >>> The first thing that came into my mind was to change, in >>> lp_build_interp_soa_init(): >>> >>> "if (coeff_type.length > 4)" >>> to this: >>> "if (coeff_type.length >= 4)" >>> >>> When I did that, I saw it fixes the above tests and didn't cause any >>> regressions on ppc64le and/or on x86-64. >>> >>> However, I wanted to ask you guys if you remember why you allowed only >>> vectors with size above 4 to use the "simple" route ? >>> Was there any performance hit ? >> Some history, initially there actually only was the two-stage path (well >> actually it was slightly different as back then the pixel shader wasn't >> even executed in a loop but the idea was the same). The one-stage path >> was actually done when extending to 256bit vectors (or maybe when >> introducing the fs loop, I forgot...) because it was easier to code. >> Albeit both paths were fixed to work both with 128bit and 256bit >> vectors. And indeed, the reason why one is used over the other is just >> performance now (the two-stage approach has a higher initial cost but >> lower per-loop cost, so it's faster when the loop is run four times, but >> not when it's only run two times). >> I was actually just thinking about nuking the two-stage approach >> yesterday as it's quite annoying the results are different depending on >> the vector size. But IIRC the performance difference was quite >> measurable sometimes. >> > That's what I thought. > I'm willing to test the performance on H/W if you can point me to a > relevant test. > By running glxgears, I can see the FPS drops by about 5 FPS (from 605 > to 600), when using the simple method. > Also, it seemed to me, but I didn't measure it, that running piglit > gpu.py was slower by some margin. glxgears is a bit _too_ simple... Something simple which uses "traditional" mul+add shader (2 interpolated attributes, one of them for texture coords) or doing something similarly simple might be a good test. gloss demo, or openarena - desktop compositors would of course be nice, but I don't think they come with benchmark mode...
Roland > >>> >>> I think that to fix coeffs_init()+attribs_update() is to basically >>> re-design them. Otherwise, I would have tried to fix them first. >> >> Well, even the one-stage approach is basically crap. The reason is that >> the start value is always based on the top left corner of the screen >> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle >> far away from the screen origin, but with steep gradients, the errors >> will be huge (but there's no piglit tests which would hit this). >> Modern hw has abandoned this way of interpolation by using barycentric >> interpolation. But it's quite complex to do fast (the hw integrates this >> into rasterization, since the barycentric weights are the same as the >> edge functions). >> Albeit it would be possible to improve this particular problem without >> using barycentric interpolation (could for instance use one corner of >> the triangle as start value instead of 0,0 but adds complexity as well). >> But the coeffs_init/update functions as such are unfixable as it's just >> float math precision issues (well you could try using fma instead of >> mul+add, use double precision, ...) but the fundamental issue that it's >> just less precise doing two-step interpolation instead of one-step remains. >> >> >>> >>> In any case, I would love to hear your feedback and to discuss this further. >>> >> >> Do you see any specific problem outside the 3 failing piglit tests? >> Because as far as I can tell, the error allowed in the tests was really >> just chosen so it works on ordinary hw >> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error >> than the other two). >> So, I was reluctant to change it just because of these piglits, but this >> isn't set in stone... >> >> Roland >> > No, I couldn't see any other problem that relates to this, as I > understood the error was so small (the results of the interpolation > were off by about 5e-7 then x86-64). > I also run some OpenGL demo that uses gl_ClipDistance and I didn't see > any visual problem. > > However, I would hate to keep the situation as is, meaning the test > passes on x86-64 and fails on ppc64le. > If we agree that there is no actual bug in the code, then another > possibility is to change the failure criteria. Because we know the > maximum deviation between the two methods is 5e-7, we can add 5e-7 (or > 0.5e-6) to the criteria of the three above mentioned tests. That would > make them pass even with the two-stages approach. > > What do you think ? > > Oded > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev