On 09/17/2013 05:13 AM, Rogovin, Kevin wrote:
> Hello,
>
> Thank you for the very fast answers, some more questions:
>
>
>> It's not a preference question. The registers are 8 floats wide.
>> Vertex shaders get invoked 2 vertices at a time, with a register containing
>> these values:
>>
>> . +------+------+------+------+------+------+------+------+
>> . | v0.x | v0.y | v0.z | v0.w | v1.x | v1.y | v1.z | v1.w |
>> . +------+------+------+------+------+------+------+------+
>
> This seems best to me: run two vertices in each invocation with the hopes
> that the shader compiler will merge (multiple) float, vec2 and maybe even
> vec3 operations into vec4 operations (does it)?

Not as well as it should. There's a lot of room for improvement in our
SIMD4x2/vector backend. We haven't spent a ton of effort optimizing it
since vertex shaders have rarely been the bottleneck in application
performance.

>> while these 8 pixels in screen space:
>>
>> . +----+----+----+----+
>> . | p0 | p1 | p2 | p3 |
>> . +----+----+----+----+
>> . | p4 | p5 | p6 | p7 |
>> . +----+----+----+----+
>>
>> are loaded in fragment shader registers as:
>>
>> . +------+------+------+------+------+------+------+------+
>> . | p0.x | p1.x | p4.x | p5.x | p2.x | p3.x | p6.x | p7.x |
>> . +------+------+------+------+------+------+------+------+
>>
>> Note how one register just holds a single channel ('.x' here) of a vector.
>> A vec4 would take up 4 registers, and to do value0.xyzw * value1.xyzw, you'd
>> emit 4 MULs.
>
> This is exactly what I was trying to ask/say about the fragment shader
> running, i.e. n-fragments are processed with 1 n-SIMD command (for i965, n=8),
> sighs my e-mail communications leave something to be desired.
> Some questions:
> 1) do the fragments need to be in a 4x2 block, or can it be two separate 2x2
> blocks?

The GPU processes two separate 2x2 blocks of pixels, which may actually
not be anywhere near each other.

> 2) for tiny triangles for fragment shaders that do not require dFdx, dFdy or
> fwidth, can the fragments be totally scattered?

Nope, the pixel shader always works on 2x2 blocks.

> Along further lines, for non-dependent texture lookups, are there code lines
> where the derivatives are computed analytically so that selecting the
> correct LOD does not require to process fragments in 2x2 (or larger) blocks?
> Or does the i965 hardware sampler interface not allow this kind of madness?
>
>>> On a related note, where are the beans about the dispatch table?
>> I don't know this one (or particularly what you're asking, I guess).
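
To make the SIMD8 fragment layout quoted above a bit more concrete, here is
a minimal C sketch of it. This is only a model for illustration: the names
(simd8_reg, simd8_vec4, lane_to_pixel, lane_to_block, simd8_mul,
simd8_vec4_mul) are hypothetical and are not Mesa or hardware identifiers.
The lane order is copied from the diagram, where the two 2x2 blocks happen
to sit side by side; as noted above, they need not be adjacent on screen.

```c
#include <assert.h>

#define SIMD8_WIDTH 8

/* Hypothetical model of one 8-float register: one channel, 8 pixels. */
typedef struct {
    float lane[SIMD8_WIDTH];
} simd8_reg;

/* A vec4 value for 8 pixels: four registers, one per channel (x, y, z, w). */
typedef struct {
    simd8_reg chan[4];
} simd8_vec4;

/* Lane -> pixel number, copied from the diagram:
 * p0 p1 p4 p5 | p2 p3 p6 p7 */
static const int lane_to_pixel[SIMD8_WIDTH] = { 0, 1, 4, 5, 2, 3, 6, 7 };

/* Lanes 0-3 hold the first 2x2 block, lanes 4-7 the second. */
static int lane_to_block(int lane)
{
    assert(lane >= 0 && lane < SIMD8_WIDTH);
    return lane / 4;
}

/* Models one MUL: the same channel of all 8 pixels is multiplied at once. */
static simd8_reg simd8_mul(simd8_reg a, simd8_reg b)
{
    simd8_reg r;
    for (int i = 0; i < SIMD8_WIDTH; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* value0.xyzw * value1.xyzw: one MUL per channel, so 4 MULs in total.
 * *mul_count is incremented once per modeled MUL. */
static simd8_vec4 simd8_vec4_mul(simd8_vec4 a, simd8_vec4 b, int *mul_count)
{
    simd8_vec4 r;
    for (int c = 0; c < 4; c++) {
        r.chan[c] = simd8_mul(a.chan[c], b.chan[c]);
        (*mul_count)++;
    }
    return r;
}
```

Calling simd8_vec4_mul() with a counter initialized to 0 leaves the counter
at 4, which is the "4 MULs" from the quoted explanation: the vector width is
spent across pixels, not across the channels of one vec4.
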
>
> Viewing docs/index.html, on the side panel "Developer Topics --> GL
> Dispatch" there is text (broken into sections "1. Complexity of GL
> Dispatch", "2. Overview of Mesa's Implementation" and "3. Optimizations")
> describing how different GL contexts for the same hardware can do
> different things for the same GL function and that mesa has stubs which
> in turn call the "real" function. The documents go on to talk about
> various ways the function tables are filled and accessed across separate
> threads. My questions are:
> 0) is that information text still accurate? In particular, the directory
> src/glapi is gone from Mesa (at least what I git cloned) and I thought that
> was the location of it.
> 1) where/how does the i965 driver fill that table, if it exists?
>
> Along similar lines, I see that some of the code in src/mesa/main performs
> various checks of various API calls and at times has some conditions
> dependent on what context type it is, which kind of contradicts the idea of
> different contexts having different dispatch tables [sort of, since the
> functions might just be the driver magick, whereas the stub is validate and
> then call driver magick].
>
> -Kevin
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev