On 20.03.2013 17:05, Roland Scheidegger wrote:
> On 20.03.2013 15:41, Christoph Bumiller wrote:
>> Sorry, this has become longer than I anticipated ...
>>
>> I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays
>> because, since I cannot allocate varyings in the same order that the
>> register index specifies, I need it:
>>
>> ===
>> EXAMPLE:
>> OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
>> output space
>> OUT[1], CLIPDIST[0], 0x2d0
>> OUT[2], GENERIC[0], between 0x80 and 0x280
>> OUT[3], GENERIC[1], between 0x80 and 0x280
>>
>> And without array specification
>> MOV OUT[TEMP[0].x-1], IMM[0]
>> would leave me no clue as to whether to use 0x80 or 0x2c0 as base
>> address.
>> ===
>>
>> Now that I'm on it, I'm considering going a step further, which is
>> adding indirect scalar/component access.
>> This is motivated by float gl_ClipDistance[], which, if accessed
>> indirectly, currently leaves us no choice but to generate code like
>> this:
>>
>> if ((index & 3) == 0) access x component; else
>> if ((index & 3) == 1) access y component; ...
>>
>> This is undesirable and the hardware can do better (it actually
>> supports accessing individual components, since address registers
>> contain an address in bytes and we can do scalar read/write).
>>
>> A second motivation is varying packing, which is required by the GL
>> spec, and may lead to use of TEMP arrays, which, albeit improved now,
>> will impair performance when used (on nv50 they go to uncached memory
>> which is very slow).
>>
>> That case occurs if, for instance, a varying float[8] is accessed
>> indirectly and has to be packed into
>> OUT[0..1].xyzw, GENERIC[0..1]
>> instead of
>> OUT[0..7].x, GENERIC[0..7]
>>
>> So far I've come up with 2 choices (all available only if the driver
>> supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):
>>
>>
>> 1. SCALAR DECLARATIONS
>>
>> Using float gl_ClipDistance[8] as example, it could be declared as:
>>
>> OUT[0..7].x, CLIPDIST, ARRAY(1)
>> where the .x now means that it's a single component per OUT[index]
>>
>> Now this obviously means that a single OUT[i] doesn't always consume 16
>> bytes / 4 components anymore, which may be somewhat disturbing, since
>> the address of an output can't be directly inferred solely from its
>> index anymore.
>> However, that doesn't really constitute a problem if all access is
>> either direct or comes with an ARRAY() reference.
>>
>> For varying packing, which happens only for user defined variables, and
>> hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:
>>
>> (NOTE: GL requires us to be able to support exactly the number of
>> components we report, failing due to alignment is not allowed. Hence the
>> GLSL compiler may put some variables at unaligned locations, see
>> ir_variable.location_frac):
>>
>> A GENERIC semantic index should always cover 4 components so that a
>> fixed location can be assigned for it (drivers usually do this since it
>> makes an extra dynamic linkage pass when shaders are changed
>> unnecessary, as intended by GL_ARB_separate_shader_objects).
>>
>> So, this would be valid:
>> OUT[0..3].x, GENERIC[0]
>> OUT[4..5].xy, GENERIC[1]
>> OUT[6], GENERIC[2]
>> Note how the 4 registers OUT[0..3] only consume 1 GENERIC[index].
>>
>> If we instead allocated a semantic index per register index rather
>> than per 4 components, we would have:
>> OUT[0..3].x, GENERIC[0]
>> OUT[4..5].xy, GENERIC[4]
>> OUT[6], GENERIC[6]
>> This would >waste space<, since GENERIC[4,6] would have to go to
>> output_space[addresses 0x40, 0x60] so it could link with
>> IN[6], GENERIC[6]
>> where we have no information about the size of GENERIC[0 .. 5], and
>> wasting space like that means the advertised number of varying
>> components cannot be satisfied.
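
To make the comparison between the two semantic index schemes concrete,
here is a small Python sketch (not Mesa code). It reproduces the
GENERIC indices from the two listings above and counts how many 16-byte
slots each scheme spans. Everything in it is an assumption made for
illustration: the (register count, components per register) tuples, the
function names, the idea that GENERIC[n] sits at a fixed offset of
16 * n bytes from the start of the generic area, and that every
declaration starts on a vec4 boundary (true for this example only).

```python
# Hypothetical model of the declarations discussed above:
# (number of registers, components per register)
DECLS = [(4, 1), (2, 2), (1, 4)]  # OUT[0..3].x, OUT[4..5].xy, OUT[6]

SLOT_BYTES = 16  # assumed: one semantic index covers 4 x 32-bit components


def semantic_per_vec4(decls):
    """First GENERIC index of each declaration, one index per 4 components."""
    indices, comp = [], 0
    for count, width in decls:
        indices.append(comp // 4)
        comp += count * width
    return indices


def semantic_per_register(decls):
    """First GENERIC index of each declaration, one index per register."""
    indices, reg = [], 0
    for count, _width in decls:
        indices.append(reg)
        reg += count
    return indices


def slots_spanned(decls, first_indices):
    """16-byte slots spanned when GENERIC[n] has a fixed location n."""
    end = 0
    for (count, width), first in zip(decls, first_indices):
        slots = -(-(count * width) // 4)  # ceil(components / 4)
        end = max(end, first + slots)
    return end


packed = semantic_per_vec4(DECLS)       # [0, 1, 2]
sparse = semantic_per_register(DECLS)   # [0, 4, 6]
print(packed, slots_spanned(DECLS, packed))   # 3 slots for 12 components
print(sparse, slots_spanned(DECLS, sparse))   # 7 slots for 12 components
print([hex(SLOT_BYTES * i) for i in sparse])  # ['0x0', '0x40', '0x60']
```

Under these assumptions the per-register scheme spans 7 slots for the
same 12 components that fit into 3 slots with one index per 4
components, which is the waste described above.
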
>>
>>
>> And as a last step, if varyings are placed at non-vec4 boundaries, we
>> would have to be able to specify fractional semantic indices, like this:
>> OUT[0..2].x, GENERIC[0].x
>> OUT[3].x, GENERIC[0].w
>>
>>
>>
>> 2. SCALAR ADDRESS REGISTER VALUES
>>
>> All this can be avoided by always declaring full vec4s, and adding the
>> possibility of doing indirect addressing on a per-component basis:
>>
>> varying float a[4] becomes:
>> uniform int i;
>> a[i+5] = 999 becomes:
>>
>> OUT[0].xyzw, ARRAY(1)
>> UARL_SCALAR ADDR[0].x, CONST[0].xxxx
>> MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
>>
>> The only difficulty with this is that we have to split TGSI
>> instructions that access unaligned vectors:
>> (NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
>> may have to be packed).
>>
>> With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
>> single vec2 would look like this:
>> OUT[0..3].xyz, GENERIC[0].x
>> OUT[4..5].xyz, GENERIC[3].x
>> OUT[6].xy, GENERIC[4].zw
>> and we could still do:
>> ADD OUT[5].xyz, TEMP[0], TEMP[1]
>>
>> Now, these would have to be merged and declared as:
>> OUT[0..4].xyzw
>>
>> and the 2nd vec3 would be { OUT[0].w, OUT[1].xy }
>>
>> instead of simply OUT[1].xyz
>>
>> A problem with this is that the GLSL compiler, while it can do the
>> packing into vec4s and splitting up access, cannot, iirc, access
>> individual components of a vec4 indirectly like TGSI would be able to.
>> To avoid TEMP arrays we'd have to disable the last phase of varying
>> packing (that actually converts the code to using vec4s).
>> It would still be able to assign fractional locations to guarantee that
>> linkage works, but glsl-to-tgsi would likely have to split access at
>> vec4 boundaries itself (more work), and declare the whole packed range
>> as a single TGSI array.
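
The splitting at vec4 boundaries mentioned for suggestion (2) can be
sketched in a few lines of Python. This is only an illustration, not
glsl-to-tgsi code: the function names are made up, components are
numbered flat from the start of the merged OUT[0..4].xyzw range, and a
component is assumed to be 4 bytes (16 bytes / 4 components), so the
byte offset is what a byte-addressed address register would hold.

```python
COMPONENTS = "xyzw"


def locate(component):
    """Map a flat component offset to (register, component, byte offset)."""
    return component // 4, COMPONENTS[component % 4], 4 * component


def split_access(first_component, width):
    """Split `width` consecutive components into (register, writemask) pieces.

    A piece never crosses a vec4 boundary, so each one maps to a single
    instruction with a plain writemask.
    """
    pieces = []
    comp = first_component
    end = first_component + width
    while comp < end:
        reg = comp // 4
        stop = min(end, (reg + 1) * 4)  # clip at the next vec4 boundary
        mask = "".join(COMPONENTS[c % 4] for c in range(comp, stop))
        pieces.append((reg, mask))
        comp = stop
    return pieces


# Constant part of a[i+5]: flat offset 5 is register +1, component y.
print(locate(5))            # (1, 'y', 20)

# 2nd vec3 of the first packed array: flat components 3..5.
print(split_access(3, 3))   # [(0, 'w'), (1, 'xy')]

# First vec3 of the second array starts at GENERIC[3].x (component 12).
print(split_access(12, 3))  # [(3, 'xyz')]

# The vec2 at GENERIC[4].zw (components 18..19).
print(split_access(18, 2))  # [(4, 'zw')]
```

Only accesses that straddle a vec4 boundary produce more than one
piece, so aligned vectors would still be written by a single
instruction.
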
>> However, assuming that varyings with the *same* semantic can always be
>> assigned to contiguous slots (output memory space locations) by the
>> driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
>> varyings), the problem in the example at the top shouldn't arise, and
>> we're able to group all those into a single array.
>>
>>
>> Now, I hope someone was able to get through this and would like to
>> comment :)
> Not sure I fully understand this, but I'm thinking "whenever in doubt,
> use something close to what dx10 does" since that's likely going to work
> reasonably with different hw. Maybe declaring those special values
> differently (not just as output reg) would help?

What DX10 does is make indirect access of varyings illegal.
That's not possible with OpenGL ...

> Roland
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev