On 20.03.2013 17:46, Christoph Bumiller wrote:
> On 20.03.2013 17:05, Roland Scheidegger wrote:
>> On 20.03.2013 15:41, Christoph Bumiller wrote:
>>> Sorry, this has become longer than I anticipated ...
>>>
>>> I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays
>>> because, since I cannot allocate varyings in the same order that the
>>> register index specifies, I need it:
>>>
>>> ===
>>> EXAMPLE:
>>> OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
>>> output space
>>> OUT[1], CLIPDIST[0], 0x2d0
>>> OUT[2], GENERIC[0], between 0x80 and 0x280
>>> OUT[3], GENERIC[1], between 0x80 and 0x280
>>>
>>> And without array specification
>>> MOV OUT[TEMP[0].x-1], IMM[0]
>>> would leave me no clue as to whether to use 0x80 or 0x2c0 as base address.
>>> ===
>>>
>>> Now that I'm on it, I'm considering going a step further, which is
>>> adding indirect scalar/component access.
>>> This is motivated by float gl_ClipDistance[], which, if accessed
>>> indirectly, currently leaves us no choice but to generate code like this:
>>>
>>> if ((index & 3) == 0) access x component; else
>>> if ((index & 3) == 1) access y component; ...
>>>
>>> This is undesirable and the hardware can do better (as it actually
>>> supports accessing individual components since address registers contain
>>> an address in bytes and we can do scalar read/write).
>>>
>>> A second motivation is varying packing, which is required by the GL
>>> spec, and may lead to use of TEMP arrays, which, albeit improved now,
>>> will impair performance when used (on nv50 they go to uncached memory
>>> which is very slow).
>>>
>>> That case occurs if, for instance, a varying float[8] is accessed
>>> indirectly and has to be packed into
>>> OUT[0..1].xyzw, GENERIC[0..1]
>>> instead of
>>> OUT[0..7].x, GENERIC[0..7]
>>>
>>> So far I've come up with 2 choices (all available only if the driver
>>> supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):
>>>
>>>
>>> 1. SCALAR DECLARATIONS
>>>
>>> Using float gl_ClipDistance[8] as example, it could be declared as:
>>>
>>> OUT[0..7].x, CLIPDIST, ARRAY(1) where the .x now means that it's a
>>> single component per OUT[index]
>>>
>>> Now this obviously means that a single OUT[i] doesn't always consume 16
>>> bytes / 4 components anymore, which may be somewhat disturbing, since
>>> the address of an output can't be directly inferred solely from its
>>> index anymore.
>>> However, that doesn't really constitute a problem if all access is
>>> either direct or comes with an ARRAY() reference.
>>>
>>> For varying packing, which happens only for user defined variables, and
>>> hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:
>>>
>>> (NOTE: GL requires us to be able to support exactly the amount of
>>> components we report, failing due to alignment is not allowed. Hence the
>>> GLSL compiler may put some variables at unaligned locations, see
>>> ir_variable.location_frac):
>>>
>>> A GENERIC semantic index should always cover 4 components so that a
>>> fixed location can be assigned for it (drivers usually do this since it
>>> makes an extra dynamic linkage pass when shaders are changed
>>> unnecessary, as intended by GL_ARB_separate_shader_objects).
>>>
>>> So, this would be valid:
>>> OUT[0..3].x, GENERIC[0]
>>> OUT[4..5].xy, GENERIC[1]
>>> OUT[6], GENERIC[2]
>>> Note how 3 OUT[indices] only consume 1 GENERIC[index].
>>>
>>> If we, instead, allocated semantic index per register index instead of
>>> per 4 components, we would have:
>>> OUT[0..3].x, GENERIC[0]
>>> OUT[4..5].xy, GENERIC[4]
>>> OUT[6], GENERIC[6]
>>> This would >waste space<, since GENERIC[4,6] would have to go to
>>> output_space[addresses 0x40, 0x60] so it could link with
>>> IN[6], GENERIC[6]
>>> where we have no information about the size of GENERIC[0 .. 5], and
>>> wasting space like that means the advertised number of varying
>>> components cannot be satisfied.
>>>
>>>
>>> And as a last step, if varyings are placed at non-vec4 boundaries, we
>>> would have to be able to specify fractional semantic indices, like this:
>>> OUT[0..2].x, GENERIC[0].x
>>> OUT[3].x, GENERIC[0].w
>>>
>>>
>>>
>>> 2. SCALAR ADDRESS REGISTER VALUES
>>>
>>> All this can be avoided by always declaring full vec4s, and adding the
>>> possibility of doing indirect addressing on a per-component basis:
>>>
>>> varying float a[4] becomes:
>>> uniform int i;
>>> a[i+5] = 999 becomes:
>>>
>>> OUT[0].xyzw, ARRAY(1)
>>> UARL_SCALAR ADDR[0].x, CONST[0].xxxx
>>> MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
>>>
>>> The only difficulty with this is that we have to split up TGSI
>>> instructions that access unaligned vectors:
>>> (NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
>>> may have to be packed).
>>>
>>> With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
>>> single vec2 would look like this:
>>> OUT[0..3].xyz, GENERIC[0].x
>>> OUT[4..5].xyz, GENERIC[3].x
>>> OUT[6].xy, GENERIC[4].zw
>>> and we could still do:
>>> ADD OUT[5].xyz, TEMP[0], TEMP[1]
>>>
>>> Now, these would have to be merged and declared as:
>>> OUT[0..4].xyzw
>>>
>>> and the 2nd vec3 would be { OUT[0].w, OUT[1].xyz }
>>>
>>> instead of simply OUT[1].xyz
>>>
>>> A problem with this is that the GLSL compiler, while it can do the
>>> packing into vec4s and splitting up access, cannot, iirc, access
>>> individual components of a vec4 indirectly like TGSI would be able to.
>>> To avoid TEMP arrays we'd have to disable the last phase of varying
>>> packing (that actually converts the code to using vec4s).
>>> It would still be able to assign fractional locations to guarantee that
>>> linkage works, but glsl-to-tgsi would likely have to split access at
>>> vec4 boundaries itself (more work), and declare the whole packed range
>>> as a single TGSI array.
>>> However, assuming that varyings with the *same* semantic can always be
>>> assigned to contiguous slots (output memory space locations) by the
>>> driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
>>> varyings), the problem in the example at the top shouldn't arise, and
>>> we're able to group all those into a single array.
>>>
>>>
>>> Now, I hope someone was able to get through this and would like to
>>> comment :)
>> Not sure I fully understand this, but I'm thinking "whenever in doubt,
>> use something close to what dx10 does" since that's likely going to work
>> reasonably with different hw. Maybe declaring those special values
>> differently (not just as output reg) would help?
> What DX10 does is make indirect access of varyings illegal. That's not
> possible with OpenGL ...
Hmm, I thought dcl_indexRange would be used for indirect access of varyings?

Roland
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
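
For reference, a minimal C sketch of the per-component addressing discussed in the quoted proposal: when an address register holds a byte address, an element of a float array can be located by arithmetic rather than by the (index & 3) if-chain. The struct and function names are hypothetical and are not Mesa or TGSI API. The sketch assumes 4-byte components packed contiguously into 16-byte vec4 registers, starting at a component-aligned base address.

```c
#include <assert.h>

/* Hypothetical illustration only; not part of Mesa or TGSI. */
struct scalar_slot {
    unsigned reg;         /* vec4 register index relative to the array base */
    unsigned comp;        /* component within that register: 0=x 1=y 2=z 3=w */
    unsigned byte_offset; /* byte address in output space */
};

/*
 * Locate element `index` of a float array packed into vec4 registers.
 * Assumes 4 bytes per component and 16 bytes per register, with the
 * array stored contiguously from `base_addr`.
 */
static struct scalar_slot locate_scalar(unsigned base_addr, unsigned index)
{
    struct scalar_slot s;

    assert(base_addr % 4 == 0);        /* base must be component-aligned */

    s.reg = index >> 2;                /* four components per register */
    s.comp = index & 3;                /* what the if-chain would test */
    s.byte_offset = base_addr + index * 4;
    return s;
}
```

With a base of 0x2c0 (borrowed from the quoted example purely for illustration), element 5 lands in register 1, component y, at byte address 0x2d4.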