On 02.12.2016 19:46, Roland Scheidegger wrote:
On 02.12.2016 at 18:23, Nicolai Hähnle wrote:
On 30.11.2016 21:37, Roland Scheidegger wrote:
On 30.11.2016 at 20:19, Nicolai Hähnle wrote:
On 30.11.2016 19:06, Roland Scheidegger wrote:
On 30.11.2016 at 14:35, Nicolai Hähnle wrote:
From: Nicolai Hähnle <nicolai.haeh...@amd.com>

This is for geometry shader outputs. Without it, drivers have no way of
knowing which stream each output is intended for, and have to
conservatively write all outputs to all streams.

Separate stream numbers for each component are required due to output
packing.
Are you sure this is true?
This is an area I don't know much about, but
https://www.opengl.org/wiki/Layout_Qualifier_(GLSL)

tells me "Stream
assignments for a geometry shader are required to be the same for all
members of a block, but offsets are not."

Therefore I don't think output packing should ever happen across
multiple streams. I think it would be MUCH nicer if the semantic needed
just one stream member...

There are two variants of that question, I guess.

The answer to the first variant is: Yes, this is currently true.
lower_packed_varyings will happily pack outputs from different vertex
streams into the same vec4. This affects quite a lot of programs, e.g.
you see it in piglit arb_gpu_shader5-xfb-streams.
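
To illustrate why one stream number per component is needed once a
single vec4 can hold outputs of different streams, here is a minimal
sketch in C. The struct and function names (output_semantic,
semantic_stream, semantic_stream_mask) are hypothetical and only model
the "StreamXYZW" idea from this series; this is not the actual TGSI
declaration.

```c
#include <assert.h>

/* Hypothetical stand-in for an output semantic: one 2-bit stream
 * number per component (4 vertex streams -> values 0..3). */
struct output_semantic {
   unsigned stream_x : 2;
   unsigned stream_y : 2;
   unsigned stream_z : 2;
   unsigned stream_w : 2;
};

/* Return the stream that component `chan` (0 = x ... 3 = w) belongs to. */
static unsigned
semantic_stream(struct output_semantic sem, unsigned chan)
{
   assert(chan < 4);
   switch (chan) {
   case 0:  return sem.stream_x;
   case 1:  return sem.stream_y;
   case 2:  return sem.stream_z;
   default: return sem.stream_w;
   }
}

/* Mask of the components of one packed vec4 that feed `stream`. */
static unsigned
semantic_stream_mask(struct output_semantic sem, unsigned stream)
{
   unsigned mask = 0;

   for (unsigned chan = 0; chan < 4; chan++) {
      if (semantic_stream(sem, chan) == stream)
         mask |= 1u << chan;
   }
   return mask;
}
```

With .xy assigned to stream 0 and .zw to stream 1, a driver using this
model would only need to write .xy when emitting to stream 0, instead
of conservatively writing the whole register to every stream.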

The second question is: Do we want it to be true? I agree that it would
be convenient to be able to use a single Stream member. Also, isolating
the stream0 components from the rest would lead to slightly more
efficient shaders for us in some cases.

I opted against it so far because I didn't want to think through the
implications of changing lower_packed_varyings. The main question I have
is: if you account for the size of the GS output in # of components,
then it could happen that the number of output vec4s ends up being
larger than (max # of output components) / 4. Will that be a problem
somewhere?
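
The slot-count concern can be made concrete with a small sketch. It
assumes the only packing constraint is the stream (real packing has
further constraints, which are ignored here), and the function names
are made up for illustration.

```c
#include <assert.h>

#define MAX_STREAMS 4

/* vec4 slots needed when components of all streams may share a slot. */
static unsigned
slots_packed_across_streams(const unsigned comps[MAX_STREAMS])
{
   unsigned total = 0;

   for (unsigned s = 0; s < MAX_STREAMS; s++)
      total += comps[s];
   return (total + 3) / 4;
}

/* vec4 slots needed when each slot may only hold components of a
 * single stream: every stream rounds up to a whole vec4 on its own. */
static unsigned
slots_packed_per_stream(const unsigned comps[MAX_STREAMS])
{
   unsigned slots = 0;

   for (unsigned s = 0; s < MAX_STREAMS; s++)
      slots += (comps[s] + 3) / 4;
   return slots;
}
```

For example, with 5 components on stream 0 and 1 component on each of
the other three streams (8 components in total), packing across streams
needs 2 vec4s, while packing per stream needs 2 + 1 + 1 + 1 = 5. The
component count is unchanged, but the vec4 count exceeds
(# of components) / 4.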

I don't know if that would be a problem, but if it is I'd assume this
would be fixable (since the number of actual components ultimately
doesn't change).
Having outputs belonging to multiple streams in a single output just
seems weird...
That said, I wonder if it actually would be possible to do that with
d3d11 too.
With shader model 5 you'd have:
dcl_stream 0
dcl_output o0.xy
dcl_stream 1
dcl_output o0.zw // legal or not???

Though the shader model 4/5 rules are a bit weird for packing
inputs/outputs, I'm not even sure two dcl_output are legal for the same
reg without a dcl_stream in between them (but you can pack system values
together with ordinary inputs/outputs).

So maybe just allowing this is the right solution...

I played around with the DX shader compiler, and I have some annoying
news. SM5 actually uses not just the same output register but even the
same component for multiple streams -- see the output I've pasted at the
end.

So how to proceed? To simplify things going forward, I'm mostly
convinced that the GLSL output packing should be changed to pack outputs
by stream. As I mentioned previously, this has other minor advantages
for us anyway.

Then one possibility to accommodate SM5 would be to have a Stream
bitmask, one bit per stream, as part of the output semantics. The
downside of this is that I wanted to use the WriteMask as an additional
optimization to avoid writing out unused components, and you'd then need
separate WriteMasks for each stream.

The other possibility, which I prefer, would be to have just a single
Stream field indicating one stream number per output register, and
aliasing is just not allowed despite what SM5 wants.

I have to go back on that unfortunately: I forgot that it's possible to create location aliasing across vertex streams via ARB_enhanced_layouts. I looked hard and found nothing in the spec that would forbid it, and our closed source driver also allows it.

So my plan now is to leave the StreamXYZW stuff as is. I will send around a v2 of this series to account for this use case (because there's still a problem in the GLSL-to-TGSI translation), plus some radeonsi-specific additions.

I'm also going to send a piglit test around that exercises this.

Cheers,
Nicolai


TGSI -> SM5 conversion is trivial.

SM5 -> TGSI conversion is also possible despite the aliasing on the DX
side, because the doc says this about emit_stream: "Af[t]er the emit,
all data in all output registers for all streams become uninitialized,
not just the stream emitted to."
Oh that's pretty interesting, since emit didn't have that part about
outputs becoming uninitialized. Maybe that's just what was needed to
keep implementations sane when allowing the crazy "same output multiple
stream" stuff... Or I suppose it's not actually that crazy then...


(https://msdn.microsoft.com/en-us/library/windows/desktop/hh447051(v=vs.85).aspx
). So you have to look-ahead to the next emit_stream for disambiguation,
but it's clearly doable.
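
A rough sketch of that look-ahead, assuming straight-line code without
control flow and a heavily simplified, hypothetical instruction
representation (struct instr and resolve_output_streams are not real
Mesa or D3D names). Because every output is uninitialized after an
emit_stream, each output write can only contribute to the next
emit_stream that follows it, so a single backward walk is enough:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified instruction kinds. */
enum op { OP_WRITE_OUTPUT, OP_EMIT_STREAM };

struct instr {
   enum op op;
   unsigned reg;   /* OP_WRITE_OUTPUT: output register index */
   unsigned mask;  /* OP_WRITE_OUTPUT: written components */
   int stream;     /* OP_EMIT_STREAM: emitted stream;
                    * OP_WRITE_OUTPUT: filled in by the resolver */
};

/* Tag every output write with the stream of the next emit_stream that
 * follows it. Writes with no later emit_stream get -1 (never emitted). */
static void
resolve_output_streams(struct instr *instrs, size_t count)
{
   int next_stream = -1;

   /* Walking backwards makes "the next emit" the one seen last. */
   for (size_t i = count; i-- > 0;) {
      if (instrs[i].op == OP_EMIT_STREAM)
         next_stream = instrs[i].stream;
      else
         instrs[i].stream = next_stream;
   }
}
```

Applied to the shader pasted at the end of this mail, the first
"mov o0.xy" would be tagged with stream 0 and the second with stream 1,
which is the information a per-output stream annotation needs.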

Any objections to that approach?
Sounds good to me. I agree it would be complicated for tgsi to do what
sm5 wants directly - there's other stuff we already have to translate
anyway there (the packing of system values and ordinary inputs/outputs).
I suppose when we need this we could just use multiple outputs
instead of one when they share a stream.

Roland



Thanks,
Nicolai
---
//
// Generated by Microsoft (R) HLSL Shader Compiler 10.0.10011.16384
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// SV_POSITION              0   xyzw        0      POS   float   xyzw
// TEXCOORD                 0   xyz         1     NONE   float
// TEXCOORD                 1   xy          2     NONE   float
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// m0:TEXCOORD              0   x           0     NONE   float   x
// m0:TEXCOORD              1    y          0     NONE   float    y
// m1:TEXCOORD              0   x           0     NONE   float   x
// m1:TEXCOORD              1    y          0     NONE   float    y
//
gs_5_0
dcl_globalFlags refactoringAllowed
dcl_input_siv v[3][0].xyzw, position
dcl_input v[3][1].xyz
dcl_input v[3][2].xy
dcl_inputprimitive triangle
dcl_stream m0
dcl_outputtopology pointlist
dcl_output o0.x
dcl_output o0.y
dcl_stream m1
dcl_outputtopology pointlist
dcl_output o0.x
dcl_output o0.y
dcl_maxout 12
mov o0.xy, v[0][0].xzxx
emit_stream m0
mov o0.xy, v[1][0].ywyy
emit_stream m1
ret
// Approximately 5 instruction slots used

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
