Kenneth Graunke <kenn...@whitecape.org> writes: > Previously, the scalar TCS backend was generating: > > mov(8) g17<1>UD 0x00000000UD { align1 WE_all 1Q compacted }; > and(8) g17.2<1>UD g0.2<0,1,0>UD 0x0001e000UD { align1 WE_all 1Q }; > shl(8) g17.2<1>UD g17.2<8,8,1>UD 0x0000000bUD { align1 WE_all 1Q }; > or(8) g17.2<1>UD g17.2<8,8,1>UD 0x00008200UD { align1 WE_all 1Q }; > send(8) null<1>UW g17<8,8,1>UD > gateway (barrier msg) mlen 1 rlen 0 { align1 WE_all 1Q }; > > This is rubbish - g17.2<8,8,1>UD spans two registers, and is an illegal > region. Not to mention it clobbers 8 channels of data when we only > wanted to touch m0.2. > > Instead, we want: > > mov(8) g17<1>UD 0x00000000UD { align1 WE_all 1Q compacted }; > and(1) g17.2<1>UD g0.2<0,1,0>UD 0x0001e000UD { align1 WE_all }; > shl(1) g17.2<1>UD g17.2<0,1,0>UD 0x0000000bUD { align1 WE_all }; > or(1) g17.2<1>UD g17.2<0,1,0>UD 0x00008200UD { align1 WE_all }; > send(8) null<1>UW g17<8,8,1>UD > gateway (barrier msg) mlen 1 rlen 0 { align1 WE_all 1Q }; > > Using component() accomplishes this. > > Fixes GL44-CTS.tessellation_shader.tessellation_shader_tc_barriers. > barrier_guarded_read_write_calls on Skylake. Probably fixes other > barrier issues on Gen8+. > > Cc: mesa-sta...@lists.freedesktop.org > Signed-off-by: Kenneth Graunke <kenn...@whitecape.org> > --- > src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp > b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp > index df033e1..bb254d6 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp > @@ -2420,7 +2420,7 @@ fs_visitor::nir_emit_tcs_intrinsic(const fs_builder > &bld, > break; > > fs_reg m0 = bld.vgrf(BRW_REGISTER_TYPE_UD, 1); > - fs_reg m0_2 = byte_offset(m0, 2 * sizeof(uint32_t)); > + fs_reg m0_2 = component(m0, 2); > I guess this works because component() sets the stride of the destination to zero (which is not a real destination stride supported by the hardware), which in turns causes brw_reg_from_fs_reg() to emit a 1-wide region for the destination, which in turn causes the exec size hack in brw_set_dest() to kick in and override the execution size of the instruction, ignoring the original execution size coming in from the IR.
Even though that may give you the assembly you want, it seems rather fragile to rely on a number non-obvious things for the assembly to be correct, and this still leaves the exec_size field of the IR instruction equal to the shader's dispatch width (which isn't going to be the real execution size of the instruction), so the rest of the optimizer may not treat this instruction the way you expected even if the final assembly looks correct. > const fs_builder fwa_bld = bld.exec_all(); > How about you make this a scalar builder in addition (like bld.exec_all().group(1, 0))? (using component() instead of byte_offset() above still seems like a positive change for the sake of readability) > -- > 2.9.3 > > _______________________________________________ > mesa-stable mailing list > mesa-sta...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-stable
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev