On Thu, 2016-08-18 at 12:08 -0700, Francisco Jerez wrote: > Iago Toral <ito...@igalia.com> writes: > > > > > On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote: > > > > > > Iago Toral <ito...@igalia.com> writes: > > > > > > > > > > > > > > > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote: > > > > > > > > > > > > > > > Iago Toral Quiroga <ito...@igalia.com> writes: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 > > > > > > ++++++++++++++++++ > > > > > > 1 file changed, 18 insertions(+) > > > > > > > > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp > > > > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp > > > > > > index 1525a3d..4014020 100644 > > > > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp > > > > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp > > > > > > @@ -1497,6 +1497,24 @@ > > > > > > vec4_visitor::nir_emit_alu(nir_alu_instr > > > > > > *instr) > > > > > > emit(CMP(dst, op[0], brw_imm_f(0.0f), > > > > > > BRW_CONDITIONAL_NZ)); > > > > > > break; > > > > > > > > > > > > + case nir_op_d2b: { > > > > > > + /* two-argument instructions can't take 64-bit > > > > > > immediates */ > > > > > > + dst_reg zero = dst_reg(this, glsl_type::dvec4_type); > > > > > > + emit(MOV(zero, brw_imm_df(0.0))); > > > > > > + > > > > > > + dst_reg tmp = dst_reg(this, glsl_type::dvec4_type); > > > > > > + emit(CMP(tmp, op[0], src_reg(zero), > > > > > > BRW_CONDITIONAL_NZ)); > > > > > > + > > > > > > + /* Convert the double CMP result to a single boolean > > > > > > result. > > > > > > For that > > > > > > + * we take the low 32-bit chunk of each DF component > > > > > > in > > > > > > the > > > > > > result. > > > > > > + * and do a final MOV to honor the original > > > > > > writemask > > > > > > + */ > > > > > > + dst_reg result = dst_reg(this, > > > > > > glsl_type::bvec4_type); > > > > > > + emit(VEC4_OPCODE_PICK_LOW_32BIT, result, > > > > > > src_reg(tmp)); > > > > > > + emit(MOV(dst, src_reg(result))); > > > > > Couldn't you just do a single CMP instruction of the double- > > > > > precision > > > > > argument and a single-precision 0.0 immediate? > > > > It never occurred to me that we could do that but it seems to > > > > work > > > > just > > > > fine. > > > > > > > > > > > > > > > > > > > I think you could > > > > > potentially also use a 32bit destination type on the CMP > > > > > instruction > > > > > so > > > > > you don't need to emit the PICK_LOW+MOV instructions > > > > > afterwards. > > > > This does not seem to work though. > > > > > > > > > > > > > > > > > > > It may > > > > > hit an instruction decompression bug but it could be cleaned > > > > > up > > > > > by > > > > > the > > > > > SIMD lowering pass afterwards, AFAICT the result would be two > > > > > uncompressed instructions instead of the two uncompressed > > > > > plus > > > > > two > > > > > compressed instructions above. > > > > I tried splitting the instruction just in case. Notice that > > > > doing > > > > this > > > > requires that we improve the splitting pass to support > > > > splitting > > > > instructions that write only one register (the original series > > > > did > > > > not > > > > implement support for that because so far the only cared to > > > > split > > > > DF > > > > instructions that wrote more than one register). I implemented > > > > a > > > > couple > > > > of quick hacks to get this done so I could test this but the > > > > result > > > > is > > > > still incorrect. > > > > > > > > For reference, if we don't split, the resulting CMP looks like > > > > this > > > > (after full scalarization): > > > > > > > > cmp.nz.f0(8) g8<1>.xD g6<2,2,1>.xyxyDF 0F { align16 1Q }; > > > > cmp.nz.f0(8) g8<1>.yD g6<2,2,1>.zwzwDF 0F { align16 1Q }; > > > > cmp.nz.f0(8) g8<1>.zD g6.2<0,2,1>.xyxyDF 0F { align16 1Q }; > > > > cmp.nz.f0(8) g8<1>.wD g6.2<0,2,1>.zwzwDF 0F { align16 1Q }; > > > > > > > > And if we split the instruction in two we produce this: > > > > > > > > cmp.nz.f0(4) g6<1>.xD g1<0,2,1>.xyxyDF 0F { align16 1N > > > > }; > > > > cmp.nz.f0(4) g6<1>.yD g1<0,2,1>.zwzwDF 0F { align16 1N > > > > }; > > > > cmp.nz.f0(4) g6<1>.zD g1.2<0,2,1>.xyxyDF 0F { align16 1N > > > > }; > > > > cmp.nz.f0(4) g6<1>.wD g1.2<0,2,1>.zwzwDF 0F { align16 1N > > > > }; > > > > cmp.nz.f0(4) g6.4<1>.xD g1<0,2,1>.xyxyDF 0F { align16 2N > > > > }; > > > > cmp.nz.f0(4) g6.4<1>.yD g1<0,2,1>.zwzwDF 0F { align16 2N > > > > }; > > > > cmp.nz.f0(4) g6.4<1>.zD g1.2<0,2,1>.xyxyDF 0F { align16 2N > > > > }; > > > > cmp.nz.f0(4) g6.4<1>.wD g1.2<0,2,1>.zwzwDF 0F { align16 2N > > > > }; > > > > > > > > Both sequences of instructions look correct to me as far as the > > > > splitting and DF regioning go, so I guess the 32-bit CMP result > > > > is > > > > not > > > > doing what we want. > > > > > > > > I am not too surprised about this. Even if we use a 32b dst > > > > type I > > > > suppose the instruction execution type is still 64b so I guess > > > > there > > > > should be a 64b to 32b conversion involved in writing to a 32b > > > > result. > > > > Seeing how 64b to 32b conversions require that we emit specific > > > > code > > > > using ALIGN1 mode to deal with alignment requirements I guess > > > > it is > > > > not > > > > too surprising that this does not work, right? > > > > > > > Yeah, that's not too surprising, don't worry about it. Another > > > idea > > > would be to do something along the lines of 'MOV.nz null, <df > > > source > > > goes here>' to check whether the source is non-zero (so you avoid > > > loading the immediate which is a non-trivial operation on Gen7), > > > and > > > then use a single-precision SEL instruction (or two predicated > > > MOV > > > instructions) to write true or false to the destination (so you > > > don't > > > need the PICK_LOW+MOV sequence). > > Using MOV.nz seems to have a different behavior than CMP.nz. The > > former > > will consider anything other than 0.0 to be non-zero, whereas the > > CMP > > version will consider small enough DF values to be zero as well, > > which > > I think is the behavior we want (I think I remember discussing this > > here when we were implementing this for broadwell). > > > What do you mean by small enough? Denormalized?
Yes, denormalized values, sorry I wasn't more clear. > MOV instructions can > flush them to zero as well, only raw moves won't, you can make sure > that > the MOV is not raw by applying a source modifier to the source which > shouldn't affect the flag result for normalized sources. Oh, didn't know that, I can try this. > > > > Fortunately, we can do something like this: > > > > src_reg one = src_reg(this, glsl_type::ivec4_type); > > emit(MOV(dst_reg(one), brw_imm_d(~0))); > > > > vec4_instruction *inst = > > emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f), > > BRW_CONDITIONAL_NZ)); > > inst->regs_written = 1; > > > > inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0)); > > inst->predicate = BRW_PREDICATE_NORMAL; > > > > which is pretty much the same idea (notice that we use a float > > constant > > for the CMP) and keeps the original behavior. > > > Hm, I thought you said that using an F immediate on a double- > precision > instruction doesn't work? No, that seems to work fine, at least if the float constant is 0.0 it seems to produce correct results in all the cases I tested. What doesn't work is the part where we also make the CMP produce a 32-bit result. > > > > If you are curious about the inst->regs_written=1 in the CMP it is > > just because the vec4_constructor will figure out that because it > > has > > a DF destination it writes two registers, but it really doesn't > > because it is a null dst register. If we leave it at 2, then the > > simd > > lowering pass will split the CMP instruction because the 0.0f > > constant > > argument only reads one register. > That seems like a bug. There is no need to lower an instruction with > a > mismatch in the number of registers read and written if the source is > uniform: > > > > > When destination spans two registers, the source MUST span two > > registers. The exception to the above rule: > > > > - When source is scalar, the source registers are not > > incremented. > > - When source is packed integer Word and destination is > > packed integer DWord, the source register is not incremented > > but the source sub register is incremented. > The FS back-end implements both exceptions, but there's probably no > need > to implement the second one in the VEC4 back-end. > > > > > Maybe we should make vec4_instruction set only 1 register > > written for null dsts or just make the simd lowering pass ignore > > the > > number of registers written for instructions with a null dst? > > > > Iago > > > > > > > > > > > > > > > > > I can send you a trace if you want to have a look at it. > > > > > > > > Iago > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + break; > > > > > > + } > > > > > > + > > > > > > case nir_op_i2b: > > > > > > emit(CMP(dst, op[0], brw_imm_d(0), > > > > > > BRW_CONDITIONAL_NZ)); > > > > > > break; _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev