On Thu, Aug 24, 2017 at 6:54 AM, Alejandro Piñeiro <apinhe...@igalia.com> wrote:
> From: Jose Maria Casanova Crespo <jmcasan...@igalia.com>
>
> From Skylake PRM, Volume 07 3D Media GPGPU, Section Register Region
> Restrictions, SubSection 5. Special Cases for Word Operations (page
> 781):
>
>    "There is a relaxed alignment rule for word destinations. When the
>     destination type is word (UW, W, HF), destination data types can
>     be aligned to either the lowest word or the second lowest word of
>     the execution channel. This means the destination data words can
>     be either all in the even word locations or all in the odd word
>     locations.
>
>     // Example:
>     add (8) r10.0<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
>     add (8) r10.1<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
>     // Note: The destination offset may be .0 or .1 although the
>     //       destination subregister is required to be aligned to
>     //       execution datatype."
>
> In practice, that means that for alu operations like float to half
> float conversions, we need to align the 16-bit to 32-bit (so setting
> the stride to 2).

I believe your application of this rule is a bit imprecise.  My
understanding of the hardware is that the rule stated above is a
modification to the following rule from the previous page:

> When the Execution Data Type is wider than the destination data type,
> the destination must be aligned as required by the wider execution
> data type and specify a HorzStride equal to the ratio in sizes of the
> two data types.  For example, a mov with a D source and B destination
> must use a 4-byte aligned destination and a Dst.HorzStride of 4.

In other words, the rule you stated above only applies in the case of
ALU operations with a source type wider than the destination.  If all
operands to an ALU operation are word types, the destination can be
tightly packed.  We have a decent bit of code in the driver today which
does ALU operations that write to word destinations with a stride of 1;
all of it is done with W or DW and not HF, but I believe the same
principle applies to HF.
In particular, this means that basically the only time we care about
this rule is for nir_op_f2f16 and f2f32 (with a 64-bit source).  The
rest of the ALU operations should be fine as-is.  (For nir_op_f2f32
with a 16-bit source, the destination is wider than the source, so the
restriction does not apply.)

I'd prefer we let 16-bit values be tightly packed by default and
implement f32->f16 conversions as

mov (8) r10.0<2>:hf r11.0<8;8,1>:f
mov (8) r12.0<1>:hf r10.0<16;8,2>:hf

The optimizer can probably sort things out for us and get rid of the
second mov.

--Jason

> As a first approach, we are setting the stride to 2 in general, even
> if it is not an alu operation, as it simplifies handling with 16-bit
> data types.  We can revisit this in the future.
>
> Signed-off-by: Jose Maria Casanova Crespo <jmcasan...@igalia.com>
> Signed-off-by: Alejandro Piñeiro <apinhe...@igalia.com>
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 35 +++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp
> index e4eba1401f8..c469baf42a1 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -585,6 +585,21 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)
>        result.type = brw_type_for_nir_type(devinfo,
>           (nir_alu_type)(nir_op_infos[instr->op].output_type |
>                          nir_dest_bit_size(instr->dest.dest)));
> +      /* According to the PRM, Volume 07 3D Media GPGPU, Section
> +       * Register Region Restrictions, SubSection 5. Special Cases for
> +       * Word Operations, GEN >= 8 can only write 16-bit (UW, HF, W)
> +       * registers on the lower or the higher word of each double word,
> +       * with the same offset for all 16-bit elements of the register.
> +       * As a consequence, all destinations of 16-bit registers should
> +       * use a stride equal to 2.
> +       *
> +       * We also need alignment to 32-bit for conversion operations
> +       * from 32-bit to 16-bit, so this is also useful for those
> +       * operations, which are the only alu operations supported now.
> +       */
> +
> +      if (nir_dest_bit_size(instr->dest.dest) == 16)
> +         result.stride = 2;
>
>     fs_reg op[4];
>     for (unsigned i = 0; i < nir_op_infos[instr->op].num_inputs; i++) {
> @@ -594,6 +609,9 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)
>                                  nir_src_bit_size(instr->src[i].src)));
>        op[i].abs = instr->src[i].abs;
>        op[i].negate = instr->src[i].negate;
> +      /* The previous comment also applies to the source elements. */
> +      if (nir_src_bit_size(instr->src[i].src) == 16)
> +         op[i].stride = 2;
>     }
>
>     /* We get a bunch of mov's out of the from_ssa pass and they may still
> @@ -2083,6 +2101,9 @@ fs_visitor::emit_gs_input_load(const fs_reg &dst,
>        first_component = first_component / 2;
>     }
>
> +   if (type_sz(tmp_dst.type) == 2)
> +      tmp_dst.stride = 2;
> +
>     for (unsigned iter = 0; iter < num_iterations; iter++) {
>        if (offset_const) {
>           /* Constant indexing - use global offset. */
> @@ -2414,6 +2435,9 @@ fs_visitor::nir_emit_tcs_intrinsic(const fs_builder &bld,
>        dst = tmp;
>     }
>
> +   if (type_sz(dst.type) == 2)
> +      dst.stride = 2;
> +
>     for (unsigned iter = 0; iter < num_iterations; iter++) {
>        if (indirect_offset.file == BAD_FILE) {
>           /* Constant indexing - use global offset. */
> @@ -2725,6 +2749,9 @@ fs_visitor::nir_emit_tes_intrinsic(const fs_builder &bld,
>        first_component = first_component / 2;
>     }
>
> +   if (type_sz(dest.type) == 2)
> +      dest.stride = 2;
> +
>     fs_inst *inst;
>     if (indirect_offset.file == BAD_FILE) {
>        /* Arbitrarily only push up to 32 vec4 slots worth of data,
> @@ -2735,7 +2762,7 @@ fs_visitor::nir_emit_tes_intrinsic(const fs_builder &bld,
>        fs_reg src = fs_reg(ATTR, imm_offset / 2, dest.type);
>        for (int i = 0; i < instr->num_components; i++) {
>           unsigned comp = 16 / type_sz(dest.type) * (imm_offset % 2) +
> -                        i + first_component;
> +                        i * dest.stride + first_component;
>           bld.MOV(offset(dest, bld, i), component(src, comp));
>        }
>        tes_prog_data->base.urb_read_length =
> @@ -3182,6 +3209,9 @@ fs_visitor::nir_emit_fs_intrinsic(const fs_builder &bld,
>        num_components *= 2;
>     }
>
> +   if (type_sz(dest.type) == 2)
> +      dest.stride = 2;
> +
>     for (unsigned int i = 0; i < num_components; i++) {
>        struct brw_reg interp = interp_reg(base, component + i);
>        interp = suboffset(interp, 3);
> @@ -3576,6 +3606,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>     if (nir_intrinsic_infos[instr->intrinsic].has_dest)
>        dest = get_nir_dest(instr->dest);
>
> +   if (instr->dest.is_ssa && nir_dest_bit_size(instr->dest) == 16)
> +      dest.stride = 2;
> +
>     switch (instr->intrinsic) {
>     case nir_intrinsic_atomic_counter_inc:
>     case nir_intrinsic_atomic_counter_dec:
> --
> 2.11.0
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>