Connor Abbott <cwabbo...@gmail.com> writes:

> On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>> Connor Abbott <cwabbo...@gmail.com> writes:
>>
>>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>>> <topi.pohjolai...@intel.com> wrote:
>>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>>> In some cases, we need to emit ALU instructions with a certain stride
>>>>> due to a HW limitation. When splitting that instruction, we need to
>>>>> respect the original stride when creating the temporaries we load from
>>>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>>>> to work around.
>>>>>
>>>>> Signed-off-by: Connor Abbott <connor.w.abb...@intel.com>
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> index 812648f..386e9a2 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>>
>>>>>        if (depth == 0 &&
>>>>>            inst->dst.file == GRF &&
>>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>>> +          alloc.sizes[inst->dst.reg] ==
>>>>> +          inst->dst.component_size(inst->exec_size) &&
>>>>>            !inst->is_partial_write()) {
>>>>>           if (remap[dst] == -1) {
>>>>>              remap[dst] = dst;
>>>>>           } else {
>>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>>> +            remap[dst] =
>>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>>              inst->dst.reg = remap[dst];
>>>>>              progress = true;
>>>>>           }
>>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>>                    * temporary passed as source to the lowered instruction.
>>>>>                    */
>>>>>                   split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>> +                 split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>>
>>>> The fixes for the wider component size (64-bits) and the stride are clearly
>>>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>>>> caused us any problems before. That change is not needed just for doubles,
>>>> is it?
>>>
>>> I think we haven't noticed this before for two reasons:
>>>
>>> 1. It usually isn't needed for correctness.
>>> 2. We don't really use subreg_offset that much anyways.
>>>
>>> The splitting code is moving the inputs into a bunch of temporary
>>> registers, doing the split operation into a new bunch of temporaries,
>>> and then moving those temporaries back into the destination register.
>>> It's copying over the stride and subreg_offset fields from the
>>> original inputs and outputs, so when all is said and done, it's
>>> producing the same result and using the same sources. The difference
>>> is that the split instructions themselves weren't copying over those
>>> fields, meaning that they were always operating on packed registers
>>> (stride of 1 and subreg_offset of 0). This is normally ok, except
>>> where, due to a restriction, the split instruction *must* have the
>>> same stride/offset as the original instruction. For fp64, DF->F
>>> conversion must preserve the stride since we have to use a stride of 2
>>> for the output.
>>> The only thing I can think of that has to preserve the
>>> subreg_offset is math instructions, since the source offset has to be
>>> the same as the destination offset, but that should be a corner case
>>> and I'm not sure whether we'll ever get in a situation where a math
>>> instruction has a non-zero subreg_offset.
>>>
>> The SIMD lowering pass always uses a suboffset of zero for all
>> non-scalar sources of the lowered instructions.  I'm not aware of any
>> case in which that could be a problem.
>>
>>> But although I don't think it's necessary for correctness, I think
>>> it should help with register coalescing. For example, take
>>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>>> MOV's, which need to be split by this pass. Say we have something
>>> like:
>>>
>>> uint hi = unpackDouble2x32(input).y;
>>>
>>> in SIMD16 mode, this turns into:
>>>
>>> mov(16) hi.1<2>:UD input:DF
>>>
>> I don't think this is a valid implementation of unpackDouble2x32.
>> The partial write and offset destination shouldn't be necessary and as
>> you've already noticed it will prevent a number of optimization passes
>> from doing anything useful with your instruction.
>
> Right, I had my example backwards.  It's packDouble2x32 that you have
> to emit a MOV like that for.  It would be something like:
>
> mov(16) output.1<2>:UD input_y:UD
>
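As background, the GLSL semantics in question are purely a bit-level
repack: packDouble2x32 returns the double whose 32 least significant
bits come from the .x component and whose 32 most significant bits come
from .y, and unpackDouble2x32 is the inverse.  Each 64-bit lane holds
its low dword first, so the high halves land at the odd dword
positions, which is what the .1<2>:UD destination region above
expresses.  A standalone C++ sketch of just those semantics (the ref_*
names are made up; this is not the Mesa lowering):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative only: packDouble2x32/unpackDouble2x32 as a plain bit
// reinterpretation, matching the GLSL definition.
static double ref_packDouble2x32(uint32_t x, uint32_t y)
{
   const uint64_t bits = (uint64_t)y << 32 | x;
   double d;
   std::memcpy(&d, &bits, sizeof(d));
   return d;
}

static void ref_unpackDouble2x32(double d, uint32_t *x, uint32_t *y)
{
   uint64_t bits;
   std::memcpy(&bits, &d, sizeof(bits));
   *x = (uint32_t)bits;         // low dword of each lane (even positions)
   *y = (uint32_t)(bits >> 32); // high dword of each lane (odd positions)
}

int main()
{
   uint32_t x, y;
   ref_unpackDouble2x32(ref_packDouble2x32(0xdeadbeef, 0x12345678), &x, &y);
   std::printf("%08x %08x\n", (unsigned)x, (unsigned)y); // deadbeef 12345678
   return 0;
}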
And how exactly does it help the optimizer to introduce additional
partial writes?  AFAICT this should already be simplified without any
changes to the current copy propagation pass:

| mov(8) temp1:UD input:UD 1Q
| mov(8) temp2:UD input+1.0:UD 2Q
| mov(8) temp3<1>:UD temp1:UD 1Q
| mov(8) temp4<1>:UD temp2:UD 2Q
| mov(8) output+0.1<2>:UD temp3:UD WE_all
| mov(8) output+1.1<2>:UD temp4:UD WE_all

into:

| mov(8) output+0.1<2>:UD input:UD WE_all
| mov(8) output+1.1<2>:UD input+1.0:UD WE_all

However what you seem to be suggesting:

| mov(8) temp1:UD input:UD 1Q
| mov(8) temp2:UD input+1.0:UD 2Q
| mov(8) temp3+0.1<2>:UD temp1:UD 1Q
| mov(8) temp4+0.1<2>:UD temp2:UD 2Q
| mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
| mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all

is unlikely to be simplified further than:

| mov(8) temp3+0.1<2>:UD input:UD 1Q
| mov(8) temp4+0.1<2>:UD input+1.0:UD 2Q
| mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
| mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all

without introducing a whole new class of copy propagation via partial
writes, which is going to be dubiously useful in the long run if we're
planning to migrate the compiler back-end to SSA eventually.

I think ideally you would implement packDouble2x32 using an
SSA-friendly pseudo-instruction, like:

| PACK(16) dst:UQ src0:UD src1:UD

It would be cool if it could take as many sources as required to fill
the destination type, e.g. it would be useful for image_load_store to
be able to do:

| PACK(16) dst:UD src0:UB src1:UB src2:UB src3:UB

It could probably replace the similar but less general VEC4
instruction.  This is currently handled in the image packing and
unpacking code using bitwise arithmetic precisely to avoid partial
writes.

>>
>> I think you'd want something like (pre-SIMD lowering):
>>
>> mov(16) lo<1>:UD input.0<16;8,2>:UD
>> mov(16) hi<1>:UD input.1<16;8,2>:UD
>>
>>> But this needs to be split, and after the splitting pass (without
>>> fixing up the subreg_offset like this patch does) and lowering away
>>> all the LOAD_PAYLOAD's, it gets turned into something like:
>>>
>>> mov(8) temp1:DF input:DF 1Q
>>> mov(8) temp2:DF input+1.0:DF 2Q
>>> mov(8) temp3<2>:UD temp1:DF 1Q
>>> mov(8) temp4<2>:UD temp2:DF 2Q
>>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all // The WE_all is due to a hack
>>> in the SIMD lowering code
>>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>>>
>>> Copy prop should have no problem cleaning up the first two MOV's, but
>>> the last two are more of an issue -- we currently don't have a pass
>>> that can detect that you can offset all reads/writes of temp3 and
>>> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
>>> be really hard to write something like that without SSA. But if we
>>> pass through the offset while splitting it up, then things match up:
>>>
>>> mov(8) temp1:DF input:DF 1Q
>>> mov(8) temp2:DF input+1.0:DF 2Q
>>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>>
>>> and we should be able to coalesce away all the extra MOV's.
>>>
>>
>> I doubt this will be any easier to simplify than your first snippet.
>> You still have strided and offset writes of temp3-4 and hi, so AFAIK
>> both copy propagation and register coalesce are going to freak out in
>> both cases.
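The bitwise-arithmetic formulation mentioned above amounts to building
the packed value with shifts and ORs, so the destination is always
produced by whole writes rather than several strided partial writes.
A rough standalone illustration in plain C++ (made-up helper names, not
the actual image_load_store code), which is also roughly the shape a
PACK pseudo-instruction could be lowered to where strided moves aren't
wanted:

#include <cstdint>
#include <cstdio>

// Made-up helper: build a packed dword from four bytes with shifts and
// ORs, so the result is produced as a single full write rather than
// four strided byte writes.
static uint32_t pack_4x8(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
   return (uint32_t)b0 |
          (uint32_t)b1 << 8 |
          (uint32_t)b2 << 16 |
          (uint32_t)b3 << 24;
}

// The corresponding unpack is a shift and mask per component.
static uint8_t unpack_8(uint32_t v, unsigned i)
{
   return (uint8_t)(v >> (8 * i));
}

int main()
{
   const uint32_t v = pack_4x8(0x11, 0x22, 0x33, 0x44);
   std::printf("%08x %02x\n", (unsigned)v, (unsigned)unpack_8(v, 2)); // 44332211 33
   return 0;
}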
>
> It should be quite easy to handle this in copy propagation, though --
> allow partial writes (although consider them to write to the whole
> register for simplicity) and only propagate if stride/offset match.
> It's effectively doing the (more difficult) first part of coalescing
> away the MOV, changing the offset and stride of temp3, for you.
>

I don't think it's about it being difficult to implement...  In fact a
couple of weeks ago I changed the copy propagation pass to handle
arbitrary strided copies.  But after some thought I concluded it wasn't
worth sending for review because we probably want to eliminate most
cases of partial writes from the back-end in the long run anyway, so it
seemed like unnecessary complexity.

>> IMO we want to avoid partial writes as much as possible,
>> and introduce them as late as possible in the optimization loop.  I
>> think it would be easiest if the visitor wouldn't emit any partial
>> writes at all -- even, say, on CHV/BXT with the register alignment
>> limitation on instructions with 64-bit execution types in place.  Later
>> on we could run a register alignment lowering pass that would legalize
>> the strides where they aren't supported natively by the hardware, that
>> way you would:
>>
>>  - Be able to implement these restrictions in one place rather than
>>    having to take them into account anywhere you emit or manipulate
>>    instructions of 64-bit execution type.
>>
>>  - Emit partial writes in cases where you hit some hardware limitation
>>    *only* -- As you're doing it here the SIMD lowering pass will emit
>>    partial writes for all instructions that have a strided source (and
>>    emit additional and most often unnecessary partial writes for
>>    instructions that are already a partial write).  This breaks the
>>    assumption of this pass that the zipping and unzipping copies are
>>    cheap to copy-propagate from, and will definitely increase the
>>    instruction count when this lowering pass is used, regardless of
>>    whether the lowered instructions actually have any of the hardware
>>    limitations you are trying to work around.
>
> I think we can just drop the part that sets the source stride. AFAIK
> there aren't any double operations I'm emitting that need a source
> with a special stride due to HW limitations; I just put it there for
> consistency. I don't think it's actually used, and as you said it's
> probably harmful. We still need to keep it around for the
> destinations, since there we're going to emit partial writes anyways,
> so we might as well emit partial writes that we have a chance to
> coalesce. We could add a special lowering pass for DF->F, since that's
> the only operation that requires special handling wrt strides, but
> that still doesn't solve the coalescing problem with packDouble2x32.

As I said earlier this should probably be simplified by copy
propagation.

> It also won't solve most of the problems you mentioned, since when we
> do the lowering to create the partial write, we're still going to emit
> a second MOV, and we're still going to want to coalesce it away for
> everything where we can, so copy propagation still needs to understand
> all the stride restrictions, and those restrictions are where most of
> the fixes were.
But you do realize that it's only because of your proposed "fix" for
copy propagation to propagate via partial writes that it would then
un-do your lowered strides, which is what then forces you to make it
aware of these hardware restrictions -- Otherwise if you just
implemented them in a lowering pass and then ran copy propagation, it
wouldn't even consider the partial writes used to legalize the strides
as valid propagation candidates AFAIK.

> I didn't have to fix anything to understand that a DF->F move needs to
> have a destination stride of 2, besides this code -- nothing else
> touches it, so the places where I needed to take stride into account
> when manipulating instructions of 64-bit type were here and when
> emitting the special MOV for DF->F. So in the end, adding a stride
> legalization pass would be more code than what I have now, and for not
> much benefit.
>
>>
>> And BTW, this patch doesn't take into account the larger number of VGRFs
>> you would need to hold the temporaries in strided and offset form.
>
> It does for destinations but not sources, since we were already using
> regs_written for destinations and over-allocating anyways.
>

No, it's not, it allocates just enough space for the number of vector
components times the lowered width.

>>
>>> I should probably put some of this in the commit message...
>>>
>>> Also, I realized that this patch should really be split in two, since
>>> the fixes to register renaming are independent of this -- whoops!
>>>
>>>>
>>>>> +                 split_inst.src[j].stride = inst->src[j].stride;
>>>>>                   emit_transpose(lbld.group(copy_width, 0),
>>>>>                                  split_inst.src[j], &src, 1, src_size, n);
>>>>>                }
>>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>>              /* Allocate enough space to hold the result of the lowered
>>>>>               * instruction and fix up the number of registers written.
>>>>>               */
>>>>> -            split_inst.dst = dsts[i] =
>>>>> -               lbld.vgrf(inst->dst.type, dst_size);
>>>>> +            fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>>> +            dst.stride = inst->dst.stride;
>>>>> +            dst.subreg_offset = inst->dst.subreg_offset;
>>>>> +            split_inst.dst = dsts[i] = dst;
>>>>>              split_inst.regs_written =
>>>>>                 DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>                              inst->exec_size);
>>>>> --
>>>>> 2.4.3
>>>>>
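On the allocation-size point above: a GRF is 32 bytes, so a packed
SIMD8 dword component fits in one register, while the same component
kept in strided/offset form covers two.  A tiny standalone sketch of
that footprint arithmetic (the helper and the simplified formula are
mine, not the fs_reg or register-allocator API):

#include <cstdio>

// Hypothetical helper, not the Mesa API: bytes covered by one SIMD
// component of `exec_size` channels of `type_size`-byte elements,
// written with the given stride and byte offset, rounded up to
// 32-byte GRFs.
static unsigned grfs_for_component(unsigned exec_size, unsigned type_size,
                                   unsigned stride, unsigned subreg_offset)
{
   const unsigned reg_size = 32; // one GRF is 32 bytes
   const unsigned end = subreg_offset +
                        ((exec_size - 1) * stride + 1) * type_size;
   return (end + reg_size - 1) / reg_size;
}

int main()
{
   // SIMD8 UD temporary, packed: 8 * 4 bytes = 32 bytes -> 1 GRF.
   std::printf("%u\n", grfs_for_component(8, 4, 1, 0));
   // Same temporary kept in strided/offset form (stride 2, offset 4):
   // covers bytes 4..64 -> 2 GRFs.
   std::printf("%u\n", grfs_for_component(8, 4, 2, 4));
   return 0;
}

Which is why temporaries that inherit a stride and sub-register offset
need a correspondingly larger VGRF allocation than the packed size
would suggest.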