Connor Abbott <cwabbo...@gmail.com> writes:

> On Tue, Aug 18, 2015 at 2:29 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>> Connor Abbott <cwabbo...@gmail.com> writes:
>>
>>> On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>>>> Connor Abbott <cwabbo...@gmail.com> writes:
>>>>
>>>>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>>>>> <topi.pohjolai...@intel.com> wrote:
>>>>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>>>>> In some cases, we need to emit ALU instructions with a certain
>>>>>>> stride due to a HW limitation. When splitting such an instruction,
>>>>>>> we need to respect the original stride when creating the
>>>>>>> temporaries we load from and store into. Otherwise, we'll
>>>>>>> reintroduce the problem we were trying to work around.
>>>>>>>
>>>>>>> Signed-off-by: Connor Abbott <connor.w.abb...@intel.com>
>>>>>>> ---
>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> index 812648f..386e9a2 100644
>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>>>>
>>>>>>>        if (depth == 0 &&
>>>>>>>            inst->dst.file == GRF &&
>>>>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>>>>> +          alloc.sizes[inst->dst.reg] ==
>>>>>>> +          inst->dst.component_size(inst->exec_size) &&
>>>>>>>            !inst->is_partial_write()) {
>>>>>>>          if (remap[dst] == -1) {
>>>>>>>             remap[dst] = dst;
>>>>>>>          } else {
>>>>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>>>>> +            remap[dst] =
>>>>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>>>>             inst->dst.reg = remap[dst];
>>>>>>>             progress = true;
>>>>>>>          }
>>>>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>>>>                  * temporary passed as source to the lowered
>>>>>>>                  * instruction.
>>>>>>>                  */
>>>>>>>                 split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>>>> +               split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>>>>
>>>>>> The fixes for the wider component size (64 bits) and the stride are
>>>>>> clearly needed for doubles. I'm wondering, though, why the
>>>>>> sub-register offset hasn't caused us any problems before. That
>>>>>> change is not needed just for doubles, is it?
>>>>>
>>>>> I think we haven't noticed this before for two reasons:
>>>>>
>>>>> 1. It usually isn't needed for correctness.
>>>>> 2. We don't really use subreg_offset that much anyway.
>>>>>
>>>>> The splitting code moves the inputs into a bunch of temporary
>>>>> registers, does the split operation into a new bunch of temporaries,
>>>>> and then moves those temporaries back into the destination register.
>>>>> It copies the stride and subreg_offset fields over from the original
>>>>> inputs and outputs, so when all is said and done, it produces the
>>>>> same result and uses the same sources. The difference is that the
>>>>> split instructions themselves weren't copying over those fields,
>>>>> meaning that they were always operating on packed registers (stride
>>>>> of 1 and subreg_offset of 0). This is normally OK, except where, due
>>>>> to a restriction, the split instruction *must* have the same
>>>>> stride/offset as the original instruction.
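As an aside on the opt_register_renaming hunk above: the old
inst->exec_size / 8 only counts GRFs correctly for packed 32-bit
destinations. Here is a standalone sketch of the kind of computation a
component_size()-style helper has to do instead -- illustrative only,
not the actual Mesa implementation:

  #include <algorithm>

  enum { REG_SIZE = 32 }; /* bytes per GRF on Gen hardware */

  /* GRFs occupied by a destination region: exec_size channels of
   * type_size bytes each, spaced 'stride' elements apart.  A packed
   * SIMD8 float dst is 8 * 4 * 1 / 32 = 1 GRF, matching the old
   * exec_size / 8; but a SIMD8 double dst is 8 * 8 * 1 / 32 = 2 GRFs,
   * and a stride-2 dword dst (as needed for DF->F conversions) is
   * 8 * 4 * 2 / 32 = 2 GRFs as well.
   */
  static unsigned component_size_regs(unsigned exec_size,
                                      unsigned type_size,
                                      unsigned stride)
  {
     const unsigned bytes = exec_size * type_size * std::max(stride, 1u);
     return (bytes + REG_SIZE - 1) / REG_SIZE; /* DIV_ROUND_UP */
  }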
>>>>> For fp64, the DF->F conversion must preserve the stride, since we
>>>>> have to use a stride of 2 for the output. The only thing I can think
>>>>> of that has to preserve the subreg_offset is math instructions,
>>>>> since the source offset has to be the same as the destination
>>>>> offset, but that should be a corner case and I'm not sure whether
>>>>> we'll ever get into a situation where a math instruction has a
>>>>> non-zero subreg_offset.
>>>>>
>>>> The SIMD lowering pass always uses a suboffset of zero for all
>>>> non-scalar sources of the lowered instructions. I'm not aware of any
>>>> case in which that could be a problem.
>>>>
>>>>> Although I don't think it's necessary for correctness, I think it
>>>>> should help with register coalescing. For example, take
>>>>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>>>>> MOVs, which need to be split by this pass. Say we have something
>>>>> like:
>>>>>
>>>>> uint hi = unpackDouble2x32(input).y;
>>>>>
>>>>> In SIMD16 mode, this turns into:
>>>>>
>>>>> mov(16) hi.1<2>:UD input:DF
>>>>>
>>>> I don't think this is a valid implementation of unpackDouble2x32.
>>>> The partial write and offset destination shouldn't be necessary, and
>>>> as you've already noticed, they will prevent a number of optimization
>>>> passes from doing anything useful with your instruction.
>>>
>>> Right, I had my example backwards. It's packDouble2x32 that you have
>>> to emit a MOV like that for. It would be something like:
>>>
>>> mov(16) output.1<2>:UD input_y:UD
>>>
>> And how exactly does it help the optimizer to introduce additional
>> partial writes? AFAICT this should already be simplified without any
>> changes to the current copy propagation pass:
>>
>> | mov(8) temp1:UD input:UD 1Q
>> | mov(8) temp2:UD input+1.0:UD 2Q
>> | mov(8) temp3<1>:UD temp1:UD 1Q
>> | mov(8) temp4<1>:UD temp2:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4:UD WE_all
>>
>> into:
>>
>> | mov(8) output+0.1<2>:UD input:UD WE_all
>> | mov(8) output+1.1<2>:UD input+1.0:UD WE_all
>
> I think you're right, since we happen to be dealing with MOVs that can
> be coalesced away.
>
>> However, what you seem to be suggesting:
>>
>> | mov(8) temp1:UD input:UD 1Q
>> | mov(8) temp2:UD input+1.0:UD 2Q
>> | mov(8) temp3+0.1<2>:UD temp1:UD 1Q
>> | mov(8) temp4+0.1<2>:UD temp2:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>
>> is unlikely to be simplified further than:
>>
>> | mov(8) temp3+0.1<2>:UD input:UD 1Q
>> | mov(8) temp4+0.1<2>:UD input+1.0:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>
>> without introducing a whole new class of copy propagation via partial
>> writes, which is of dubious usefulness in the long run if we're
>> planning to migrate the compiler back-end to SSA eventually.
>>
>> I think ideally you would implement packDouble2x32 using an
>> SSA-friendly pseudo-instruction, like:
>>
>> | PACK(16) dst:UQ src0:UD src1:UD
>>
>> It would be cool if it could take as many sources as required to fill
>> the destination type, e.g. it would be useful for image_load_store to
>> be able to do:
>>
>> | PACK(16) dst:UD src0:UB src1:UB src2:UB src3:UB
>>
>> It could probably replace the similar but less general VEC4
>> instruction. This is currently handled in the image packing and
>> unpacking code using bitwise arithmetic, precisely to avoid partial
>> writes.
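For what it's worth, here is a rough standalone sketch of how such a
PACK pseudo-instruction could be lowered into interleaved moves late in
compilation. Every name in it (reg, emit_mov, lower_pack) is
hypothetical -- this is only the shape of the idea, not the real fs IR:

  #include <cstdio>
  #include <vector>

  /* Toy register descriptor: base number, element size in bytes,
   * stride in elements, and sub-register byte offset. */
  struct reg {
     int nr;
     unsigned type_size;
     unsigned stride;
     unsigned suboffset;
  };

  static void emit_mov(const reg &dst, const reg &src)
  {
     std::printf("mov r%d.%u<%u> r%d.%u<%u>\n",
                 dst.nr, dst.suboffset, dst.stride,
                 src.nr, src.suboffset, src.stride);
  }

  /* Lower PACK dst, src0..srcN-1 (dst.type_size == N * source
   * type_size) into N interleaved moves.  The partial writes come into
   * existence only here, after the optimization loop is done with the
   * partial-write-free form. */
  static void lower_pack(const reg &dst, const std::vector<reg> &srcs)
  {
     const unsigned n = (unsigned)srcs.size();
     for (unsigned i = 0; i < n; i++) {
        reg d = dst;
        d.type_size = srcs[i].type_size; /* retype dst to source type */
        d.stride = dst.stride * n;       /* step over other components */
        d.suboffset = dst.suboffset + i * srcs[i].type_size;
        emit_mov(d, srcs[i]);
     }
  }

  int main()
  {
     /* PACK dst:UQ src0:UD src1:UD, the packDouble2x32 case above: an
      * 8-byte destination built from two packed 4-byte sources. */
     const reg dst = { 0, 8, 1, 0 };
     lower_pack(dst, { { 1, 4, 1, 0 }, { 2, 4, 1, 0 } });
     return 0;
  }

This prints two stride-2 dword moves at byte offsets 0 and 4, the same
interleaved pattern as the strided MOVs above.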
> Yeah, we'll definitely need something like that when we get to
> SSA-ifying things. I don't see any particular benefit it gives us
> until then, though, and you haven't pointed one out. There are so many
> things we want/need to change with the move to SSA:
>
> - Fixing up the frontend to do predicated things with select
>   instructions, and coalescing those away.
> - Doing SSA with flag registers, so that we can move CMP instructions
>   around more freely and take advantage of all the flag registers
>   instead of hardcoding one for discard.
> - Having the type, quarter control, and exec_all be on the SSA
>   destination, and validating that things are consistent, so that
>   figuring out interference is easier and we can have a LOAD_PAYLOAD
>   whose sources have different quarter controls.
> - Handling strides like what you suggest, where there are no partial
>   writes and no strides, and perhaps only limited partial reads
>   through retyping sources to a smaller type (it's not clear whether
>   having this as a dedicated opcode is easier), with a separate pass
>   that figures out what the stride should be based on HW restrictions.
> - Doing SSA-based register allocation and somehow dealing with
>   incompatible exec_masks, and as a consequence, actually doing real
>   spilling based on register pressure.
> - Probably something I forgot.
>
> So many of these are interconnected, and won't even make sense until
> we have SSA, that it's not always useful to think about the IR we want
> in the distant future in order to decide what to do today. We've
> talked about scrapping the FS back-end and completely rewriting it,
> and even if we don't, there's going to be a *massive* amount of churn.
> I think you've managed to convince me that having a separate lowering
> pass that deals with the DF->F workaround and dropping this patch is
> useful, since it avoids having to do copy propagation with arbitrary
> strides/offsets, but unless there's some other practical benefit, I'm
> not inclined to add more stuff that will only be useful for SSA. If
> the SSA-based thing is going to be useful today, then sure, let's do
> it, but otherwise it's just adding dead code.
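To make the lowering-pass idea concrete, here is a toy model of what a
separate DF->F stride legalization pass could look like. All of the
types and names are made up for illustration; the real pass would
operate on fs_inst and handle more cases:

  #include <iterator>
  #include <list>

  enum type_t { TYPE_F, TYPE_DF };

  /* Toy MOV: virtual register numbers plus types and strides. */
  struct inst {
     int dst, src;
     type_t dst_type, src_type;
     unsigned dst_stride, src_stride;
  };

  static int next_vgrf = 100; /* toy register allocator */

  /* Rewrite  mov dst<1>:F src:DF  (illegal: the DF->F conversion must
   * write with a destination stride of 2) into
   *
   *    mov tmp<2>:F src:DF    -- conversion with the required stride
   *    mov dst<1>:F tmp<2>:F  -- ordinary packing move
   *
   * so the hardware restriction lives in exactly one pass and the
   * visitor can keep emitting packed destinations everywhere. */
  static void legalize_df_to_f(std::list<inst> &insts)
  {
     for (auto it = insts.begin(); it != insts.end(); ++it) {
        if (it->src_type == TYPE_DF && it->dst_type == TYPE_F &&
            it->dst_stride != 2) {
           const int tmp = next_vgrf++;
           inst pack = *it;       /* becomes the packing move */
           pack.src = tmp;
           pack.src_type = TYPE_F;
           pack.src_stride = 2;
           it->dst = tmp;         /* conversion now writes the temp */
           it->dst_stride = 2;
           insts.insert(std::next(it), pack);
        }
     }
  }

Whether copy propagation still needs to know about such restrictions
even with this pass in place is exactly what the rest of the thread
discusses.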
I thought that making the FS back-end more SSA-friendly incrementally
was the main point of e.g. the LOAD_PAYLOAD instruction. Such a PACK
instruction would be immediately useful for the image load/store
implementation, for your implementation of the double-packing
built-ins, for the FS implementation of some of the other GLSL packing
built-ins, and for the VEC4 back-end (in fact the VEC4 back-end already
has a similar instruction, but it's somewhat less general than what I
described). Regardless of whether the back-end IR is in SSA form, it
would likely be easier to CSE than a sequence of partial writes.
Anyway, I wasn't suggesting that you necessarily re-implement your DF
packing built-ins in terms of such an instruction; it's just something
that has been on my wishlist for quite a while, and I was hoping you'd
like the idea enough to go ahead and implement it. ;)

>>>> I think you'd want something like (pre-SIMD lowering):
>>>>
>>>> mov(16) lo<1>:UD input.0<16;8,2>:UD
>>>> mov(16) hi<1>:UD input.1<16;8,2>:UD
>>>>
>>>>> But this needs to be split, and after the splitting pass (without
>>>>> fixing up the subreg_offset like this patch does) and lowering away
>>>>> all the LOAD_PAYLOADs, it gets turned into something like:
>>>>>
>>>>> mov(8) temp1:DF input:DF 1Q
>>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>>> mov(8) temp3<2>:UD temp1:DF 1Q
>>>>> mov(8) temp4<2>:UD temp2:DF 2Q
>>>>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all // the WE_all is due to a
>>>>>                                        // hack in the SIMD lowering
>>>>>                                        // code
>>>>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>>>>>
>>>>> Copy prop should have no problem cleaning up the first two MOVs,
>>>>> but the last two are more of an issue -- we currently don't have a
>>>>> pass that can detect that you can offset all reads/writes of temp3
>>>>> and temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and
>>>>> it would be really hard to write something like that without SSA.
>>>>> But if we pass through the offset while splitting, then things
>>>>> match up:
>>>>>
>>>>> mov(8) temp1:DF input:DF 1Q
>>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>>>>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>>>>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>>>>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>>>>
>>>>> and we should be able to coalesce away all the extra MOVs.
>>>>>
>>>> I doubt this will be any easier to simplify than your first snippet.
>>>> You still have strided and offset writes of temp3-4 and hi, so AFAIK
>>>> both copy propagation and register coalesce are going to freak out
>>>> in both cases.
>>>
>>> It should be quite easy to handle this in copy propagation, though --
>>> allow partial writes (although consider them to write the whole
>>> register, for simplicity) and only propagate if the stride and offset
>>> match (a rough sketch of such a check follows below). It's
>>> effectively doing the (more difficult) first part of coalescing away
>>> the MOV, changing the offset and stride of temp3, for you.
>>>
>> I don't think it's about it being difficult to implement... In fact, a
>> couple of weeks ago I changed the copy propagation pass to handle
>> arbitrary strided copies. But after some thought I concluded it wasn't
>> worth sending for review, because we probably want to eliminate most
>> cases of partial writes from the back-end in the long run anyway, so
>> it seemed like unnecessary complexity.
>>
>>>> IMO we want to avoid partial writes as much as possible, and
>>>> introduce them as late as possible in the optimization loop.
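The check proposed above might look roughly like this, as a toy model
(the real pass tracks far more state; all names here are made up):
record a partial write as clobbering its whole VGRF when invalidating
entries, but only substitute a source when it reads exactly the region
the copy wrote:

  /* One tracked copy: dst_vgrf's region (stride, byte suboffset)
   * currently holds the contents of src_vgrf's region. */
  struct acp_entry {
     int dst_vgrf, src_vgrf;
     unsigned dst_stride, dst_suboffset;
     unsigned src_stride, src_suboffset;
  };

  /* A source region read by some later instruction. */
  struct src_region {
     int vgrf;
     unsigned stride, suboffset;
  };

  static bool try_copy_propagate(const acp_entry &e, src_region &u)
  {
     if (u.vgrf != e.dst_vgrf)
        return false;
     /* Only rewrite the use when it reads exactly what the (possibly
      * partial) copy wrote; mismatched strides or offsets would need
      * real region arithmetic, which is what we want to avoid. */
     if (u.stride != e.dst_stride || u.suboffset != e.dst_suboffset)
        return false;
     u.vgrf = e.src_vgrf;
     u.stride = e.src_stride;
     u.suboffset = e.src_suboffset;
     return true;
  }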
>>>> I think it would be easiest if the visitor didn't emit any partial
>>>> writes at all -- even, say, on CHV/BXT with the register alignment
>>>> limitation on instructions with 64-bit execution types in place.
>>>> Later on we could run a register alignment lowering pass that
>>>> legalizes the strides where they aren't supported natively by the
>>>> hardware. That way you would:
>>>>
>>>> - Be able to implement these restrictions in one place, rather than
>>>>   having to take them into account anywhere you emit or manipulate
>>>>   instructions of 64-bit execution type.
>>>>
>>>> - Emit partial writes *only* in cases where you hit some hardware
>>>>   limitation -- as you're doing it here, the SIMD lowering pass will
>>>>   emit partial writes for all instructions that have a strided
>>>>   source (and emit additional, most often unnecessary, partial
>>>>   writes for instructions that are already a partial write). This
>>>>   breaks the assumption of this pass that the zipping and unzipping
>>>>   copies are cheap to copy-propagate from, and will definitely
>>>>   increase the instruction count when this lowering pass is used,
>>>>   regardless of whether the lowered instructions actually have any
>>>>   of the hardware limitations you are trying to work around.
>>>
>>> I think we can just drop the part that sets the source stride. AFAIK
>>> there aren't any double operations I'm emitting that need a source
>>> with a special stride due to HW limitations; I just put it there for
>>> consistency. I don't think it's actually used, and as you said, it's
>>> probably harmful. We still need to keep it around for the
>>> destinations, since there we're going to emit partial writes anyway,
>>> so we might as well emit partial writes that we have a chance to
>>> coalesce. We could add a special lowering pass for DF->F, since
>>> that's the only operation that requires special handling wrt strides,
>>> but that still doesn't solve the coalescing problem with
>>> packDouble2x32.
>>
>> As I said earlier, this should probably be simplified by copy
>> propagation.
>>
>>> It also won't solve most of the problems you mentioned, since when we
>>> do the lowering to create the partial write, we're still going to
>>> emit a second MOV, and we're still going to want to coalesce it away
>>> wherever we can, so copy propagation still needs to understand all
>>> the stride restrictions, and those restrictions are where most of the
>>> fixes were.
>>
>> But you do realize that it's only your proposed "fix" that makes copy
>> propagation propagate via partial writes and thereby undo your lowered
>> strides, which is what then forces you to make it aware of these
>> hardware restrictions -- otherwise, if you just implemented them in a
>> lowering pass and then ran copy propagation, it wouldn't even consider
>> the partial writes used to legalize the strides as valid propagation
>> candidates AFAIK.
>
> No, we still need to make copy propagation aware of these
> restrictions, even with the lowering pass.
> For example, say you have something like:
>
> MOV(8) temp:F input:DF
> SQRT(8) out1:F temp:F      // some math instruction that uses temp
> ADD(8) out2:F temp:F foo:F // some normal instruction that uses temp
>
> This is going to get legalized to:
>
> MOV(8) temp2<2>:F input:DF
> MOV(8) temp:F temp2<2>:F
> SQRT(8) out1:F temp:F      // some math instruction that uses temp
> ADD(8) out2:F temp:F foo:F // some normal instruction that uses temp
>
> Now, we want to copy-propagate into the normal instruction but not the
> math instruction. Either we could make copy propagation aware of this,
> like my patch does, or we could add yet another pass that legalizes
> math instructions' strides (and sometimes produces code that could be
> CSE'd). But you can't do both in the same pass, since it would then
> fight with copy propagation. Given the two options, fixing copy
> propagation seems easier to me.

Ah, right, I see why it would be useful to fix the copy propagation
pass in addition. I guess that, to avoid duplicating the logic between
copy propagation and the stride lowering pass, it would make sense to
implement your can_take_stride() helper as an instruction query, kind
of like backend_instruction::can_do_source_mods() (a sketch of what
such a query could look like follows at the end of this message).

>>> I didn't have to fix anything besides this code to understand that a
>>> DF->F move needs to have a destination stride of 2 -- nothing else
>>> touches it, so the places where I needed to take stride into account
>>> when manipulating instructions of 64-bit type were here and when
>>> emitting the special MOV for DF->F. So in the end, adding a stride
>>> legalization pass would be more code than what I have now, and for
>>> not much benefit.
>>>
>>>> And BTW, this patch doesn't take into account the larger number of
>>>> VGRFs you would need to hold the temporaries in strided and offset
>>>> form.
>>>
>>> It does for destinations but not sources, since we were already using
>>> regs_written for destinations and over-allocating anyway.
>>>
>> No, it doesn't: it allocates just enough space for the number of
>> vector components times the lowered width.
>>
>>>>> I should probably put some of this in the commit message...
>>>>>
>>>>> Also, I realized that this patch should really be split in two,
>>>>> since the fixes to register renaming are independent of this --
>>>>> whoops!
>>>>>
>>>>>>> +               split_inst.src[j].stride = inst->src[j].stride;
>>>>>>>                 emit_transpose(lbld.group(copy_width, 0),
>>>>>>>                                split_inst.src[j], &src, 1,
>>>>>>>                                src_size, n);
>>>>>>>              }
>>>>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>>>>              /* Allocate enough space to hold the result of the
>>>>>>>               * lowered instruction and fix up the number of
>>>>>>>               * registers written.
>>>>>>>               */
>>>>>>> -            split_inst.dst = dsts[i] =
>>>>>>> -               lbld.vgrf(inst->dst.type, dst_size);
>>>>>>> +            fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>>>>> +            dst.stride = inst->dst.stride;
>>>>>>> +            dst.subreg_offset = inst->dst.subreg_offset;
>>>>>>> +            split_inst.dst = dsts[i] = dst;
>>>>>>>              split_inst.regs_written =
>>>>>>>                 DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>>>                              inst->exec_size);
>>>>>>> --
>>>>>>> 2.4.3
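Here is a sketch of what that shared query could look like, again with
toy stand-ins for the real IR types (the opcode set and the stride
limit are illustrative; only the shape matters):

  /* Toy IR types. */
  enum opcode_t { OP_MOV, OP_ADD, OP_MATH_SQRT };

  struct toy_reg  { unsigned stride; };
  struct toy_inst { opcode_t opcode; toy_reg dst; };

  /* In the spirit of backend_instruction::can_do_source_mods(): a
   * single query that both copy propagation and a stride legalization
   * pass consult, so the hardware rules live in one place. */
  static bool can_take_stride(const toy_inst &inst, unsigned arg,
                              unsigned stride)
  {
     (void)arg; /* a real implementation would be per-source */
     switch (inst.opcode) {
     case OP_MATH_SQRT:
        /* The math pipeline wants the source region to line up with
         * the destination region, so refuse anything else. */
        return stride == inst.dst.stride;
     default:
        /* Ordinary ALU instructions accept the usual regioning. */
        return stride <= 4;
     }
  }

In the MOV/SQRT/ADD example above, copy propagation would call
can_take_stride() before substituting temp2<2>:F for temp: it would
propagate into the ADD but leave the SQRT alone, which is exactly the
split wanted there.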