On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> on 2020/9/2 6:25 PM, Segher Boessenkool wrote:
> > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> >> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
> >>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >>>> have to take the address update into account (scalar normal operation).
> >>>
> >>> From Power4 on already (not sure about Power6, but does anyone care?)
> >>
> >> Thanks for the information, it looks this issue exists for a long time.
> >
> > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > For example, an
> >   ldu 3,16(4)
> > is cracked to an
> >   ld 3,16(4)
> > and an
> >   addi 4,4,16
> > (the addi is not on the critical path of the load).  So it seems to me
> > this shouldn't increase the addressing cost at all?  (The instruction of
> > course is really two insns in one.)
>
> Good question!  I agree that they can execute in parallel, but it depends
> on how we interpret the addressing cost, if it's for required execution
> resource, I think it's off, since comparing with ld, the ldu has two iops
> and extra ALU requirement.

OTOH, if you do not use an ldu you need to use a real addi insn, which
gives you all the same cost (plus it takes more code space and decode
etc. resources).

> I'm not sure its usage elsewhere, but in the
> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> the cost for group (1) is zero, total cost is 4 for this combination.
> for the scenario like:
>     ldx rx, iv     // (1)
>     ...
>     iv = iv + step // (2)
>
> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> say the step update is free.

That seems wrong, but address_cost is used in more places, so is that
really where to fix this?

> We can also see (1) and (2) can also execute in parallel (same iteration).
> If we consider the next iteration, it will have the dependency, but it's
> the same for ldu.  So basically they are similar, I think it's unfair to
> have this difference in the cost modeling.  The cracked addi should have
> its cost here.  Does it make sense?

It should have cost, certainly, but not address_cost I think.  The total
cost of an ldu should be a tiny bit less than that of ld + that of addi;
the address_cost of ldu should be the same as that of ld.

> Apart from that, one P9 specific point is that the update form load isn't
> preferred, the reason is that the instruction can not retire until both
> parts complete, it can hold up subsequent instructions from retiring.
> If the addi stalls (starvation), the instruction can not retire and can
> cause things stuck.  It seems also something we can model here?

This is (almost) no problem on p9, since we no longer have issue groups.
It can hold up older insns from retiring, sure, but they *will* have
finished, and p9 can retire 64 insns per cycle.  The "completion wall"
is gone.  The only problem is if things stick around so long that
resources run out... but you're talking hundreds of insns there.


Segher
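
For illustration only, the IVOPTs arithmetic quoted above can be sketched
as a toy model in Python.  This is not GCC code: the values 4, 0 and -4
are the ones given in the thread, while combination_cost and the other
names are made up for the example, and the assumption that the total is
simply step cost plus group cost is taken from the quoted description.

```python
# Toy model of the IVOPTs cost sums quoted in this thread (not GCC code).
# The numbers 4, 0 and -4 come from the thread; all names are hypothetical.

STEP_COST = 4  # cost of "iv = iv + step", as quoted for Power


def combination_cost(step_cost, group_cost):
    """Total cost of one candidate/use-group pairing: step plus group cost."""
    return step_cost + group_cost


# Normal candidate: "ldx rx, iv" followed by a separate "iv = iv + step".
normal_total = combination_cost(STEP_COST, 0)

# Auto-increment candidate (ldu-like): the group cost is credited with
# minus the step cost, so the update appears to cost nothing.
ainc_total = combination_cost(STEP_COST, -STEP_COST)

print("normal candidate:", normal_total)
print("ainc candidate:  ", ainc_total)
```

The gap between the two totals is the full step cost, which is the
difference being questioned above: the cracked addi still costs
something, even if that cost does not belong in address_cost.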