Hi Segher,

on 2020/9/2 6:25 PM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
>> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
>>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>>>> have to take the address update into account (scalar normal operation).
>>>
>>> From Power4 on already (not sure about Power6, but does anyone care?)
>>
>> Thanks for the information, it looks this issue exists for a long time.
>
> Well, *is* it an issue?  The addressing doesn't get more expensive...
> For example, an
>   ldu 3,16(4)
> is cracked to an
>   ld 3,16(4)
> and an
>   addi 4,4,16
> (the addi is not on the critical path of the load).  So it seems to me
> this shouldn't increase the addressing cost at all?  (The instruction of
> course is really two insns in one.)
>
Good question!  I agree that they can execute in parallel, but it depends on
how we interpret the addressing cost.  If it stands for the execution
resources required, I think a zero cost is off: compared with ld, ldu has two
iops and needs an extra ALU operation.

I'm not sure how the address cost is used elsewhere, but in the context of
IVOPTs on Power, take a scenario like:

    ldx rx, iv          // (1)
    ...
    iv = iv + step      // (2)

For a normal candidate, the step cost is 4 and the cost for group (1) is
zero, so the total cost of this combination is 4.

For an ainc_use candidate (like ldu), the step cost is also 4, but the cost
for group (1) is -4 (minus the step cost), so the total cost is 0.  That
amounts to saying the step update is free.

Note that (1) and (2) can also execute in parallel within the same iteration.
The next iteration does depend on the updated iv, but that is the same for
ldu.  So the two forms are basically similar, and I think it's unfair to have
this difference in the cost modeling.  The cracked addi should have its cost
accounted for here.  Does that make sense?

Apart from that, one P9-specific point is that the update-form load isn't
preferred.  The reason is that the instruction cannot retire until both parts
complete, so it can hold up subsequent instructions from retiring.  If the
addi stalls (starvation), the instruction cannot retire and things can get
stuck.  It seems this is also something we could model here?

BR,
Kewen
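
To make the cost arithmetic above concrete, here is a minimal sketch in Python.  It is only a toy model of the numbers quoted in this mail, not GCC code: total_cost and the variable names are made up for illustration, and the value charged for the cracked addi is an assumption (set equal to the step cost), not something measured or taken from rs6000.

```python
# Toy model of the IVOPTs cost arithmetic discussed above; not GCC code.
STEP_COST = 4  # step cost quoted in the mail for both kinds of candidate


def total_cost(step_cost, group_cost):
    """Total cost of one candidate/group combination (hypothetical helper)."""
    return step_cost + group_cost


# Normal candidate: ldx rx, iv ... iv = iv + step
normal = total_cost(STEP_COST, group_cost=0)

# ainc_use candidate (e.g. ldu): the group cost cancels the step cost,
# so the step update looks free.
ainc = total_cost(STEP_COST, group_cost=-STEP_COST)

# Assumption: charge the cracked addi through the address cost, using the
# same value as the step cost purely for illustration.
CRACKED_ADDI_COST = 4
ainc_adjusted = total_cost(STEP_COST, group_cost=-STEP_COST + CRACKED_ADDI_COST)

print(normal, ainc, ainc_adjusted)
```

With a zero address cost the ainc_use candidate always wins in this model; once the cracked addi is charged, the two candidates come out equal and other factors would have to decide.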