Hi Segher,

>> Good question!  I agree that they can execute in parallel, but it depends
>> on how we interpret the addressing cost, if it's for required execution
>> resource, I think it's off, since comparing with ld, the ldu has two iops
>> and extra ALU requirement.
>
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).

Agreed.

>
>> I'm not sure its usage elsewhere, but in the
>> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
>> the cost for group (1) is zero, total cost is 4 for this combination.
>> for the scenario like:
>>     ldx rx, iv      // (1)
>>     ...
>>     iv = iv + step  // (2)
>>
>> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
>> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
>> say the step update is free.
>
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?

Good point, I had this question in mind too.  The hook is used in several
places, and one of them even uses a magic number.  I planned to check all
its usages once I started investigating.  But as you comment below, this
hook does not look like the appropriate place.

>
>> We can also see (1) and (2) can also execute in parallel (same iteration).
>> If we consider the next iteration, it will have the dependency, but it's
>> the same for ldu.  So basically they are similar, I think it's unfair to
>> have this difference in the cost modeling.  The cracked addi should have
>> its cost here.  Does it make sense?
>
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.

OK, I'll check whether there is some other suitable way to do this in the
context of IVOPTs.  Good to see that we agree that the current modeling is
a bit off on Power. :)

>
>> Apart from that, one P9 specific point is that the update form load isn't
>> preferred, the reason is that the instruction can not retire until both
>> parts complete, it can hold up subsequent instructions from retiring.
>> If the addi stalls (starvation), the instruction can not retire and can
>> cause things stuck.  It seems also something we can model here?
>
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out... but you're talking 100s of insns there.
>

Theoretically it's fine, but the addi starvation was observed in
FP/SIMD instruction-intensive loop code, where it did cause worse
performance. :(

BR,
Kewen
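
For reference, the cost arithmetic quoted in this thread can be written out
as a small sketch.  This is only an illustration of the numbers mentioned
above (step cost 4, group cost 0 or -4), not GCC's actual IVOPTs code; the
struct, field, and function names are hypothetical.

```c
/* Hypothetical, simplified model of the numbers quoted in this thread.
   It is NOT GCC's IVOPTs implementation; all names are made up.  */
struct cand_cost
{
  const char *name;
  int step_cost;   /* cost of the update "iv = iv + step"  */
  int group_cost;  /* cost of using the candidate for address group (1)  */
};

/* Total cost of one candidate/group combination: step plus group cost.  */
static int
total_cost (const struct cand_cost *c)
{
  return c->step_cost + c->group_cost;
}

/* Normal candidate: separate load and add, group cost is zero.  */
static const struct cand_cost normal_cand = { "normal (ldx + addi)", 4, 0 };

/* ainc_use candidate: the group cost is minus the step cost, so the
   step update ends up looking free.  */
static const struct cand_cost ainc_cand = { "ainc_use (ldu)", 4, -4 };
```

With these values, total_cost (&normal_cand) is 4 and total_cost (&ainc_cand)
is 0, which is the difference being questioned above.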