On Fri, Apr 26, 2019 at 02:43:44PM +0800, Kewen.Lin wrote:
> > Does it create worse code now?  What we have before your patch isn't
> > so super either (it has an sldi in the loop, it has two mtctr too).
> > Maybe you can show the generated code?
>
> It's a good question! From the generated codes for the core loop, the
> code after my patch doesn't have bdnz to leverage hardware CTR, it has
> extra cmpld and branch instead, looks worse. But I wrote a tiny case
> to invoke the foo and evaluated the running time, they are equal.
>
> * Measured time:
>   After:
>     real 199.47
>     user 198.35
>     sys 1.11
>   Before:
>     real 199.19
>     user 198.56
>     sys 0.62
>
> Before:
>
> .L3:  # core loop
>         stw 10,0(8)
>         addi 8,8,-1024
>         bdnz .L3

So it didn't use an update instruction here, although it could.  Not that
that changes anything: it would still be three cycles per iteration (that's
the minimum for any loop: instruction fetch is the bottleneck).

> After:
>
> .L3:  # core loop
>         stw 8,0(9)
>         addi 9,9,-1024
>         cmpld 0,9,10   # cmp
>         beqlr 0        # if eq, return
>         stw 8,0(9)
>         addi 9,9,-1024
>         cmpld 0,9,10   # cmp again
>         bne 0,.L3      # if ne, jump to L3.

This is unrolled by a factor of 2.  It should be faster; unfortunately it
updates r9 twice per unrolled loop iteration, making the dependency chain
too long.

The bdnz loop could be like

0:      stw 10,-1024(8)
        bdzlr
        stwu 10,-2048(8)
        bdnz 0b

or similar.  There are multiple problems before we can get that :-)

(The important one is that the pointer (r8 here) should be updated only
once per unrolled loop iteration; just like in the version without bdnz.
Using bdzlr and stwu are just niceties, compared to that).

> I practiced whether we can adjust the decision made in ivopts.

[ snip ]

> Need more investigation.

Yeah.


Segher
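
For illustration only, here is a minimal C sketch of the point about updating the pointer once per unrolled iteration. The source of foo is not shown in this thread, so the function names, the signature, and the STRIDE constant are all assumptions; the only thing taken from the assembly above is the 1024-byte downward stride. Both functions store the same values; they differ only in how many times the pointer is modified per unrolled iteration.

```c
#include <stddef.h>

/* Assumption: 4-byte int, so 256 ints = 1024 bytes, matching "addi 8,8,-1024". */
enum { STRIDE = 256 };

/* Unrolled by 2, pointer bumped after every store.  The second store has to
   wait for the first pointer update, so there are two dependent updates per
   unrolled iteration (the shape of the "After" code). */
void fill_two_updates(int *p, size_t n, int v)
{
    while (n >= 2) {
        *p = v; p -= STRIDE;
        *p = v; p -= STRIDE;
        n -= 2;
    }
    if (n)
        *p = v;
}

/* Unrolled by 2, both stores addressed off the same base, and the pointer
   updated only once per unrolled iteration. */
void fill_one_update(int *p, size_t n, int v)
{
    while (n >= 2) {
        p[0] = v;
        p[-STRIDE] = v;
        p -= 2 * STRIDE;
        n -= 2;
    }
    if (n)
        *p = v;
}
```

Whether a compiler actually emits the second shape (and a bdnz loop with stwu) from source like this depends on the unroller and ivopts decisions discussed above; the sketch only shows the intended structure.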