On Fri, Apr 26, 2019 at 02:43:44PM +0800, Kewen.Lin wrote:
> > Does it create worse code now?  What we have before your patch isn't
> > so super either (it has an sldi in the loop, it has two mtctr too).
> > Maybe you can show the generated code?
> 
> It's a good question! From the generated codes for the core loop, the 
> code after my patch doesn't have bdnz to leverage hardware CTR, it has
> extra cmpld and branch instead, looks worse.  But I wrote a tiny case
> to invoke the foo and evaluated the running time, they are equal.
> 
> * Measured time:
>   After:
>     real 199.47
>     user 198.35
>     sys 1.11
>   Before:
>     real 199.19
>     user 198.56
>     sys 0.62

Before:

> .L3:                  # core loop
>         stw 10,0(8)   
>         addi 8,8,-1024
>         bdnz .L3

So it didn't use an update instruction here, although it could.  Not that
that changes anything: it would still be three cycles per iteration (that's
the minimum for any loop: instruction fetch is the bottleneck).

After:

> .L3:                      # core loop
>         stw 8,0(9)
>         addi 9,9,-1024
>         cmpld 0,9,10      # cmp 
>         beqlr 0           # if eq, return
>         stw 8,0(9)     
>         addi 9,9,-1024   
>         cmpld 0,9,10      # cmp again 
>         bne 0,.L3         # if ne, jump to L3.

This is unrolled by a factor of 2.  It should be faster, but unfortunately it
updates r9 twice per unrolled loop iteration, making the dependency chain too
long.

The bdnz loop could be like

0:
        stw 10,-1024(8)
        bdzlr
        stwu 10,-2048(8)
        bdnz 0b

or similar.  There are multiple problems to solve before we can get that  :-)
(The important one is that the pointer (r8 here) should be updated only
once per unrolled loop iteration; the same holds for the version without
bdnz.  Using bdzlr and stwu are just niceties compared to that.)
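
In C terms, the shape I mean is roughly the following.  This is only a
sketch: the source of foo is not shown in this thread, so the function
names, the uint32_t element type and the STRIDE_WORDS constant are all
made up for illustration.  The second function mirrors the bdnz loop
above, with the pointer biased by one stride and updated once per
unrolled iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4-byte stores stepping down by 1024 bytes, as in the asm. */
enum { STRIDE_WORDS = 1024 / sizeof(uint32_t) };

/* Not unrolled: one store and one pointer update per iteration.
   The caller must keep every intermediate pointer inside the buffer. */
static void fill_simple(uint32_t *first, uint32_t val, size_t n)
{
    uint32_t *p = first;
    for (; n != 0; n--) {
        *p = val;              /* stw  10,0(8)     */
        p -= STRIDE_WORDS;     /* addi 8,8,-1024   */
    }
}

/* Unrolled by 2: two stores, but only one pointer update per unrolled
   iteration.  p is biased one stride above the next store address, so
   first + STRIDE_WORDS must still be inside the buffer. */
static void fill_unrolled2(uint32_t *first, uint32_t val, size_t n)
{
    if (n == 0)
        return;
    uint32_t *p = first + STRIDE_WORDS;
    for (;;) {
        p[-STRIDE_WORDS] = val;    /* stw  10,-1024(8)               */
        if (--n == 0)              /* bdzlr                          */
            return;
        p -= 2 * STRIDE_WORDS;     /* stwu 10,-2048(8): store and... */
        *p = val;                  /* ...update in one instruction   */
        if (--n == 0)              /* bdnz 0b                        */
            return;
    }
}
```

Both functions store to the same n addresses; the only difference is how
often the pointer is written, which is what keeps the dependency chain
short.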


> I practiced whether we can adjust the decision made in ivopts.

[ snip ]

> Need more investigation.

Yeah.


Segher
