On 02/11/2023 at 12:39, Michael Ellerman wrote:
> Matthew Wilcox <wi...@infradead.org> writes:
>> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>>             ptep++;
>>> -           pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>>             addr += PAGE_SIZE;
>>> +           /*
>>> +            * increment the pfn.
>>> +            */
>>> +           pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>
>> when i looked at this, it generated shit code.  did you check?
> 
> I didn't look ...
> 
> <goes and looks>
> 
> It's not super clear cut. There's some difference because pfn_pte()
> contains two extra VM_BUG_ONs.
> 
> But with DEBUG_VM *off* the version using pfn_pte() generates *better*
> code, or at least less code, ~160 instructions vs ~200.
> 
> For some reason the version using PTE_RPN_SHIFT seems to be byte
> swapping the pte an extra two times, each of which generates ~8
> instructions. But I can't see why.
> 
> I tried a few other things and couldn't come up with anything that
> generated better code. But I'll keep poking at it tomorrow.

On PPC32 the version using PTE_RPN_SHIFT is better. Here is what the 
main loop of set_ptes() looks like:

  22c:  55 29 f0 be     srwi    r9,r9,2
  230:  7d 29 03 a6     mtctr   r9
  234:  39 3f 10 00     addi    r9,r31,4096
  238:  39 1f 20 00     addi    r8,r31,8192
  23c:  39 5f 30 00     addi    r10,r31,12288
  240:  3b ff 40 00     addi    r31,r31,16384
  244:  91 3e 00 04     stw     r9,4(r30)
  248:  91 1e 00 08     stw     r8,8(r30)
  24c:  91 5e 00 0c     stw     r10,12(r30)
  250:  97 fe 00 10     stwu    r31,16(r30)
  254:  42 00 ff e0     bdnz    234 <set_ptes+0x78>

With the version using pfn_pte(), the main loop is:

  218:  54 e9 f8 7e     srwi    r9,r7,1
  21c:  7d 29 03 a6     mtctr   r9
  220:  57 e9 00 26     clrrwi  r9,r31,12
  224:  39 29 10 00     addi    r9,r9,4096
  228:  57 ff 05 3e     clrlwi  r31,r31,20
  22c:  7d 29 fb 78     or      r9,r9,r31
  230:  55 3f 00 26     clrrwi  r31,r9,12
  234:  3b ff 10 00     addi    r31,r31,4096
  238:  55 28 05 3e     clrlwi  r8,r9,20
  23c:  7f ff 43 78     or      r31,r31,r8
  240:  91 3d 00 04     stw     r9,4(r29)
  244:  93 fd 00 08     stw     r31,8(r29)
  248:  3b bd 00 08     addi    r29,r29,8
  24c:  42 00 ff d4     bdnz    220 <set_ptes+0x64>

Not only is the loop bigger, but it is also only unrolled by 2 while 
the first one is unrolled by 4 (r7 and r9 contain the same value).

Therefore, although the PTE_RPN_SHIFT version is 87 instructions while 
the other one is only 81 instructions, the former looks better.
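
To illustrate where the extra clrrwi/clrlwi/or sequence comes from, here 
is a minimal user-space sketch of the two ways of stepping to the next 
pte. It is not the kernel code: the 32-bit pte layout, the 4k page size, 
PTE_FLAGS_MASK and the helper names make_pte(), next_pte_add(), 
next_pte_pfn() and check_equivalent() are assumptions made up for the 
example, and the real PPC32 pfn_pte()/pte_pgprot() may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel definitions (32-bit pte, 4k pages). */
#define PAGE_SHIFT      12
#define PTE_RPN_SHIFT   PAGE_SHIFT
#define PTE_FLAGS_MASK  ((UINT32_C(1) << PTE_RPN_SHIFT) - 1)

typedef struct { uint32_t pte; } pte_t;

static inline uint32_t pte_val(pte_t p)    { return p.pte; }
static inline pte_t    make_pte(uint32_t v) { pte_t p = { v }; return p; } /* plays the role of __pte() */

/* Split a pte into its page frame number and its flag bits. */
static inline uint32_t pte_pfn(pte_t p)    { return pte_val(p) >> PTE_RPN_SHIFT; }
static inline uint32_t pte_pgprot(pte_t p) { return pte_val(p) & PTE_FLAGS_MASK; }

/* Rebuild a pte from a page frame number and flag bits. */
static inline pte_t pfn_pte(uint32_t pfn, uint32_t prot)
{
	return make_pte((pfn << PTE_RPN_SHIFT) | prot);
}

/* Old form: a single add on the raw pte value. */
static inline pte_t next_pte_add(pte_t pte)
{
	return make_pte(pte_val(pte) + (UINT32_C(1) << PTE_RPN_SHIFT));
}

/*
 * New form: extract the pfn, increment it, then merge the flags back in.
 * Unless the compiler proves this equals the plain add, it has to mask
 * out both halves and OR them together again on every iteration.
 */
static inline pte_t next_pte_pfn(pte_t pte)
{
	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}

/* Both forms must produce the same next pte for a given input. */
static inline void check_equivalent(pte_t pte)
{
	assert(pte_val(next_pte_add(pte)) == pte_val(next_pte_pfn(pte)));
}
```

Under those assumptions both helpers return the same value, so the 
difference seen in the disassembly would be purely a code generation 
matter: the add form leaves the compiler with one addi per pte, which it 
can then fold into the unrolled stores.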

Christophe
