On 02/11/2023 at 12:39, Michael Ellerman wrote:
> Matthew Wilcox <wi...@infradead.org> writes:
>> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>>  		ptep++;
>>> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>>  		addr += PAGE_SIZE;
>>> +		/*
>>> +		 * increment the pfn.
>>> +		 */
>>> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>
>> when i looked at this, it generated shit code.  did you check?
>
> I didn't look ...
>
> <goes and looks>
>
> It's not super clear cut. There's some difference because pfn_pte()
> contains two extra VM_BUG_ONs.
>
> But with DEBUG_VM *off* the version using pfn_pte() generates *better*
> code, or at least less code, ~160 instructions vs ~200.
>
> For some reason the version using PTE_RPN_SHIFT seems to be byte
> swapping the pte an extra two times, each of which generates ~8
> instructions. But I can't see why.
>
> I tried a few other things and couldn't come up with anything that
> generated better code. But I'll keep poking at it tomorrow.
On PPC32 the version using PTE_RPN_SHIFT is better. Here is what the main
loop of set_ptes() looks like:

     22c:   55 29 f0 be     srwi    r9,r9,2
     230:   7d 29 03 a6     mtctr   r9
     234:   39 3f 10 00     addi    r9,r31,4096
     238:   39 1f 20 00     addi    r8,r31,8192
     23c:   39 5f 30 00     addi    r10,r31,12288
     240:   3b ff 40 00     addi    r31,r31,16384
     244:   91 3e 00 04     stw     r9,4(r30)
     248:   91 1e 00 08     stw     r8,8(r30)
     24c:   91 5e 00 0c     stw     r10,12(r30)
     250:   97 fe 00 10     stwu    r31,16(r30)
     254:   42 00 ff e0     bdnz    234 <set_ptes+0x78>

With the version using pfn_pte(), the main loop is:

     218:   54 e9 f8 7e     srwi    r9,r7,1
     21c:   7d 29 03 a6     mtctr   r9
     220:   57 e9 00 26     clrrwi  r9,r31,12
     224:   39 29 10 00     addi    r9,r9,4096
     228:   57 ff 05 3e     clrlwi  r31,r31,20
     22c:   7d 29 fb 78     or      r9,r9,r31
     230:   55 3f 00 26     clrrwi  r31,r9,12
     234:   3b ff 10 00     addi    r31,r31,4096
     238:   55 28 05 3e     clrlwi  r8,r9,20
     23c:   7f ff 43 78     or      r31,r31,r8
     240:   91 3d 00 04     stw     r9,4(r29)
     244:   93 fd 00 08     stw     r31,8(r29)
     248:   3b bd 00 08     addi    r29,r29,8
     24c:   42 00 ff d4     bdnz    220 <set_ptes+0x64>

Not only is the loop bigger, it is also unrolled only by 2, while the
first one is unrolled by 4 (r7 and r9 contain the same value).

Therefore, although the PTE_RPN_SHIFT version is 87 instructions long
versus only 81 for the other one, the former looks better.

Christophe