Very interesting, particularly the outrageous assembly for
pmap_{zero,copy}_page().

Is there some way to tell the compiler that the address is already
4096-aligned and avoid the conditionals? Failing that, we could just
adopt the FreeBSD assembly for this.

Does anyone see a problem with introducing a vfs.timestamp_precision
to avoid the rdtscp?

Jaromir

On Sun, 19 Jul 2020 at 13:21, Mateusz Guzik <[email protected]> wrote:
>
> Hello,
>
> I recently took an opportunity to run cross-systems microbenchmarks
> with will-it-scale and included NetBSD (amd64).
>
> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
> [no linux in this doc, I will probably create a new one soon(tm)]
>
> The system has a lot of problems in the vfs layer, vm is a mixed bag
> with multithreaded cases lagging behind and some singlethreaded being
> pretty good (and at least one winning against the other systems).
>
> Notes:
> - rdtscp is very expensive in vms, yet the kernel unconditionally
> performs it by calling vfs_timestamp. Both FreeBSD and DragonFly BSD
> have a knob to change the resolution (and consequently avoid the
> instruction), I think you should introduce it and default to less
> accuracy on vms. Sample results:
> stock pipe1: 2413901
> patched pipe1: 3147312
> stock vfsmix: 13889
> patched vfsmix: 73477
> - sched_yield is apparently a nop when the binary is not linked with
> pthread. this does not match other systems and is probably a bug.
> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
> checking for alignment. The compiler does not know what values can be
> assigned to pmap_direct_base and improvises.
>
> 0xffffffff805200c3 <+0>:  add  0xf93b46(%rip),%rdi  # 0xffffffff814b3c10 <pmap_direct_base>
> 0xffffffff805200ca <+7>:  mov  $0x1000,%edx
> 0xffffffff805200cf <+12>: xor  %eax,%eax
> 0xffffffff805200d1 <+14>: test $0x1,%dil
> 0xffffffff805200d5 <+18>: jne  0xffffffff805200ff <pmap_zero_page+60>
> 0xffffffff805200d7 <+20>: test $0x2,%dil
> 0xffffffff805200db <+24>: jne  0xffffffff8052010b <pmap_zero_page+72>
> 0xffffffff805200dd <+26>: test $0x4,%dil
> 0xffffffff805200e1 <+30>: jne  0xffffffff80520116 <pmap_zero_page+83>
> 0xffffffff805200e3 <+32>: mov  %edx,%ecx
> 0xffffffff805200e5 <+34>: shr  $0x3,%ecx
> 0xffffffff805200e8 <+37>: rep stos %rax,%es:(%rdi)
> 0xffffffff805200eb <+40>: test $0x4,%dl
> 0xffffffff805200ee <+43>: je   0xffffffff805200f1 <pmap_zero_page+46>
> 0xffffffff805200f0 <+45>: stos %eax,%es:(%rdi)
> 0xffffffff805200f1 <+46>: test $0x2,%dl
> 0xffffffff805200f4 <+49>: je   0xffffffff805200f8 <pmap_zero_page+53>
> 0xffffffff805200f6 <+51>: stos %ax,%es:(%rdi)
> 0xffffffff805200f8 <+53>: and  $0x1,%edx
> 0xffffffff805200fb <+56>: je   0xffffffff805200fe <pmap_zero_page+59>
> 0xffffffff805200fd <+58>: stos %al,%es:(%rdi)
> 0xffffffff805200fe <+59>: retq
> 0xffffffff805200ff <+60>: stos %al,%es:(%rdi)
> 0xffffffff80520100 <+61>: mov  $0xfff,%edx
> 0xffffffff80520105 <+66>: test $0x2,%dil
> 0xffffffff80520109 <+70>: je   0xffffffff805200dd <pmap_zero_page+26>
> 0xffffffff8052010b <+72>: stos %ax,%es:(%rdi)
> 0xffffffff8052010d <+74>: sub  $0x2,%edx
> 0xffffffff80520110 <+77>: test $0x4,%dil
> 0xffffffff80520114 <+81>: je   0xffffffff805200e3 <pmap_zero_page+32>
> 0xffffffff80520116 <+83>: stos %eax,%es:(%rdi)
> 0xffffffff80520117 <+84>: sub  $0x4,%edx
> 0xffffffff8052011a <+87>: jmp  0xffffffff805200e3 <pmap_zero_page+32>
>
> The thing to do in my opinion is to just provide dedicated asm funcs.
> This is the equivalent on FreeBSD (ifunc'ed):
>
> ENTRY(pagezero_std)
>         PUSH_FRAME_POINTER
>         movl    $PAGE_SIZE/8,%ecx
>         xorl    %eax,%eax
>         rep
>         stosq
>         POP_FRAME_POINTER
>         ret
> END(pagezero_std)
>
> ENTRY(pagezero_erms)
>         PUSH_FRAME_POINTER
>         movl    $PAGE_SIZE,%ecx
>         xorl    %eax,%eax
>         rep
>         stosb
>         POP_FRAME_POINTER
>         ret
> END(pagezero_erms)
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
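On the question of telling the compiler about the alignment: below is a
minimal userland sketch, assuming GCC or Clang. __builtin_assume_aligned()
is a real builtin in both compilers; zero_page_aligned() and
SKETCH_PAGE_SIZE are hypothetical stand-ins, not NetBSD code. Whether the
compiler then drops the alignment prologue from the expanded memset
depends on the compiler version and flags, so the resulting disassembly
would still need to be checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant for this sketch; the kernel would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical stand-in for the body of pmap_zero_page().
 * __builtin_assume_aligned() promises the compiler that the returned
 * pointer is aligned to SKETCH_PAGE_SIZE, so it is free to skip the
 * byte/word/dword alignment tests when expanding memset().
 */
static void
zero_page_aligned(void *va)
{
	/* The promise must really hold, otherwise behaviour is undefined. */
	assert(((uintptr_t)va & (SKETCH_PAGE_SIZE - 1)) == 0);

	void *p = __builtin_assume_aligned(va, SKETCH_PAGE_SIZE);

	memset(p, 0, SKETCH_PAGE_SIZE);
}
```

If this does not produce a plain "rep stos" in practice, the dedicated
asm routines quoted above remain the more predictable option.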

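For the vfs.timestamp_precision idea, here is a rough userland sketch of
how such a knob could select between a cheap cached time and a precise
clock read. Everything in it is an assumption for illustration: the TSP_*
names, sketch_timestamp(), tick_update() and cached_time are hypothetical,
and clock_gettime() merely stands in for whatever precise time source the
kernel would use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical precision levels, coarsest first. */
enum { TSP_SEC, TSP_HZ, TSP_USEC, TSP_NSEC };

/* Hypothetical knob; in a kernel this would be exposed as a sysctl. */
static int timestamp_precision = TSP_NSEC;

/* Hypothetical cached time, refreshed by a periodic tick. */
static struct timespec cached_time;

static void
tick_update(void)
{
	clock_gettime(CLOCK_REALTIME, &cached_time);
}

static void
sketch_timestamp(struct timespec *tsp)
{
	assert(tsp != NULL);

	switch (timestamp_precision) {
	case TSP_SEC:
		/* Cached seconds only: no precise clock read at all. */
		tsp->tv_sec = cached_time.tv_sec;
		tsp->tv_nsec = 0;
		break;
	case TSP_HZ:
		/* Cached value with tick granularity: still no clock read. */
		*tsp = cached_time;
		break;
	case TSP_USEC:
		/* Precise read, truncated to microseconds. */
		clock_gettime(CLOCK_REALTIME, tsp);
		tsp->tv_nsec -= tsp->tv_nsec % 1000;
		break;
	case TSP_NSEC:
	default:
		/* Precise read at full resolution (the expensive path). */
		clock_gettime(CLOCK_REALTIME, tsp);
		break;
	}
}
```

With the knob set to TSP_SEC or TSP_HZ the precise clock is never read,
which is the path that would avoid the rdtscp cost on vms.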