Hello, I recently took an opportunity to run cross-systems microbenchmarks with will-it-scale and included NetBSD (amd64).

https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
[no linux in this doc, I will probably create a new one soon(tm)]

The system has a lot of problems in the vfs layer. vm is a mixed bag: multithreaded cases lag behind, while some singlethreaded ones are pretty good (and at least one wins against the other systems).

Notes:

- rdtscp is very expensive in VMs, yet the kernel executes it unconditionally by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have a knob to change the resolution (and consequently avoid the instruction). I think you should introduce it and default to less accuracy on VMs. Sample results:

  stock pipe1:    2413901
  patched pipe1:  3147312
  stock vfsmix:   13889
  patched vfsmix: 73477

- sched_yield is apparently a nop when the binary is not linked with pthread. This does not match other systems and is probably a bug.

- pmap_zero_page/pmap_copy_page compile to atrocious code which keeps checking for alignment. The compiler does not know what values can be assigned to pmap_direct_base and improvises.

0xffffffff805200c3 <+0>:  add 0xf93b46(%rip),%rdi # 0xffffffff814b3c10 <pmap_direct_base>
0xffffffff805200ca <+7>:  mov $0x1000,%edx
0xffffffff805200cf <+12>: xor %eax,%eax
0xffffffff805200d1 <+14>: test $0x1,%dil
0xffffffff805200d5 <+18>: jne 0xffffffff805200ff <pmap_zero_page+60>
0xffffffff805200d7 <+20>: test $0x2,%dil
0xffffffff805200db <+24>: jne 0xffffffff8052010b <pmap_zero_page+72>
0xffffffff805200dd <+26>: test $0x4,%dil
0xffffffff805200e1 <+30>: jne 0xffffffff80520116 <pmap_zero_page+83>
0xffffffff805200e3 <+32>: mov %edx,%ecx
0xffffffff805200e5 <+34>: shr $0x3,%ecx
0xffffffff805200e8 <+37>: rep stos %rax,%es:(%rdi)
0xffffffff805200eb <+40>: test $0x4,%dl
0xffffffff805200ee <+43>: je 0xffffffff805200f1 <pmap_zero_page+46>
0xffffffff805200f0 <+45>: stos %eax,%es:(%rdi)
0xffffffff805200f1 <+46>: test $0x2,%dl
0xffffffff805200f4 <+49>: je 0xffffffff805200f8 <pmap_zero_page+53>
0xffffffff805200f6 <+51>: stos %ax,%es:(%rdi)
0xffffffff805200f8 <+53>: and $0x1,%edx
0xffffffff805200fb <+56>: je 0xffffffff805200fe <pmap_zero_page+59>
0xffffffff805200fd <+58>: stos %al,%es:(%rdi)
0xffffffff805200fe <+59>: retq
0xffffffff805200ff <+60>: stos %al,%es:(%rdi)
0xffffffff80520100 <+61>: mov $0xfff,%edx
0xffffffff80520105 <+66>: test $0x2,%dil
0xffffffff80520109 <+70>: je 0xffffffff805200dd <pmap_zero_page+26>
0xffffffff8052010b <+72>: stos %ax,%es:(%rdi)
0xffffffff8052010d <+74>: sub $0x2,%edx
0xffffffff80520110 <+77>: test $0x4,%dil
0xffffffff80520114 <+81>: je 0xffffffff805200e3 <pmap_zero_page+32>
0xffffffff80520116 <+83>: stos %eax,%es:(%rdi)
0xffffffff80520117 <+84>: sub $0x4,%edx
0xffffffff8052011a <+87>: jmp 0xffffffff805200e3 <pmap_zero_page+32>

The thing to do in my opinion is to just provide dedicated asm funcs.
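
To illustrate the alignment point in userspace (this is not the proposed fix, and none of it is NetBSD code), here is a minimal sketch assuming GCC or Clang. The names sketch_zero_page, sketch_page_is_zero and SKETCH_PAGE_SIZE are made up for the example. The builtin only tells the compiler that the pointer is page-aligned, so it may skip the byte/word/dword prologue when it expands memset inline. Whether it actually does depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, as on amd64 */

/*
 * Hypothetical helper, not a NetBSD function.  The caller promises that
 * `page` is page-aligned; __builtin_assume_aligned (a GCC/Clang builtin)
 * hands that promise to the compiler so it may drop the alignment checks
 * when it expands memset inline.
 */
static void
sketch_zero_page(void *page)
{
	/* Catch a broken promise in debug builds. */
	assert(((uintptr_t)page & (SKETCH_PAGE_SIZE - 1)) == 0);

	void *p = __builtin_assume_aligned(page, SKETCH_PAGE_SIZE);
	memset(p, 0, SKETCH_PAGE_SIZE);
}

/* Returns 1 if every byte of the page is zero, 0 otherwise. */
static int
sketch_page_is_zero(const void *page)
{
	const unsigned char *b = page;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
		if (b[i] != 0)
			return 0;
	}
	return 1;
}
```

Comparing the output of objdump -d for this function with and without the builtin at -O2 is a quick way to see whether a given compiler still emits the checks. Dedicated asm routines avoid relying on that.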

This is the equivalent page-zeroing code on FreeBSD (ifunc'ed):

ENTRY(pagezero_std)
	PUSH_FRAME_POINTER
	movl	$PAGE_SIZE/8,%ecx
	xorl	%eax,%eax
	rep stosq
	POP_FRAME_POINTER
	ret
END(pagezero_std)

ENTRY(pagezero_erms)
	PUSH_FRAME_POINTER
	movl	$PAGE_SIZE,%ecx
	xorl	%eax,%eax
	rep stosb
	POP_FRAME_POINTER
	ret
END(pagezero_erms)

-- 
Mateusz Guzik <mjguzik gmail.com>
