On Wed, 2024-01-17 at 19:45 +0000, Anton Ivanov wrote:
> On 17/01/2024 17:17, Benjamin Berg wrote:
> > Hi,
> > 
> > On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > > Once we are there, we can look for optimizations. The fundamental
> > > problem is that page faults (even minor ones) are extremely
> > > expensive for us.
> > > 
> > > Just throwing out ideas on what we could do:
> > >  1. SECCOMP, as that reduces the amount of context switches.
> > >     (Yes, I know I should resubmit the patchset.)
> > >  2. Maybe we can disable/cripple page access tracking? If we
> > >     initially mark all pages as accessed by userspace (i.e.
> > >     pte_mkyoung), then we avoid a minor page fault on first
> > >     access. Doing that will mess with page eviction though.
> > >  3. Do DAX (direct_access) for files, i.e. mmap files directly in
> > >     the host kernel rather than through UML.
> > >     With a hostfs-like file system, one should be able to add an
> > >     intermediate block device that maps host files to physical
> > >     pages, then do DAX in the FS.
> > >     For disk images, the existing iomem infrastructure should be
> > >     usable; this should work with any DAX-enabled filesystem
> > >     (ext2, ext4, xfs, virtiofs, erofs).
> > 
> > So, I experimented quite a bit over Christmas (including getting DAX
> > to work with virtiofs). At the end of all this my conclusion is that
> > insufficient page table synchronization is our main problem.
> > 
> > Basically, right now we rely on the flush_tlb_* functions from the
> > kernel, but these are only called when TLB entries are removed, *not*
> > when new PTEs are added (there is also update_mmu_cache, but it isn't
> > enough either). Effectively this means that new page table entries
> > will often only be synced because the userspace code runs into an
> > unnecessary segfault.
> > 
> > Really, what we need is a set_pte_at() implementation that marks the
> > memory range for synchronization. Then we can make sure we sync it
> > before switching to the userspace process (the equivalent of running
> > flush_tlb_mm_range right now).
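To make this a bit more concrete, the rough direction I have in mind is
something like the following (untested sketch; apart from the standard
set_pte_at() signature, the names are invented, in particular the
sync_start/sync_end fields):

	/* Track the VA range whose PTEs changed since the last sync
	 * with the userspace process. sync_start/sync_end would be
	 * hypothetical new fields in UML's per-mm context. */
	static void um_mark_range_for_sync(struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
	{
		struct mm_context *ctx = &mm->context;

		ctx->sync_start = min(ctx->sync_start, start);
		ctx->sync_end = max(ctx->sync_end, end);
	}

	/* Called by the core MM whenever a PTE is installed, i.e. also
	 * when entries are added, not just removed like flush_tlb_*. */
	static inline void set_pte_at(struct mm_struct *mm,
				      unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		set_pte(ptep, pte);
		um_mark_range_for_sync(mm, addr, addr + PAGE_SIZE);
	}

The pending range would start out empty (sync_start = ULONG_MAX,
sync_end = 0) and be flushed and reset right before switching to the
userspace process, where flush_tlb_mm_range runs today.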
> > I think we should:
> >  * Rewrite the userspace syscall code
> >    - Support delaying the execution of syscalls
> >    - Only support mmap/munmap/mprotect and LDT
> >    - Do simple compression of consecutive syscalls here
> >    - Drop the hand-written assembler
> >  * Improve the tlb.c code
> >    - remove the HVC abstraction
> 
> Cool. That was not working particularly well. I tried to improve it a
> few times, but ripping it out and replacing it is probably a better
> idea.

Hm, now I realise that we still want mmap() syscall compression for the
kernel itself in tlb.c (see the sketch in the PPPS below).

> >    - never force immediate syscall execution
> >  * Let set_pte_at() track which memory ranges need syncing
> >  * At that point we should be able to:
> >    - drop copy_context_skas0
> >    - make flush_tlb_* no-ops
> >    - drop flush_tlb_page from handle_page_fault
> >    - move unmap() from flush_thread to init_new_context
> >      (or do it as part of start_userspace)
> > 
> > So, I did try this using nasty hacks, and IIRC one of my runs went
> > from 21s to 16s and another from 63s to 56s, which seems like a nice
> > improvement.
> 
> Excellent. I assume you were using hostfs as usual, right? If so, the
> difference is likely to be even more noticeable on ubd.

Yes, I was mostly testing hostfs. Initially also virtiofs with DAX, but
I went back as that didn't result in a page fault count improvement
once I made some other adjustments.

Benjamin

> > 
> > Benjamin
> > 
> > PS: As for DAX, it doesn't really seem to help performance. It
> > didn't seem to lower the number of page faults in UML. And, from my
> > perspective, it isn't really worth it just for the memory sharing.
> > 
> > PPS: dirty/young tracking seemed to cause only a small number of
> > page faults in the grand scheme of things. So probably not something
> > worth following up on.
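PPPS: By "simple compression of consecutive syscalls" above I mean
roughly the following (sketch only; the structure and all names are
invented): before queueing a new mmap() for the stub, check whether it
simply extends the previously queued one and widen that entry instead
of adding another.

	/* One queued stub syscall (hypothetical bookkeeping struct). */
	struct queued_syscall {
		int op;			/* hypothetical OP_MMAP, ... */
		unsigned long addr;
		unsigned long len;
		unsigned long offset;	/* file offset for mmap */
		int prot;
		int fd;
	};

	/* Merge a new mmap request into the last queued entry if it
	 * continues it exactly; return false if a separate entry is
	 * needed. */
	static bool merge_mmap(struct queued_syscall *last,
			       unsigned long addr, unsigned long len,
			       int prot, int fd, unsigned long offset)
	{
		if (last->op != OP_MMAP || last->fd != fd ||
		    last->prot != prot)
			return false;

		/* Must be contiguous both in virtual address space and
		 * in the underlying file. */
		if (addr != last->addr + last->len ||
		    offset != last->offset + last->len)
			return false;

		last->len += len;
		return true;
	}

munmap and mprotect calls could be merged the same way (adjacent
ranges, same protection), which is what would keep the number of stub
round-trips down.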