I've been thinking about ways to increase softmmu performance by speeding up TLB accesses.

Last year, Pranith proposed to increase the size of the TLBs:
  https://patchwork.kernel.org/patch/9927793/

The problem with that approach is that it slows down flushes significantly, since they have to memset(-1) large amounts of memory. And flushes can be very frequent, e.g. during bootup.

This paper quantifies this issue (with SPEC06 but also a "kernel boot" workload), and proposes a way to avoid it:
  "Optimizing Memory Translation Emulation in Full System Emulators"
  Xin Tong, Toshihiko Koju, and Motohiro Kawahito
  https://dl.acm.org/citation.cfm?id=2686034

The ACM version is behind a paywall, this other one is not:
  http://domino.research.ibm.com/library/cyberdig.nsf/papers/9F3255F2937BC44885257C750004B9F7/$File/RT0956.pdf

The idea is to allocate a new TLB on a flush, thereby removing the need for memset at flush time (the paper assumes that the allocation+memset has previously been done, possibly in another thread).

I like the idea of allocating a new TLB, since:

- This will work with MTTCG; we'd reclaim the old array with RCU, which is OK because CPUs always execute under an RCU critical section.

- The lookup "fast path" would take a hit due to executing an extra instruction, but as the paper shows the corresponding impact is very small compared to the benefits of having a larger TLB.

An additional improvement that I have thought of is to get rid of memset(-1) altogether. Instead, we'd store addresses in the TLB as $real_address+1, so that 0xff..ff is stored as 0x00..00. That way, instead of malloc+memset we'd just calloc a new TLB, which should be much faster since we'd most likely get zeroed pages from mmap. The cost would be an additional instruction in the fast path to subtract 1 from the address in the TLB, but this extra instruction would be essentially free in modern CPUs.

I have looked into implementing this approach but it would take me a long time to get proficient enough to generate the code I want from the i386 TCG backend.
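
To make the two ideas above concrete (flush by swapping in a fresh table, and storing address+1 so that a calloc'ed table is already invalid), here is a minimal single-threaded sketch in plain C. It is not QEMU code: the ToyTLB type, the toy_* functions, the payload field and the sizes are all hypothetical names chosen for illustration, and the RCU-deferred reclamation is replaced by a plain free().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* All names and sizes below are hypothetical, not QEMU's. */
#define TOY_PAGE_BITS 12
#define TOY_TLB_BITS  8
#define TOY_TLB_SIZE  ((size_t)1 << TOY_TLB_BITS)
#define TOY_PAGE_MASK (~(((uint64_t)1 << TOY_PAGE_BITS) - 1))

typedef struct {
    uint64_t tag_plus_1; /* page address + 1; 0 decodes to 0xff..ff, i.e. invalid */
    uint64_t payload;    /* stand-in for whatever the entry maps to */
} ToyTLBEntry;

typedef struct {
    ToyTLBEntry *table;
} ToyTLB;

/* calloc gives an all-zero table, which is all-invalid under the +1 encoding. */
static int toy_tlb_init(ToyTLB *tlb)
{
    tlb->table = calloc(TOY_TLB_SIZE, sizeof(ToyTLBEntry));
    return tlb->table ? 0 : -1;
}

static size_t toy_tlb_index(uint64_t vaddr)
{
    return (size_t)(vaddr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);
}

static void toy_tlb_fill(ToyTLB *tlb, uint64_t vaddr, uint64_t payload)
{
    assert(tlb && tlb->table);
    ToyTLBEntry *e = &tlb->table[toy_tlb_index(vaddr)];
    /* A page-aligned address plus 1 can never be 0, so 0 stays reserved. */
    e->tag_plus_1 = (vaddr & TOY_PAGE_MASK) + 1;
    e->payload = payload;
}

/* Returns 1 on a hit and stores the payload, 0 on a miss. */
static int toy_tlb_lookup(const ToyTLB *tlb, uint64_t vaddr, uint64_t *payload)
{
    const ToyTLBEntry *e = &tlb->table[toy_tlb_index(vaddr)];
    /* The extra "subtract 1" of the fast path. A zeroed entry decodes to
     * 0xff..ff, which is not page-aligned and therefore never matches. */
    if (e->tag_plus_1 - 1 != (vaddr & TOY_PAGE_MASK)) {
        return 0;
    }
    *payload = e->payload;
    return 1;
}

/* Flush by swapping in a fresh zeroed table instead of memset(-1). */
static int toy_tlb_flush(ToyTLB *tlb)
{
    ToyTLBEntry *fresh = calloc(TOY_TLB_SIZE, sizeof(ToyTLBEntry));
    if (!fresh) {
        return -1;
    }
    ToyTLBEntry *old = tlb->table;
    tlb->table = fresh;
    free(old); /* single-threaded sketch; under MTTCG this would be deferred via RCU */
    return 0;
}
```

Note that page 0 still works: it is stored as 1 and decodes back to 0, while an untouched (zero) slot decodes to 0xff..ff and misses. The sketch does the subtraction in C; the open question remains how to emit the equivalent from the TCG backend.
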
If someone could help with the i386 TCG backend changes, I could take care of the rest, i.e. changes to C code and measuring the perf impact. If we got good results, we could then look into implementing this for all TCG backends.

BTW the paper also has other interesting ideas, for example "uninlining" TLB lookups, which they claim increases performance by 6%. I also looked into this but I fail to see how this could ever be maintainable, since we'd have to generate many subroutines, one for each combination of generation-time parameters that tcg_out_tlb_load takes.

Thanks,

		Emilio