After a discussion with Richard yesterday on IRC, I ran some numbers on the performance impact of the guard pages added at the end of TCG regions. These 4kB guard pages prevent the use of transparent huge pages for the last 2MB of a region.
The benchmark I used is the bootup+shutdown of debian on aarch64. This generates code at a high rate (~350 MB in total), which performance-wise makes it sensitive to changes in page lookup latency for memory accesses to the TB cache.

Find here a chart with several -smp 1 runs for different configurations, with the error bars delimiting a 95% confidence interval for the average (IOW, there's a 95% chance that the true average is within the error bars):

  https://imgur.com/G5Gd39O

I used -tb-size 1024 to make sure there aren't any flushes.

Configurations:
- master: single region. It does use huge pages whenever pages are written to (checked via /proc/$pid/smaps).
- 2/4M no guards: forced the region size to 2/4 MB, and removed the guard pages at the end.
- 2/4M guards: same as above, but keeping the guard pages.
- nohuge: single region, but passing MADV_NOHUGEPAGE to madvise.

As seen in the chart, 2M-guards performs similarly to nohuge. This makes sense, because both end up not using huge pages. Master, 2M-no-guards and 4M-no-guards have similar performance, about 3% better than nohuge/2M-guards. (There's quite a bit of noise because I only ran each configuration 7 times, but the confidence intervals overlap widely.) 4M-guards performs in between the above two groups: with 4M-guards, one half of each region is a huge page (2M), while the other half is broken into 4K pages due to the guard page.

The conclusion I draw from the above is that as long as we keep regions sufficiently large (>= 4M), we won't be able to measure performance regressions due to reduced huge page use.

Given the above, I plotted the region sizes we obtain for different -smp and -tb-size combinations:

  https://imgur.com/RDF56Nv

We can see that region_size == 2 only occurs when we either have very large -smp's, or small TB cache sizes (<= 256 MB).

So, what to do based on all this? I think the current implementation makes sense.
That said, two things we could consider doing are:

- Remove the guard pages. I'd rather keep them in, though, since their cost is negligible in most scenarios. If we really wanted to recover this performance, we could enable the guards (by calling mprotect) only under TCG_DEBUG.
- Change tcg_n_regions() to assign larger regions. I am not convinced of this: at most we'd gain 3% in performance, but we might end up wasting more space (recall that not all vCPUs translate code at the same rate), or flushing more often (probably with a perf impact larger than 3%).

It is also possible that other workloads are more sensitive to the use of huge pages for the TB cache, but I couldn't think of any. Besides, the bootup+shutdown test is very easy to run =)

So based on the above numbers, those are my thoughts.

Thanks,

		Emilio

PS. You can see both charts side-by-side here: https://imgur.com/a/cu6g0pA