On 1/19/19 2:04 AM, Alex Bennée wrote:
>
> Emilio G. Cota <c...@braap.org> writes:
>
>> As the following experiments show, this series is a net perf gain,
>> particularly for memory-heavy workloads. Experiments are run on an
>> Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz.
>>
>> 1. System boot + shutdown, debian aarch64:
>>
>> - Before (v3.1.0):
>>
>>  Performance counter stats for './die.sh v3.1.0' (10 runs):
>>
>>       9019.797015      task-clock (msec)  #    0.993 CPUs utilized       ( +- 0.23% )
>>    29,910,312,379      cycles             #    3.316 GHz                 ( +- 0.14% )
>>    54,699,252,014      instructions       #    1.83  insn per cycle      ( +- 0.08% )
>>    10,061,951,686      branches           # 1115.541 M/sec               ( +- 0.08% )
>>       172,966,530      branch-misses      #    1.72% of all branches     ( +- 0.07% )
>>
>>       9.084039051 seconds time elapsed                                   ( +- 0.23% )
>>
>> - After:
>>
>>  Performance counter stats for './die.sh tlb-dyn-v5' (10 runs):
>>
>>       8624.084842      task-clock (msec)  #    0.993 CPUs utilized       ( +- 0.23% )
>>    28,556,123,404      cycles             #    3.311 GHz                 ( +- 0.13% )
>>    51,755,089,512      instructions       #    1.81  insn per cycle      ( +- 0.05% )
>>     9,526,513,946      branches           # 1104.641 M/sec               ( +- 0.05% )
>>       166,578,509      branch-misses      #    1.75% of all branches     ( +- 0.19% )
>>
>>       8.680540350 seconds time elapsed                                   ( +- 0.24% )
>>
>> That is, a 4.4% perf increase.
>>
>> 2. System boot + shutdown, ubuntu 18.04 x86_64:
>>
>> - Before (v3.1.0):
>>
>>      56100.574751      task-clock (msec)  #    1.016 CPUs utilized       ( +-  4.81% )
>>   200,745,466,128      cycles             #    3.578 GHz                 ( +-  5.24% )
>>   431,949,100,608      instructions       #    2.15  insn per cycle      ( +-  5.65% )
>>    77,502,383,330      branches           # 1381.490 M/sec               ( +-  6.18% )
>>       844,681,191      branch-misses      #    1.09% of all branches     ( +-  3.82% )
>>
>>      55.221556378 seconds time elapsed                                   ( +-  5.01% )
>>
>> - After:
>>
>>      56603.419540      task-clock (msec)  #    1.019 CPUs utilized       ( +- 10.19% )
>>   202,217,930,479      cycles             #    3.573 GHz                 ( +- 10.69% )
>>   439,336,291,626      instructions       #    2.17  insn per cycle      ( +- 14.14% )
>>    80,538,357,447      branches           # 1422.853 M/sec               ( +- 16.09% )
>>       776,321,622      branch-misses      #    0.96% of all branches     ( +-  3.77% )
>>
>>      55.549661409 seconds time elapsed                                   ( +- 10.44% )
>>
>> No improvement (within noise range). Note that for this workload,
>> increasing the time window too much can lead to perf degradation,
>> since it flushes the TLB *very* frequently.
>
> I would expect this to be fairly minimal in the amount of memory that is
> retouched. We spend a bunch of time paging things in just to drop
> everything and die. However, heavy memory operations like my build stress
> test do see a performance boost.
>
> Tested-by: Alex Bennée <alex.ben...@linaro.org>
> Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
>
> Do you have access to any aarch64 hardware? It would be nice to see if
> we could support it there as well.
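[Editor's note: the "time window" above refers to the series' rate-based resizing: at flush time, recent TLB occupancy decides whether the table grows, shrinks, or stays put. Below is a minimal sketch of that heuristic in C. All names, sizes, and thresholds (TLB_MIN_ENTRIES, GROW_THRESHOLD, WINDOW_NS, etc.) are hypothetical illustrations, not the series' actual code, which lives in accel/tcg/cputlb.c.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds and thresholds, for illustration only. */
#define TLB_MIN_ENTRIES   (1 << 6)              /* 64 entries        */
#define TLB_MAX_ENTRIES   (1 << 16)             /* 64K entries       */
#define GROW_THRESHOLD    70                    /* % full at flush   */
#define SHRINK_THRESHOLD  30                    /* % peak use        */
#define WINDOW_NS         (1000 * 1000 * 1000)  /* 1s window         */

typedef struct TLBDesc {
    size_t  n_entries;        /* current size, always a power of 2 */
    size_t  n_used;           /* entries touched since last flush  */
    size_t  window_max_used;  /* peak usage inside the window      */
    int64_t window_begin_ns;  /* when the current window started   */
} TLBDesc;

/* Called on every TLB flush: decide the size of the next table. */
static size_t tlb_new_size(TLBDesc *d, int64_t now_ns)
{
    size_t use_pct = d->n_used * 100 / d->n_entries;
    size_t new_size = d->n_entries;

    if (d->n_used > d->window_max_used) {
        d->window_max_used = d->n_used;
    }

    if (use_pct > GROW_THRESHOLD && new_size < TLB_MAX_ENTRIES) {
        /* Densely used: double right away so refills stop thrashing. */
        new_size <<= 1;
        d->window_begin_ns = now_ns;
        d->window_max_used = 0;
    } else if (now_ns - d->window_begin_ns >= WINDOW_NS) {
        /* Window expired: shrink only if usage stayed low for the
         * whole window, so one quiet flush doesn't throw away a
         * hot table. */
        size_t peak_pct = d->window_max_used * 100 / d->n_entries;
        if (peak_pct < SHRINK_THRESHOLD && new_size > TLB_MIN_ENTRIES) {
            new_size >>= 1;
        }
        d->window_begin_ns = now_ns;
        d->window_max_used = 0;
    }

    d->n_used = 0;
    return new_size;
}
```

The degradation Emilio mentions falls out of the window length: a longer WINDOW_NS keeps the table large for longer, and a workload that flushes very frequently (like this boot/shutdown cycle) then pays to clear that larger table on every flush.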
I've already done some porting to other backends. You should be able to
cherry-pick from the cputlb-resize branch of
https://github.com/rth7680/qemu.git, as I don't think the backend API has
changed since v6. (Most of my feedback that went into v7 was due to issues
I encountered porting to arm32.)

r~
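[Editor's note: for anyone attempting such a port, the fetch-and-pick workflow would look roughly like the sketch below. The remote name "rth" is arbitrary and "<commit>" is a placeholder; which commits apply to your backend is up to you.]

```sh
# Add Richard's tree and fetch its branches
git remote add rth https://github.com/rth7680/qemu.git
git fetch rth

# Inspect the branch and pick the commits relevant to your backend
git log --oneline master..rth/cputlb-resize
git cherry-pick <commit>
```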