* Emilio G. Cota (c...@braap.org) wrote:
> On Fri, Mar 10, 2017 at 11:45:33 +0000, Dr. David Alan Gilbert wrote:
> > * Emilio G. Cota (c...@braap.org) wrote:
> > > https://github.com/cota/dbt-bench
> > > I'm using NBench because (1) it's just a few files and they take
> > > very little time to run (~5min per QEMU version, if performance
> > > on the host machine is stable), (2) AFAICT its sources are in the
> > > public domain (whereas SPEC's sources cannot be redistributed),
> > > and (3) with NBench I get results similar to SPEC's.
> >
> > Does NBench include anything with lots of small processes, or a large
> > chunk of code? Using benchmarks with small code tends to skew DBT
> > optimisations towards very heavy block optimisation that doesn't work
> > in real applications, where the cost of translation can hurt if it's
> > too high.
>
> Yes this is a valid point.
>
> I haven't looked at the NBench code in detail, but I'd expect all programs
> in the suite to be small and have hotspots (this is consistent with
> the fact that performance doesn't change even if the TB hash table
> isn't used, i.e. the loops are small enough to remain in tb_jmp_cache.)
> IOW, we'd be mostly measuring the quality of the translated code,
> not the translation overhead.
>
> It seems that a good benchmark to take translation overhead into account
> would be gcc/perlbench from SPEC (see [1]; ~20% of exec time is spent
> on translation). Unfortunately, neither of them can be redistributed.
>
> I'll consider other options. For instance, I looked today at using golang's
> compilation tests, but they crash under qemu-user. I'll keep looking
> at other options -- the requirement is to have something that is easy
> to build (i.e. gcc is not an option) and that runs fast.

Yes, it needs to be self-contained but large enough to be interesting.
Isn't SPEC's perlbench just a variant of a standard free benchmark
that can be used? (Select alternative preferred language).

> A hack that one can do to measure code translation as opposed to execution
> is to disable caching with a 2-liner to avoid insertions to the TB hash
> table and tb_jmp_cache. The problem is that then we basically just
> measure code translation performance, which isn't really realistic
> either.
>
> In any case, note that most efforts I've seen to compile very good code
> (with QEMU or other cross-ISA DBT), do some sort of profiling so that
> only hot blocks are optimized -- see for example [1] and [2].

Right, and often there's a trade-off between an interpret step and one or
more translate/optimisation steps, and you have to pick thresholds etc.

Dave

> [1] "Characterization of Dynamic Binary Translation Overhead".
> Edson Borin and Youfeng Wu. IISWC 2009.
> http://amas-bt.cs.virginia.edu/2008proceedings/AmasBT2008.pdf#page=4
>
> [2] "HQEMU: a multi-threaded and retargetable dynamic binary translator
> on multicores".
> Ding-Yong Hong, Chun-Chen Hsu, Pen-Chung Yew, Jan-Jan Wu, Wei-Chung Hsu,
> Pangfeng Liu, Chien-Min Wang and Yeh-Ching Chung. CGO 2012.
> http://www.iis.sinica.edu.tw/papers/dyhong/18239-F.pdf
>
> > > Here are linux-user performance numbers from v1.0 to v2.8 (higher
> > > is better):
> > >
> > > [ASCII plot: x86_64 NBench Integer Performance.
> > > Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz.
> > > x axis: QEMU version, v1.0 to v2.8.0; y axis ticks from 18 to 36.]
> >
> > Nice, there was someone on list complaining about 2.6 being slower for them.
> > >
> > > [ASCII plot: x86_64 NBench Floating Point Performance.
> > > Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz.
> > > x axis: QEMU version, v1.0 to v2.8.0; y axis ticks from 1.74 to 1.88.]
> >
> > I'm assuming the dips are where QEMU fixed something and cared about corner
> > cases/accuracy?
>
> It'd be hard to say why the numbers vary across versions without running
> a profiler and git bisect. I only know the reason for v2.7, where most if not
> all of the improvement is due to the removal of tb_lock() when executing
> code in qemu-user thanks to the QHT work.
>
> E.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
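
To make the interpret/translate threshold trade-off discussed above a bit more concrete, here is a toy cost model in Python. It is only a sketch: the function name `total_cost`, the unit costs and the two synthetic workloads are made-up illustration values, not measurements from QEMU, NBench or SPEC, and it is not meant to describe how QEMU itself decides what to translate.

```python
# Toy cost model for count-based tiering in a dynamic binary translator.
# All numbers are hypothetical; nothing here is measured from QEMU.

def total_cost(exec_counts, interp_cost, translate_cost, native_cost, threshold):
    """Return the total cost of running every block in exec_counts.

    exec_counts: how many times each block executes.
    A block is interpreted for its first `threshold` executions; if it
    executes more often than that, it is translated once and the
    remaining executions run as translated code.
    """
    total = 0
    for n in exec_counts:
        interpreted = min(n, threshold)
        total += interpreted * interp_cost
        if n > threshold:
            total += translate_cost + (n - threshold) * native_cost
    return total


# Hypothetical unit costs: interpreting is 10x slower than translated
# code, and translating a block costs as much as 1000 native executions.
INTERP, TRANSLATE, NATIVE = 10, 1000, 1

# Two synthetic workloads: a few hot loops vs. lots of rarely run code.
HOT_LOOPS = [1_000_000] * 10
COLD_CODE = [5] * 10_000

if __name__ == "__main__":
    for name, workload in (("hot loops", HOT_LOOPS), ("cold code", COLD_CODE)):
        for threshold in (0, 50):
            cost = total_cost(workload, INTERP, TRANSLATE, NATIVE, threshold)
            print(f"{name:9s} threshold={threshold:2d} cost={cost}")
```

With these made-up costs, translating eagerly (threshold 0) is marginally cheaper for the hot-loop workload, while a threshold of 50 is roughly 20 times cheaper for the cold-code workload. A benchmark made only of small hot loops would not show that second effect.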