On Thu, Dec 12, 2013 at 1:07 PM, Xin Tong <trent.t...@gmail.com> wrote:
> see questions below.
>
> On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>>
>> trent.t...@gmail.com writes:
>>
>>> Does anyone have profiles on how much time QEMU spends in translating
>>> instructions? QEMU does not have a baseline interpreter, nor does it
>>> translate at trace granularity, so I imagine QEMU must spend quite a bit
>>> of time translating instructions.
>>
>> Not as much as you'd think. The translation stage isn't very complex and
>> blocks only get translated once (modulo exceptions and self-modifying
>> code). If you run perf on your task you should see most of the time is
>> spent in the generated code - if not please send the test case to the
>> list.
>
> I took a profile running SPEC CPU2006 403.gcc with the test input on an
> Intel Xeon machine. We spent only 44.76% of the time in the code cache
> (i.e. 1.34M ticks in the code cache), while 40.97% of the time was spent
> in qemu-system-x86_64 itself. Some of the hot functions in
> qemu-system-x86_64 are listed below.
>
> *You are right*, we do not spend much time in the translation routines.
> Instead we spend a significant amount of time in address translation
> code.
>
> CPU_CLK_UNHALTED       %  Symbol/Functions
>          1340512  100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)
>
> CPU_CLK_UNHALTED       %  Symbol/Functions
>           314655   25.64  address_space_translate_internal
>           308942   25.18  cpu_x86_exec
>           128922   10.51  ldq_phys
>            92345    7.53  cpu_x86_handle_mmu_fault
>            62456    5.09  tlb_set_page
>            49332    4.02  memory_region_is_ram
>            31055    2.53  helper_le_ldq_mmu
>            22048    1.80  memory_region_get_ram_addr
>            19223    1.57  memory_region_section_get_iotlb
>            15873    1.29  tcg_optimize
>            14526    1.18  get_page_addr_code
>            12601    1.03  memory_region_get_ram_ptr

However, being able to reuse cached blocks based on content in QEMU may be
a step towards reusing translated blocks across multiple invocations of
QEMU.
>
> Xin
>
>
>>
>> I suspect the more useful statistic would be getting a breakdown of the
>> translation blocks and seeing which ones are the most heavily used and
>> examining if QEMU has done as good a job as it can of translating them.
>>
>>> Is it possible for QEMU to obviate some of the translations by attaching a
>>> signature (e.g. a hash) to every translated basic block and trying to
>>> reuse translated basic blocks based on the signature as much as possible?
>>> Reuse can be a result of rerunning programs or of the same libraries
>>> being statically linked into programs.
>>
>> You're right, a translation cache *could* save some translation time,
>> especially if you end up translating the same program over and over
>> again. Having said that, you might find the cost of computing the
>> checksum outweighs any speed-up from skipping the translation. After all,
>> QEMU normally only needs to look at each subject instruction once.
>>
>> Using QEMU linux-user for cross building would be the obvious pain
>> point. However, as the usual use case is building for embedded platforms,
>> most users are just happy to fully utilise their 80-core build machines
>> in preference to having a farm of slow embedded processors.
>>
>>> This could end up saving some translation time.
>>
>> I think you would need to do some performance analysis and come up with
>> some numbers before you made that assumption.
>>
>> Cheers,
>>
>> --
>> Alex Bennée
>> QEMU/KVM Hacker for Linaro
>>
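
To make the "reuse blocks by signature" idea concrete, here is a minimal
Python sketch of a content-keyed cache. It is only an illustration, not how
QEMU looks up translation blocks: the names ContentKeyedCache, signature,
lookup and fake_translate are all hypothetical, SHA-256 is an arbitrary
choice of hash, and the cpu_flags argument stands in for whatever CPU state
would also have to be part of the key. Note that every lookup pays for a
hash over the guest bytes, which is exactly the cost that might outweigh
the translation time saved.

```python
import hashlib
from typing import Callable, Dict


class ContentKeyedCache:
    """Toy cache mapping a hash of guest code bytes to translated output."""

    def __init__(self, translate: Callable[[bytes], bytes]) -> None:
        self._translate = translate  # stand-in for a real translator
        self._blocks: Dict[bytes, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def signature(guest_code: bytes, cpu_flags: int = 0) -> bytes:
        # Hash the block contents plus any state that affects translation
        # (assumption: that state fits in an unsigned 64-bit integer).
        h = hashlib.sha256()
        h.update(cpu_flags.to_bytes(8, "little"))
        h.update(guest_code)
        return h.digest()

    def lookup(self, guest_code: bytes, cpu_flags: int = 0) -> bytes:
        key = self.signature(guest_code, cpu_flags)
        cached = self._blocks.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # Miss: translate once and remember the result under its signature.
        self.misses += 1
        translated = self._translate(guest_code)
        self._blocks[key] = translated
        return translated


def fake_translate(guest_code: bytes) -> bytes:
    # Placeholder "translation" that just reverses the bytes.
    return guest_code[::-1]


if __name__ == "__main__":
    cache = ContentKeyedCache(fake_translate)
    block = bytes.fromhex("4889e5c3")
    cache.lookup(block)
    cache.lookup(block)  # same contents, so this one is served from the cache
    print(cache.hits, cache.misses)  # prints "1 1"
```

Reusing blocks across separate invocations would additionally require
writing the dictionary to disk and validating it on load, which this sketch
does not attempt.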