On Tue, Jul 25, 2017 at 9:37 AM, Bruce Hoult <br...@hoult.org> wrote:
> Do you have any good estimates for how much of the execution time is
> typically spent in instruction decode?
>
> RISC-V qemu is twice as fast as ARM or Aarch64 qemu, so it's doing
> something right!
>
> (I suspect it's probably mostly the lack of needing to emulate condition
> codes)
The last time I tried to profile qemu (system mode, running the Go bootstrap, I think), I didn't get very far: there was no JIT map and I couldn't get frame pointers working. But as far as I got, none of the translate functions showed up. Most time was spent in translated code, in the trampolines for entering and exiting translated code, in TLB maintenance, and in the code that chooses which basic block to run next.

Making the instruction decoder a bit slower is therefore not likely to have much effect (but do take before-and-after measurements to be sure). Significant wins would come from reducing the number of switches between translations (e.g. by translating larger units: all the code on a page at once, whole functions, or traces), from making switches between translations cheaper (e.g. with inline caches), or from reducing the cost of access translation (e.g. two accesses relative to the same base register in the same translation often hit the same virtual page and can share translation effort; more speculatively, by using the host's translation hardware).

(I am willing to discuss any of these further OFF-LIST ONLY.)

-s