First, sorry for the late response; I was away for a few days.

Peter Maydell writes:

> On 18 September 2017 at 18:09, Lluís Vilanova <vilan...@ac.upc.edu> wrote:
>> Peter Maydell writes:
>>> It's also exposing internal QEMU implementation detail.
>>> What if in future we decide to switch from our current
>>> setup to always interpreting guest instructions as a
>>> first pass with JITting done only in the background for
>>> hot code?
>>
>> TCI still has a separation of translation-time (translate.c) and
>> execution-time (interpreting the TCG opcodes), and I don't think
>> that's gonna go away anytime soon.

> I didn't mean TCI, which is nothing like what you'd use for
> this if you did it (TCI is slower than just JITting.)

My point is that even on the cold path you need to decode a guest instruction
(equivalent to translating) and emulate it on the spot (equivalent to
executing).

>> Even if it did, I think there still will be a translation/execution
>> separation easy enough to hook into (even if it's a "fake" one for
>> the cold-path interpreted instructions).

> But what would it mean? You don't have basic blocks any more.

Every instruction emulated on the spot can be seen as a newly translated block
(of one instruction only), which is executed immediately after.

>>> Sticking to instrumentation events that correspond exactly to guest
>>> execution events means they won't break or expose internals.
>>
>> It also means we won't be able to "conditionally" instrument
>> instructions (e.g., based on their opcode, address range, etc.).

> You can still do that, it's just less efficient (your
> condition-check happens in the callout to the instrumentation
> plugin). We can add "filter" options later if we need them
> (which I would rather do than have translate-time callbacks).

Before answering, a short summary of when knowing about translate/execute makes
a difference:

* Record some information only once when an instruction is translated, instead
  of recording it on every executed instruction (e.g., a study of opcode
  distribution, which you can get from a file of per-TB opcodes - generated at
  translation time - and a list of executed TBs - generated at execution
  time -). The translate/execute separation makes this run faster *and*
  produces much smaller files with the recorded info.

  Another typical example that benefits from this is writing a simulator that
  feeds off a stream of instruction information (a common reason why people
  want to trace memory accesses and information about executed instructions).

* Conditionally instrumenting instructions.

Adding filtering to the instrumentation API would only solve the second point,
but not the first one.

Now, do we need/want to support the first point?

>> Of course we can add the translation/execution differentiation later
>> if we find it necessary for performance, but I would rather avoid
>> leaving "historical" instrumentation points behind on the API.
>>
>> What are the use-cases you're aiming for?

> * I want to be able to point the small stream of people who come
> into qemu-devel asking "how do I trace all my guest's memory
> accesses" at a clean API for it.

> * I want to be able to have less ugly and confusing tracing
> than our current -d output (and perhaps emit tracing in formats
> that other analysis tools want as input)

> * I want to keep this initial tracing API simple enough that
> we can agree on it and get a first working useful version.

Fair enough. I know it's not exactly what we're discussing, but the plot in [1]
compares a few different ways to trace memory accesses on SPEC benchmarks:

* First bar is using Intel's tool called PIN [2].
* Second is calling into an instrumentation function on every executed memory
  access in QEMU.
* Third is embedding the hot path of writing the memory access info to an array
  into the TCG opcode stream (more or less equivalent to supporting filtering;
  when the array is full, a user's callback is called - the cold path -).
* Fourth bar can be ignored.

This was running on a much older version of the instrumentation for QEMU, but I
can implement something that does the first use-case point above and some
filtering example (second use-case point) to see what the performance
difference is.

[1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw
[2] https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool


Thanks!
  Lluis
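
For illustration, the opcode-distribution example from the summary above could
be sketched roughly as follows. This is only a minimal sketch under stated
assumptions: `TraceRecorder`, `on_translate` and `on_execute` are hypothetical
names, not part of any existing or proposed QEMU API, and it assumes that every
executed TB runs all of its instructions (no early exits).

```python
from collections import Counter


class TraceRecorder:
    """Hypothetical recorder; none of these names are QEMU API."""

    def __init__(self):
        self.tb_opcodes = {}    # translation-time data: TB id -> opcodes
        self.executed_tbs = []  # execution-time data: one TB id per execution

    def on_translate(self, tb_id, opcodes):
        # Called once per translated block: record its opcodes a single time.
        self.tb_opcodes[tb_id] = list(opcodes)

    def on_execute(self, tb_id):
        # Called on every block execution: record only the small TB id.
        self.executed_tbs.append(tb_id)

    def opcode_distribution(self):
        # Offline step: weight each TB's opcodes by how often the TB ran.
        # Assumes each executed TB runs all of its instructions.
        dist = Counter()
        for tb_id, runs in Counter(self.executed_tbs).items():
            for opcode in self.tb_opcodes[tb_id]:
                dist[opcode] += runs
        return dist


rec = TraceRecorder()
rec.on_translate(0, ["mov", "add", "jmp"])
rec.on_translate(1, ["ld", "ret"])
for tb_id in (0, 0, 0, 1):
    rec.on_execute(tb_id)
print(rec.opcode_distribution())
```

The per-execution record is a single TB id regardless of how long the block is,
which is where the smaller files and the lower run-time overhead would come
from. An execute-only API would have to record every opcode on every execution
to get the same distribution.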