On 5/18/20 8:18 PM, Alex Bennée wrote:
>
> Claudio Fontana <cfont...@suse.de> writes:
>
>> Hello all,
>>
>> my intention would be to develop per-target, dynamically loadable
>> accelerator modules.
>>
>> This would allow distributing a single QEMU base binary, and then
>> providing accelerators as optional additional binary packages to
>> install, with the first separate optional package being TCG.
>>
>> CONFIG_TCG would become 'm' as a result, but then so would CONFIG_KVM,
>> CONFIG_HAX, CONFIG_WHPX, CONFIG_HVF.
>>
>> Here are some elements that seem to be needed:
>>
>> 1 - The CONFIG_MODULE part of the build system would need some
>>     extension to add per-target modules. I have some tentative results
>>     that show this is possible (but a bit clunky atm).
>>     There is some instability in the existing Makefile infrastructure
>>     for modules that shows up when trying to extend it.
>>
>> 2 - new "accelerator drivers" seem to be needed, either in addition to,
>>     or as additional functionality inside, the current AccelState.
>>
>> 3 - for target/i386 in particular, some refactoring work is needed to
>>     split out even more unrelated bits and pieces.
>>     Dependencies of the hw/i386 machine code on accelerator-specific
>>     code are also painful.
>
> There are a couple of per-arch hacks in the main TCG cpu loops it would
> be good to excise from the code.
>
>> 4 - CPU Arch Classes could be extended with per-accelerator methods.
>>     Initial fooling around shows this should probably work.
>>     One alternative would be to play with the dynamic linker (weak
>>     symbols, creative use of dlsym, etc.), but I have not sorted out
>>     the details of this option.
>>
>> 5 - cputlb, in particular tlb_flush and friends, is a separate problem,
>>     since it is not part of the CPU class. Should it be?
>
> tlb_flush and friends are TCG implementation details for softmmu that
> ensure a lookup for any address will always fall back to a
> guest-specific tlb_fill to rebuild the cache. The behaviour is not
> guest-specific per se, although we do partition the table entries based
> on the guest size.
>
> Perhaps we can make it more dynamic, but it would have to ensure both
> the slow path and the fast path are using the same mask and shifts when
> walking the table.
>
>> 6 - a pain point is the fact that in USER mode the accel class is not
>>     applied, which causes a lot of uncleanliness all around
>>     (tcg_allowed outside of the AccelClass).
>
> The user-mode run loops are a whole separate chunk of code. I don't know
> if it is worth trying to make them pluggable, as you will only ever have
> one per linux-user binary.
>
>> 7 - I have not really thought about the Kconfig aspects because I am
>>     not super-familiar with it
>>
>> 8 - cpus.c needs some good splitting
>
> Although there are several different run loops in there, I think they
> share a lot of commonality, which is why they are bundled together. They
> all share the same backend for dealing with ioevents and generic
> pause/unpause machinery. But feel free to prove me wrong ;-)
Hi Alex,

I got my first compile, and it is currently on github; I still need to
split the series, and there is still cleanup needed.

https://github.com/hw-claudio/qemu.git branch "cpus-refactoring"

just in case you are interested in a peek.

The idea currently results in:

 cpu-throttle.c |  154 +++++++++
 cpu-timers.c   |  784 +++++++++++++++++++++++++++++++++++++++++++++
 cpus-tcg.c     |  515 ++++++++++++++++++++++++++++++
 cpus.c         | 1405 +++++----------------------------------------------------------------------------

New interfaces are in:

 include/sysemu/cpu-throttle.h |  50 +++
 include/sysemu/cpu-timers.h   |  73 +++++
 include/sysemu/cpus.h         |  44 ++-

cpu-throttle (new) is self-explanatory: it contains the cpu throttling
code that used to live in cpus.c.

cpu-timers (new) contains the icount, ticks and clock timers from cpus.c.

cpus.h adds an interface for per-accel vcpu threads:

qemu_register_start_vcpu(void (*f)(CPUState *cpu));

bool all_cpu_threads_idle(void);
bool cpu_can_run(CPUState *cpu);
void qemu_wait_io_event_common(CPUState *cpu);
void qemu_wait_io_event(CPUState *cpu);
void cpu_thread_signal_created(CPUState *cpu);
void cpu_thread_signal_destroyed(CPUState *cpu);
void cpu_handle_guest_debug(CPUState *cpu);

(I'll append a rough sketch of how an accelerator would use the
start_vcpu hook in a P.S. at the bottom of this mail.)

Very much still all WIP...

Ciao,

C

>
>> ... more things to find out and think about ...
>>
>> Overall, I think that the activity has the potential to provide
>> benefits beyond the actual goal, in the form of cleanups, leaner
>> configurations, minor fixes, maybe improving the CONFIG_MODULE
>> instabilities if any, etc.
>
> There are certainly some ugly bits we could shave down in such an
> exercise.
>
>> As an example, the first activity I would plan to submit as RFC is
>> point 8 above: the split between cpus.c and cpus-tcg.c, which results
>> in lots of TCG-specific code being removed from non-TCG builds (using
>> CONFIG_TCG).
>>
>> One thing that should be kept in check is any performance impact of
>> the changes; in particular for point 4, hot paths should probably
>> avoid going through too many pointer indirections.
>
> Maybe - certainly for TCG, if you have exited the main execution loop
> you have pretty much lost already, so I doubt it would show up much.
> Not so sure about the HW accelerators. Most of the
> performance-sensitive stuff is dealt with close to the ioctls IIRC.
>
>> Does anybody share similar goals? Any major obstacle or blocker that
>> would put the feasibility into question?
>> Any suggestion on any of this? In particular points 4 and 5 come to
>> mind, as well as some better understanding of the reasons behind 6, or
>> even suggestions on how best to do 2.
>
> For linux-user each CPU run loop is its own special snowflake, because
> pretty much every architecture has its own special set of EXCP_ exits
> to deal with various bits. There are per-arch EXCP_ flags for system
> emulation as well, but not nearly as many.
>
>> Anyway, I will continue to work on the first RFC for some smaller
>> initial steps and hopefully have something to submit soon.
>>
>> Ciao ciao,
>>
>> Claudio
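
P.S.: to make the start_vcpu hook above a bit more concrete, here is a
rough sketch of how cpus-tcg.c could use it. Only the cpus.h prototypes
listed above come from the branch; the function names and the loop body
here are placeholders for illustration, not the actual code:

#include "qemu/osdep.h"
#include "qemu/rcu.h"
#include "qemu/thread.h"
#include "hw/core/cpu.h"
#include "sysemu/cpus.h"

static void *tcg_cpu_thread_fn(void *arg)
{
    CPUState *cpu = arg;

    rcu_register_thread();
    qemu_mutex_lock_iothread();
    current_cpu = cpu;
    cpu_thread_signal_created(cpu);        /* new cpus.h interface */

    do {
        if (cpu_can_run(cpu)) {            /* new cpus.h interface */
            /* ... enter the TCG translate/execute loop here ... */
        }
        qemu_wait_io_event(cpu);           /* new cpus.h interface */
    } while (!cpu->unplug);

    cpu_thread_signal_destroyed(cpu);      /* new cpus.h interface */
    qemu_mutex_unlock_iothread();
    rcu_unregister_thread();
    return NULL;
}

static void tcg_start_vcpu_thread(CPUState *cpu)
{
    char thread_name[16];

    /* halt_cond setup etc. omitted for brevity */
    cpu->thread = g_malloc0(sizeof(QemuThread));
    snprintf(thread_name, sizeof(thread_name), "CPU %d/TCG",
             cpu->cpu_index);
    qemu_thread_create(cpu->thread, thread_name, tcg_cpu_thread_fn,
                       cpu, QEMU_THREAD_JOINABLE);
}

void tcg_cpus_register(void)
{
    /* TCG hands its per-accel thread function to the common code */
    qemu_register_start_vcpu(tcg_start_vcpu_thread);
}

The point is that cpus.c no longer needs to know anything about how TCG
runs its vcpus; each accelerator just registers its own thread function.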
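
P.P.S.: and since point 4 came up again above, the rough shape I have in
mind there is a per-accelerator ops table hung off the CPU class, along
these lines (all of these names are invented for illustration, nothing
like this exists yet):

typedef struct AccelCPUOps {
    /* filled in by whichever accel module gets selected at runtime */
    void (*cpu_realize)(CPUState *cpu, Error **errp);
    void (*synchronize_post_reset)(CPUState *cpu);
    void (*synchronize_post_init)(CPUState *cpu);
} AccelCPUOps;

/* installed once, when the accelerator is chosen */
void cpu_class_set_accel_ops(CPUClass *cc, const AccelCPUOps *ops);

Common code would then call cc->accel_ops->synchronize_post_reset(cpu)
and friends instead of the current #ifdef / kvm_enabled() scattering.
That is a single pointer indirection on what are mostly cold paths,
which is why I think the performance concern is manageable.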