On Mon, 10 Mar 2025, Pierrick Bouvier wrote:
On 3/10/25 12:40, BALATON Zoltan wrote:
On Mon, 10 Mar 2025, Pierrick Bouvier wrote:
On 3/10/25 09:28, Pierrick Bouvier wrote:
Hi Zoltan,

On 3/10/25 06:23, BALATON Zoltan wrote:
On Sun, 9 Mar 2025, Pierrick Bouvier wrote:
The main goal of this series is to be able to call any memory ld/st
function
from code that is *not* target dependent.

Why is that needed?


this series belongs to the "single binary" topic, where we are trying to
build a single QEMU binary with all architectures embedded.

Yes, I get it now; I just forgot, as this wasn't mentioned, so the goal
wasn't obvious.


The more I work on this topic, the more I realize we're missing a clear and concise document (wiki page, or anything that can be edited easily - not email) explaining this to other developers, which we could share as a link and enhance based on the questions asked.

Maybe you can start collecting a FAQ on a wiki page so you don't have to answer the same questions multiple times. I think most people are aware of this, though they may just not associate a series with it if it's not mentioned in the description.

To achieve that, we need to have every single compilation unit compiled
only once, to be able to link a binary without any symbol conflict.

A consequence of that is that target-specific code (in terms of code relying
on target-specific macros) needs to be converted to common code that checks
properties of the target at runtime. We are tackling various places in the
QEMU codebase at the same time, which can be confusing for community members.
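
To make that concrete, here is a minimal sketch of the kind of conversion involved (the helper names are hypothetical, not the exact functions touched by this series):

    /* Sketch only - helper names are illustrative.  bswap32() is from
     * "qemu/bswap.h"; target_words_bigendian() is the existing runtime
     * query for the target's byte order. */

    /* Before: one copy of this CU is compiled per target binary, and the
     * behaviour is chosen at compile time by a target macro. */
    uint32_t ldl_guest_compile_time(uint32_t raw)
    {
    #ifdef TARGET_BIG_ENDIAN
        return bswap32(raw);
    #else
        return raw;
    #endif
    }

    /* After: a single common CU; the same decision becomes a runtime branch. */
    uint32_t ldl_guest_runtime(uint32_t raw)
    {
        return target_words_bigendian() ? bswap32(raw) : raw;
    }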

Mentioning this single binary in related series may help remind readers
of the context.


I'll make sure to mention this "name" in the title of the next series, thanks!

This series takes care of system memory related functions and the associated
compilation units in system/.

As a positive side effect, we can
turn related system compilation units into common code.

Are there any negative side effects? In particular, have you done any
performance benchmarking to see if this causes a measurable slowdown?
Such as with the STREAM benchmark:
https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure

Maybe it would be good to have some performance tests, similar to the
functional tests, that could be run like the CI tests to detect such
performance changes. People report that QEMU is getting slower and slower with each release. Maybe it could be a GSoC project to make such tests, but
maybe we're too late for that.


I agree with you, and it's something we have mentioned during our
"internal" conversations. Testing performance with existing functional
tests would already be a good first step. However, given the poor
reliability we have on our CI runners, I think it's a bit doomed.

Ideally, every QEMU release cycle should have a performance measurement
window to detect potential sources of regressions.

Maybe instead of aiming for full CI-like performance testing, something
simpler might do: a few tests that each exercise one aspect, such as STREAM
for memory access, copying a file from network and/or disk for I/O, and mp3
encoding with lame, for example, which is supposed to test floating point
and SIMD. It could be made a bootable image that just runs the test and
reports a number (I did that before for qemu-system-ppc when we wanted to
test an issue where it ran slower on some hosts). Such tests could be run by
somebody making changes, so they could call them before and after their
patch to quickly check if there's anything to improve. This may be less
thorough than full performance testing, but it still gives some insight and
is better than not testing anything for performance.

I'm bringing this topic up to try to keep awareness of this, so QEMU can
remain true to its name. (Although I'm not sure whether the Q in the name
originally stood for the time it took to write or for its performance, it's
hopefully still a goal to keep it fast.)


You do well to remind us of that, but as always, the problem is that "run by somebody" is not an enforceable process.

To answer your specific question, I am trying first to get a review
of the approach taken. We can always optimize in the next version of the
series, in case we identify that introducing a branch for every
memory-related function call is a big deal.

I'm not sure we can always optimise after the fact, so sometimes it can be
necessary to take performance into consideration while designing changes.


In the context of the series related to the single binary, we mostly introduce a few branches in various spots to do a runtime check. As Richard mentioned in this series, we can keep target code exactly as it is.

In all cases, transforming code that relies on compile-time
optimization/dead-code elimination through defines into runtime checks
will *always* have an impact,

Yes, that's why it would be good to know how much impact that is.

even though it should be minimal in most cases.

Hopefully, but how do we know if we don't even test for it?


In the case of this series, I usually do a local test booting (automatically) an x64 Debian stable VM, which powers itself off as part of its init.

With and without this series, the variation is below the average variation I see between two runs (<1 sec, for a total of 40 seconds), so the impact is literally invisible.

That's good to hear. Some overhead that is unavoidable is OK; I just hope we can avoid what is avoidable and try to do something about anything that would have a noticeable performance penalty. If you're already aware of that and doing that, then that's all I wanted to say, nothing new.

But the maintenance and compilation-time benefits, as well as
the perspectives it opens (single binary, heterogeneous emulation, using
QEMU as a library), are worth it IMHO.

I'm not so sure about that. Heterogeneous emulation sounds interesting, but
is it needed most of the time? Using QEMU as a library also may not be
common, and is limited by licensing. The single binary would simplify
packaging, but then this binary may get huge, so it's slower to load, may
take more resources to run and more time to compile; and if somebody only
needs one architecture, why should they include all of the others, wait for
them to compile, and use up a lot of space on disk? So in other words, while
these are interesting and good goals, could they be achieved while keeping
the current way of building a single-arch binary, as opposed to a single
binary with multiple archs, and without throwing out the optimisations a
single-arch binary can use? Which one is better may depend on the use case,
so if possible it would be better to allow both: keep what we have and add
a multi-arch binary on top, rather than replacing the current way completely.


Thanks, it's definitely interesting to hear the concerns about this, so we can address them and find the best and most minimal solution to achieve the desired goal.

I'll answer point by point.

QEMU as a library: that's what Unicorn is (https://www.unicorn-engine.org/docs/beyond_qemu.html), which is used by a lot of researchers. Talking frequently with some of them, they would be happy to have such a library directly in upstream QEMU, so it can benefit from all the enhancements made to TCG. It's mostly a use case for security researchers/engineers, but definitely a valid one. Just look at the list of QEMU downstream forks focused on that. Combining this with plugins would be amazing, and would only grow our list of users.

For the heterogeneous scenario, yes, it's not the most common case. But we *must*, in terms of the QEMU binary, be able to have a single binary first. By that, I mean the need is to be able to link a binary with several architectures present, without any symbol conflict.

OK, the Unicorn engine explains it, and it needs multiple targets in a single library (which is maybe the real goal rather than a single binary here, and that only needs the targets, not all the devices). By the way, I think multi-arch is what they really mean on that beyond_qemu.html page above under Thread-safety.

The other approach possible is to rename many functions throughout the QEMU codebase by adding a target prefix everywhere, which would be ugly and endless. That's why we are currently using the pragmatic approach of removing duplicated compilation units. As well, we can do a lot of header cleanup on the way (removing useless dependencies), which is good for everyone.
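
As a hypothetical illustration (with simplified prototypes, not the real ones), the rejected prefixing alternative would look like this for every conflicting symbol:

    /* Rejected alternative: every symbol defined in more than one target
     * binary would need a per-target name before several targets could be
     * linked into one binary. */
    uint64_t x86_64_address_space_ldq(void *as, uint64_t addr);
    uint64_t aarch64_address_space_ldq(void *as, uint64_t addr);
    /* ...and so on for every other target and every conflicting function. */

    /* Approach taken instead: one target-independent definition, in a
     * compilation unit that is compiled exactly once. */
    uint64_t address_space_ldq(void *as, uint64_t addr);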

For compilation times, it will only speed things up, because if you have only specific targets, files that are not needed won't be compiled/linked. For a multi-target setup, it's only a speedup (with all targets, it would be a drop from 9000+ CUs to around 4000+). Less disk space as well, most notable in debug builds. Also, having files compiled only once allows code indexing tools (clangd for instance) to be used reliably, instead of picking a random CU setting based on one target. Finally, having a single binary would mean it's easy to use LTO (or at least distros would use it easily), and get the same or better performance than what we have today.

The "current" way, with several binaries, can be kept forever if people

As I said, I think that would be needed, as there are valid use cases for both.

want. But it's not feasible to keep headers and CUs compatible with both modes. It would mean a lot of code duplication, and that is really not desirable IMHO. So we need to do those system-wide changes and convince the community it's good progress for everyone.

It would be nice to keep optimisations where possible, and it seems that might sometimes be possible, so just take that into consideration as well, not only the one goal.

Kudos to Philippe who has been doing this long and tedious work for several years now, and I hope that with some fresh eyes/blood, it can be completed soon.

Absolutely, and I did not mean to say not to do it; I just added another viewpoint for consideration.

Regards,
BALATON Zoltan
