Hi Igor,

Igor Mammedov, Nov 01, 2024 at 16:09:
> On Tue, 22 Oct 2024 16:16:36 +0200
> "Anthony Harivel" <ahari...@redhat.com> wrote:
>
>> Daniel P. Berrangé, Oct 22, 2024 at 15:15:
>> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
>> >> On Fri, 18 Oct 2024 13:59:34 +0100
>> >> Daniel P. Berrangé <berra...@redhat.com> wrote:
>> >>
>> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
>> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
>> >> > > "Anthony Harivel" <ahari...@redhat.com> wrote:
>> >> [...]
>> >>
>> >> > >
>> >> > > This also leads to a question, if we should account for
>> >> > > non-VCPU threads at all. Looking at real hardware, those
>> >> > > MSRs return power usage of CPUs only, and they do not
>> >> > > return consumption from auxiliary system components
>> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
>> >> > > as auxiliary components, so we probably should not
>> >> > > account for them at all when modeling the same hw feature.
>> >> > > (aka be consistent with what real hw does).
>> >> >
>> >> > I understand your POV, but I think that would be a mistake,
>> >> > and would undermine the usefulness of the feature.
>> >> >
>> >> > The deployment model has a cluster of hosts and guests, all
>> >> > belonging to the same user. The user goal is to measure host
>> >> > power consumption imposed by the guest, and dynamically adjust
>> >> > guest workloads in order to minimize power consumption of the
>> >> > host.
>> >>
>> >> For cloud use-case, host side is likely in a better position
>> >> to accomplish the task of saving power by migrating VM to
>> >> another socket/host to compact idle load. (I've found at least 1
>> >> Kubernetes tool[1], which does energy monitoring). Perhaps there
>> >> are schedulers out there that do that using its data.
>>
>> I also work for Kepler project. I use it to monitor my VM as a black
>> box and I used it inside my VM with this feature enabled.
>> Thanks to that I can optimize the workloads (DPDK application,
>> database, ...) inside my VM.
>>
>> This is the use-case in NFV deployment and I'm pretty sure this could be
>> the use-case of many others.
>>
>> >
>> > The host admin can merely shuffle workloads around, hoping that
>> > a different packing of workloads onto machines, will reduce power
>> > in some amount. You might win a few %, or low 10s of % with this
>> > if you're good at it.
>> >
>> > The guest admin can change the way their workload operates to
>> > reduce its inherent power consumption baseline. You could easily
>> > come across ways to win high 10s of % with this. That's why it
>> > is interesting to expose power consumption info to the guest
>> > admin.
>> >
>> > IOW, neither makes the other obsolete, both approaches are
>> > desirable.
>> >
>> >> > The guest workloads can impose non-negligible power consumption
>> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
>> >> > any adjustments will be working from (sometimes very) inaccurate
>> >> > data.
>> >>
>> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
>> >> would let us provide auxiliary threads consumption to guest, and
>> >> even make it more granular if necessary (incl. vhost user/out of
>> >> process device models or pass-through devices if they have PMU).
>> >> It would be better than further muddling vCPUs consumption
>> >> estimates with something that doesn't belong there.
>>
>> I'm confused about your statement. Like, every software power metering
>> tool out there is using RAPL (Kepler, Scaphandre, PowerMon, etc.), and
>> custom sensors would be better than what everyone is using?
>
> RAPL is used to measure CPU/DRAM/maybe GPU domains.
> see my other reply to Daniel RAPL + aux
> (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html)
> My point wrt RAPL is: CPU domain on host and inside guest
> should be doing the same thing, i.e.
> report only package/core
> consumption of virtual CPU and nothing else (non vCPU induced load
> should not be included in CPU domain).
>
> For non vCPU consumption, we should do the same as bare-metal,
> i.e. add power sensors where necessary. As a minimum we can add
> a system power meter sensor, which could account for total
> energy draw (and that can include not only QEMU aux threads,
> but also other related processes (aka process handling dpdk NIC,
> or other vhost user backend)).
> Individual per device sensors are also a possibility in the future
> (i.e. per NIC) if we can find a suitable sensor on host to derive
> guest value.
>
> [...]
>
>> Adding RAPL inside VM makes total sense because you can use tools that
>> are already out in the market.
> no disagreement here.
>
> Given the topic is relatively new, the tooling mostly concentrates on
> RAPL as the most available sensor. But some tools can pull energy values
> from other sources, we surely can teach them to pull values from
> a sensor(s) we'd want to add to QEMU (i.e. for an easy start borrow
> sensor handling from lm_sensors). I'd pick acpi power meter as
> a possible candidate since it is guest OS agnostic and
> we can attach it to anything in machine tree.
>
>> > There's a tradeoff here in that info directly associated with
>> > backends threads, is effectively exposing private QEMU impl
>> > details as public ABI. IOW, we don't want too fine granularity
>> > here, we need it abstracted sufficiently, that different
>> > backend choices for a given don't change what sensors are
>> > exposed.
>> >
>> > I also wonder how existing power monitoring applications
>> > would consume such custom sensors - is there sufficient
>> > standardization in this area that we're not inventing
>> > something totally QEMU specific ?
>> >
>> >> > IOW, I think it is right to include non-vCPU threads usage in
>> >> > the reported info, as it is still fundamentally part of the
>> >> > load that the guest imposes on host pCPUs it is permitted to
>> >> > run on.
>> >>
>> >>
>> >> From what I've read, process energy usage done via RAPL is not
>> >> exactly accurate. But there are monitoring tools out there that
>> >> use RAPL and other sources to make energy consumption monitoring
>> >> more reliable.
>> >>
>> >> Reinventing that wheel and pulling all of the nuances of process
>> >> power monitoring inside of QEMU process, needlessly complicates it.
>> >> Maybe we should reuse one of existing tools and channel its data
>> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...).
>> >
>> > Note, this feature is already released in QEMU 9.1.0.
>> >
>> >> Implementing RAPL in pure form though looks fine to me,
>> >> so the same tools could use it the same way as on the host
>> >> if needed without VM specific quirks.
>> >
>> > IMHO the so called "pure" form is misleading to applications, unless
>> > we first provided some other practical way to expose the data that
>> > we would be throwing away from RAPL.
>> >
>>
>> The other possibility that I've thought of is using a 3rd party tool to
>> give maybe more "accurate value" to QEMU.
>> For example, Kepler could be used to give a value for each thread
>> of QEMU and so instead of calculating and using the qemu-vmsr-helper,
>> each value is transferred on request by QEMU via the UNIX socket that is
>> used today between the daemon and QEMU. It's just an idea that I have
>> and I don't know if that is acceptable for each project (QEMU and
>> Kepler) that would really solve a few issues.
>
> From QEMU point of view, it would be fine to get values from external
> process and just proxy them to guest (preferably without any massaging).
>
> Also on QEMU side, I'd suggest to split current monolith functionality
> in 2 parts: frontend (KVM MSR interface for starters) and backend object
> (created with -object CLI option) that will handle communication
> with an external daemon. That way QEMU would be able to easily change/add
> different frontend and backend options (ex: add frontend for RAPL
> with TCG accel, add backend for Kepler or other project(s)
> down the road). (it would be good to make this split even for
> qemu-vmsr-helper). (if you are interested, I can guide you wrt
> QEMU side of the question).
>
> PS:
> As for other projects we probably should ask if they are open to an idea.
> They definitely would need some patches for per thread accounting,
> and maybe for some API to talk with external users (but the latter
> might exist and it might be better for QEMU to adopt it (here QEMU
> backend object might help as translator of existing protocol to
> QEMU specific internals)).
> The point is QEMU won't have to reinvent the wheel, and other projects
> will get more exposure/user-base.
>
> On top of the projects you've already pointed out for possible
> integration, I could add pmdadenki (CCed few authors) which
> some distros are shipping/using.
>

I think you have a fair number of ideas and opinions on how to handle
RAPL in QEMU, and that's really good for improving the feature.

What I would really like is to have Paolo's opinion on all of that.
When I started working on the subject I talked to him several times and
we agreed on the current implementation. Not that I disagree with all
you said, on the contrary, but the amount of change is quite significant,
and it would be very annoying if the results of this work didn't make it
upstream because of X or Y.

Let's see if we get more opinions from the people in the loop as well.

Thanks for the feedback.

Anthony