On Tue, 22 Oct 2024 14:15:44 +0100 Daniel P. Berrangé <berra...@redhat.com> wrote:
> On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> > On Fri, 18 Oct 2024 13:59:34 +0100
> > Daniel P. Berrangé <berra...@redhat.com> wrote:
> >
> > > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> > > > On Wed, 16 Oct 2024 14:56:39 +0200
> > > > "Anthony Harivel" <ahari...@redhat.com> wrote:
> > [...]
> > > >
> > > > This also leads to a question, if we should account for
> > > > non VCPU threads at all. Looking at real hardware, those
> > > > MSRs return power usage of CPUs only, and they do not
> > > > return consumption from auxiliary system components
> > > > (io/memory/...). One can consider non VCPU threads in QEMU
> > > > as auxiliary components, so we probably should not
> > > > account for them at all when modeling the same hw feature.
> > > > (aka be consistent with what real hw does).
> > >
> > > I understand your POV, but I think that would be a mistake,
> > > and would undermine the usefulness of the feature.
> > >
> > > The deployment model has a cluster of hosts and guests, all
> > > belonging to the same user. The user goal is to measure host
> > > power consumption imposed by the guest, and dynamically adjust
> > > guest workloads in order to minimize power consumption of the
> > > host.
> >
> > For cloud use-case, host side is likely in a better position
> > to accomplish the task of saving power by migrating VM to
> > another socket/host to compact idle load. (I've found at least 1
> > Kubernetes tool[1], which does energy monitoring). Perhaps there
> > are schedulers out there that do that using its data.
>
> The host admin can merely shuffle workloads around, hoping that
> a different packing of workloads onto machines, will reduce power
> in some amount. You might win a few %, or low 10s of % with this
> if you're good at it.

Package-level savings probably won't make much of a dent (older hw,
less impact), but vacating and powering down a host is a bit of a
different story (that was the case in my home lab, where I was trying
to minimize idle consumption of 24/7 systems). Even then, switching to
newer hardware might eventually reach the point of diminishing returns.

> The guest admin can change the way their workload operates to
> reduce its inherent power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.

Looking at discussions around Intel's hybrid CPUs, I got the impression
that neither userspace nor the kernel has enough energy consumption
info to make decent scheduling decisions, and _no one really wishes to
do scheduling manually_ to begin with. That's where Intel's CPUs with
Intel Thread Director (ITD) come into the picture, to help the kernel
bin tasks based on efficiency figures (since the CPU knows exactly how
many resources it is using). But that's relatively new, and whether
such CPUs will stick is still an open question (it makes sense for the
mobile market, but for other applications I'd guess time will tell).

> IOW, neither makes the other obsolete, both approaches are
> desirable.

No argument here.

> > > The guest workloads can impose non-negligible power consumption
> > > loads on non-vCPU threads in QEMU. Without that accounted for,
> > > any adjustments will be working from (sometimes very) inaccurate
> > > data.
> >
> > Perhaps adding one or several energy sensors (ex: some i2c ones),
> > would let us provide auxiliary threads consumption to guest, and
> > even make it more granular if necessary (incl. vhost user/out of
> > process device models or pass-through devices if they have PMU).
> > It would be better than further muddling vCPUs consumption
> > estimates with something that doesn't belong there.
>
> There's a tradeoff here in that info directly associated with
> backends threads, is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here, we need it abstracted sufficiently, that different
> backend choices for a given don't change what sensors are
> exposed.
>
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this area that we're not inventing
> something totally QEMU specific ?

We can expose them as ACPI power meter devices, to make it abstract
for the guest OS (i.e. the guest would only need a standard driver for
it), or alternatively model some real i2c sensors. But yes, it is
something that should be explored, so that it works with and is
supported by common tools or the tool of choice.

>
> > > IOW, I think it is right to include non-vCPU threads usage in
> > > the reported info, as it is still fundamentally part of the
> > > load that the guest imposes on host pCPUs it is permitted to
> > > run on.
> >
> >
> > From what I've read, process energy usage done via RAPL is not
> > exactly accurate. But there are monitoring tools out there that
> > use RAPL and other sources to make energy consumption monitoring
> > more reliable.
> >
> > Reinventing that wheel and pulling all of the nuances of process
> > power monitoring inside of QEMU process, needlessly complicates it.
> > Maybe we should reuse one of existing tools and channel its data
> > through appropriate QEMU channels (RAPL/emulated PMU counters/...).
>
> Note, this feature is already released in QEMU 9.1.0.

That doesn't preclude us from improving impl. details though, i.e.
what tasks QEMU does and what is up to the backend (external daemon).
That includes changing the backend if that would do a better job in
the end (with the benefit that it's mostly maintained by another
project).
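
To make the "backend (external daemon)" part a bit more concrete, here
is a rough sketch of what sampling host package energy could look like
on the host side. It assumes the Linux powercap sysfs interface with
the intel-rapl driver loaded and package 0 at the path below; the
function names are made up for illustration, and this is not what the
existing QEMU helper does. Reading energy_uj typically needs root.

```python
import time
from pathlib import Path

# Assumption: package 0 zone exposed by the intel-rapl powercap driver.
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(zone: Path, name: str) -> int:
    """Read one integer attribute (in microjoules) from a powercap zone."""
    return int((zone / name).read_text().strip())


def energy_delta_uj(prev_uj: int, cur_uj: int, max_range_uj: int) -> int:
    """Energy consumed between two samples, tolerating one counter wrap.

    The counter wraps around at max_energy_range_uj, so a smaller
    current value is treated as a single wraparound.
    """
    if cur_uj >= prev_uj:
        return cur_uj - prev_uj
    return (max_range_uj - prev_uj) + cur_uj


def average_power_watts(zone: Path = RAPL_ZONE, interval_s: float = 1.0) -> float:
    """Average package power over one sampling interval (hypothetical helper)."""
    max_range = read_uj(zone, "max_energy_range_uj")
    prev = read_uj(zone, "energy_uj")
    time.sleep(interval_s)
    cur = read_uj(zone, "energy_uj")
    # microjoules -> joules, then divide by seconds to get watts
    return energy_delta_uj(prev, cur, max_range) / 1e6 / interval_s
```

An existing monitoring project would already handle this plus the
per-process attribution on top, which is the argument for reusing one
rather than maintaining our own.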

> > Implementing RAPL in pure form though looks fine to me,
> > so the same tools could use it the same way as on the host
> > if needed without VM specific quirks.
>
> IMHO the so called "pure" form is misleading to applications, unless
> we first provided some other practical way to expose the data that
> we would be throwing away from RAPL.

I don't argue that the data should be thrown away, just that we should
provide it some other way instead of through the vCPU RAPL interface,
and not confuse the host's pCPUs with vCPUs.

PS:
Take the example above, that aux threads are inherent pCPU load, and
stretch it to the host side. Then one can say that a pCPU inherently
incurs power draw on other system components with some workloads, so
RAPL MSRs should include that load as well. But yep, at this point it
turns into pointless bike-shedding.

PS2:
In a nutshell, my questions are:
  * should we expose aux threads as another power meter device?
  * would it be better to reuse/integrate with existing (hopefully
    mature) projects for monitoring on the host side, instead of
    duplicating a subset of their capabilities in a QEMU-specific
    helper and then maintaining it?

> With regards,
> Daniel
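
To illustrate the first question, a minimal sketch of keeping vCPU and
aux thread consumption apart instead of folding both into the vCPU
RAPL value. It assumes package energy is attributed to threads in
proportion to their share of CPU time over a sampling interval; all
names (ThreadSample, split_energy) are hypothetical, and this is only
one possible attribution model, not a description of the current code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ThreadSample:
    tid: int
    is_vcpu: bool    # True for vCPU threads, False for aux (non-vCPU) threads
    cpu_ticks: int   # CPU time used by this thread during the interval


def split_energy(pkg_energy_uj: float, total_ticks: int,
                 threads: Iterable[ThreadSample]) -> Tuple[float, float]:
    """Split package energy into (vcpu_uj, aux_uj) by CPU-time share.

    total_ticks is the CPU time of everything that ran on the package
    during the interval, so the two results need not add up to
    pkg_energy_uj: the remainder belongs to other host tasks.
    """
    vcpu_uj = 0.0
    aux_uj = 0.0
    if total_ticks <= 0:
        return vcpu_uj, aux_uj
    for t in threads:
        share_uj = pkg_energy_uj * t.cpu_ticks / total_ticks
        if t.is_vcpu:
            vcpu_uj += share_uj   # would feed the vCPU RAPL MSRs
        else:
            aux_uj += share_uj    # would feed a separate power meter device
    return vcpu_uj, aux_uj
```

With 1000 uJ consumed over 100 ticks, a vCPU thread using 60 ticks and
an iothread using 20 ticks would be reported as 600 uJ via RAPL and
200 uJ via the separate meter, so a guest tool that wants the total
can still add them up.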