Re: [PATCH v6 0/3] Add support for the RAPL MSRs series

Igor Mammedov Tue, 22 Oct 2024 08:36:50 -0700

On Tue, 22 Oct 2024 14:15:44 +0100
Daniel P. Berrangé <berra...@redhat.com> wrote:

> On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> > On Fri, 18 Oct 2024 13:59:34 +0100
> > Daniel P. Berrangé <berra...@redhat.com> wrote:
> >   
> > > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:  
> > > > On Wed, 16 Oct 2024 14:56:39 +0200
> > > > "Anthony Harivel" <ahari...@redhat.com> wrote:  
> > [...]
> >   
> > > > 
> > > > This also leads to the question of whether we should account
> > > > for non-vCPU threads at all. Looking at real hardware, those
> > > > MSRs return power usage of CPUs only, and they do not
> > > > return consumption from auxiliary system components
> > > > (io/memory/...). One can consider non-vCPU threads in QEMU
> > > > as auxiliary components, so we probably should not
> > > > account for them at all when modeling the same hw feature
> > > > (aka be consistent with what real hw does).
> > > 
> > > I understand your POV, but I think that would be a mistake,
> > > and would undermine the usefulness of the feature.
> > > 
> > > The deployment model has a cluster of hosts and guests, all
> > > belonging to the same user. The user goal is to measure host
> > > power consumption imposed by the guest, and dynamically adjust
> > > guest workloads in order to minimize power consumption of the
> > > host.  
> > 
> > For the cloud use-case, the host side is likely in a better position
> > to accomplish the task of saving power by migrating VMs to
> > another socket/host to compact idle load. (I've found at least 1
> > Kubernetes tool[1], which does energy monitoring). Perhaps there
> > are schedulers out there that do that using its data.
> 
> The host admin can merely shuffle workloads around, hoping that
> a different packing of workloads onto machines will reduce power
> by some amount. You might win a few %, or low 10s of % with this
> if you're good at it.

Package-level savings probably won't make much of a dent (older hw, less
impact), but vacating/powering down a host is a somewhat different story
(that was my home lab case: trying to minimize idle consumption of 24/7
systems). Even then, switching to newer hardware might eventually reach
the point of diminishing returns.

> The guest admin can change the way their workload operates to
> reduce its inherent power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.

Looking at discussions around Intel's hybrid CPUs, I got the
impression that neither userspace nor the kernel has enough energy
consumption info to make decent scheduling decisions, and _no one
really wishes to do scheduling manually_ to begin with. That's where
Intel's CPUs with ITD (Intel Thread Director) come into the picture,
helping the kernel somehow bin tasks based on efficiency figures (since
the CPU knows exactly how many resources it is using).
But that's relatively new, and whether such CPUs will stick or
not is still an open question (it makes sense for the mobile market,
but for other applications I'd guess time will tell).

> IOW, neither makes the other obsolete, both approaches are
> desirable.

no argument here.

> > > The guest workloads can impose non-negligible power consumption
> > > loads on non-vCPU threads in QEMU. Without that accounted for,
> > > any adjustments will be working from (sometimes very) inaccurate
> > > data.  
> > 
> > Perhaps adding one or several energy sensors (ex: some i2c ones),
> > would let us provide auxiliary threads consumption to guest, and
> > even make it more granular if necessary (incl. vhost user/out of
> > process device models or pass-through devices if they have PMU).
> > It would be better than further muddling vCPUs consumption
> > estimates with something that doesn't belong there.  
> 
> There's a tradeoff here in that info directly associated with
> backend threads is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here, we need it abstracted sufficiently, that different
> backend choices for a given don't change what sensors are
> exposed.
> 
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this area that we're not inventing
> something totally QEMU specific ?

We can expose them as ACPI power meter devices, to make it
abstract for the guest OS (i.e. the guest would need only a standard
driver for it), or alternatively model some real i2c
sensors. But yes, it is something that should be explored so that
it works with / is supported by common tools or the tool of choice.

> 
> > > IOW, I think it is right to include non-vCPU threads usage in
> > > the reported info, as it is still fundamentally part of the
> > > load that the guest imposes on host pCPUs it is permitted to
> > > run on.  
> > 
> > 
> > From what I've read, process energy usage done via RAPL is not
> > exactly accurate. But there are monitoring tools out there that
> > use RAPL and other sources to make energy consumption monitoring
> > more reliable.
> > 
> > Reinventing that wheel and pulling all of the nuances of process
> > power monitoring inside the QEMU process needlessly complicates it.
> > Maybe we should reuse one of existing tools and channel its data
> > through appropriate QEMU channels (RAPL/emulated PMU counters/...).  
> 
> Note, this feature is already released in QEMU 9.1.0.

That doesn't preclude us from improving impl. details
(i.e. what tasks QEMU does and what is up to the backend (external daemon))
though, incl. changing the backend if that would do a better job
in the end (with the benefit that it's mostly maintained by another project).

> > Implementing RAPL in pure form though looks fine to me,
> > so the same tools could use it the same way as on the host
> > if needed without VM specific quirks.  
> 
> IMHO the so-called "pure" form is misleading to applications, unless
> we first provide some other practical way to expose the data that
> we would be throwing away from RAPL.

I'm not arguing that the data should be thrown away, just that we should
provide it some other way than through the vCPU RAPL interface, and not
confuse the host's pCPUs with vCPUs.
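
For reference, this is roughly how a tool turns the RAPL counters into
joules, the same way whether it runs on the host or in a guest. It is only
a sketch: the function names are made up for illustration, and it assumes
the usual Intel layout (energy status units in bits 12:8 of
MSR_RAPL_POWER_UNIT, and a 32-bit wrapping energy counter such as
MSR_PKG_ENERGY_STATUS).

```python
def energy_unit_joules(power_unit_msr: int) -> float:
    """Decode the energy unit from a raw MSR_RAPL_POWER_UNIT value.

    Assumes the energy status units (ESU) field sits in bits 12:8,
    so one counter increment equals 1 / 2**ESU joules.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(prev_raw: int, cur_raw: int, unit: float) -> float:
    """Energy consumed between two raw counter reads, in joules.

    The counter is treated as 32 bits wide, so a single wraparound
    between the two reads is handled by masking the difference.
    """
    delta = (cur_raw - prev_raw) & 0xFFFFFFFF
    return delta * unit


if __name__ == "__main__":
    unit = energy_unit_joules(0xA0E03)  # example raw value, ESU = 14
    print(energy_delta_joules(0xFFFFFFF0, 0x10, unit))
```

Nothing in this arithmetic says which threads the counter should cover;
that is a policy decision on the side that fills in the counter.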

PS:
Take the example above, that aux threads are inherent pCPU load, and
stretch it to the host side. Then one can say a pCPU inherently incurs
power draw on other system components with some workloads, so RAPL MSRs
should include that load as well.
But yep, at this point it turns into pointless bike-shedding.

PS2:
In a nutshell, my questions are:
 * should we expose aux threads as a separate power meter device?
 * would it be better to reuse/integrate with existing (hopefully mature)
   projects for monitoring on the host side instead of duplicating a subset
   of their capabilities in a QEMU-specific helper and then maintaining it?
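
To make the first question concrete, here is a minimal sketch of the split
I have in mind. It assumes a package energy delta is attributed to threads
in proportion to their share of CPU time over the sampling interval; the
function and parameter names are hypothetical, and this is not meant as a
description of what the current helper does.

```python
def split_energy(pkg_joules, thread_ticks, vcpu_tids, total_ticks):
    """Split a package energy delta between vCPU and aux threads.

    pkg_joules   - energy consumed by the package over the interval
    thread_ticks - {tid: CPU ticks used by that QEMU thread}
    vcpu_tids    - set of tids that are vCPU threads
    total_ticks  - ticks consumed on the package by all tasks
    Returns (vcpu_joules, aux_joules).
    """
    vcpu = aux = 0.0
    if total_ticks <= 0:
        return vcpu, aux
    for tid, ticks in thread_ticks.items():
        share = pkg_joules * ticks / total_ticks
        if tid in vcpu_tids:
            vcpu += share   # would feed the vCPU RAPL counter
        else:
            aux += share    # would feed a separate power meter device
    return vcpu, aux


if __name__ == "__main__":
    print(split_energy(10.0, {1: 30, 2: 10, 3: 10}, {1, 2}, 100))
```

With this split the guest can still sum both values if it wants the total
load it imposes on the host, but the RAPL figure stays CPU-only.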

> With regards,
> Daniel
