On Thu, Aug 7, 2025 at 10:09 AM Xiaoyao Li <xiaoyao...@intel.com> wrote:
>
> On 8/7/2025 2:37 PM, Christian Ehrhardt wrote:
> > On Thu, Aug 7, 2025 at 5:38 AM Xiaoyao Li <xiaoyao...@intel.com> wrote:
> >>
> >> On 8/7/2025 3:18 AM, Daniel P. Berrangé wrote:
> >>> On Wed, Aug 06, 2025 at 07:57:34PM +0200, Christian Ehrhardt wrote:
> >>>> On Wed, Aug 6, 2025 at 2:00 PM Daniel P. Berrangé <berra...@redhat.com> 
> >>>> wrote:
> >>>>>
> >>>>> On Wed, Aug 06, 2025 at 01:52:17PM +0200, Christian Ehrhardt wrote:
> >>>>>> Hi,
> >>>>>> I was unsure if this would be better sent to libvirt or qemu - the
> >>>>>> issue is somewhere between libvirt modelling CPUs and qemu 10.1
> >>>>>> behaving differently. I did not want to double post and gladly most of
> >>>>>> the people are on both lists - since the switch in/out of the problem
> >>>>>> is qemu 10.0 <-> 10.1 let me start here. I beg your pardon for not yet
> >>>>>> having all the answers, I'm sure I could find more with debugging, but
> >>>>>> I also wanted to report early for your awareness while we are still in
> >>>>>> the RC phase.
> >>>>>>
> >>>>>>
> >>>>>> # Problem
> >>>>>>
> >>>>>> What I found when testing migrations in Ubuntu with qemu 10.1-rc1 was:
> >>>>>>     error: operation failed: guest CPU doesn't match specification:
> >>>>>> missing features: pdcm
> >>>>>>
> >>>>>> This is behaving the same with libvirt 11.4 or the more recent 11.6.
> >>>>>> But switching back to qemu 10.0 confirmed that this behavior is new
> >>>>>> with qemu 10.1-rc.
> >>>>>
> >>>>>
> >>>>>> Without yet having any hard evidence against them I found a few pdcm
> >>>>>> related commits between 10.0 and 10.1-rc1:
> >>>>>>     7ff24fb65 i386/tdx: Don't mask off CPUID_EXT_PDCM
> >>>>>>     00268e000 i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> >>>>>>     e68ec2980 i386/cpu: Move adjustment of CPUID_EXT_PDCM before
> >>>>>> feature_dependencies[] check
> >>>>>>     0ba06e46d i386/tdx: Add TDX fixed1 bits to supported CPUIDs
> >>>>>>
> >>>>>>
> >>>>>> # Caveat
> >>>>>>
> >>>>>> My test environment is in LXD system containers, that gives me issues
> >>>>>> in the power management detection
> >>>>>>     libvirtd[406]: error from service: GDBus.Error:System.Error.EROFS:
> >>>>>> Read-only file system
> >>>>>>     libvirtd[406]: Failed to get host power management capabilities
> >>>>>
> >>>>> That's harmless.
> >>>>
> >>>> Yeah, it always was for me - thanks for confirming.
> >>>>
> >>>>>> And the resulting host-model on a  rather old test server will 
> >>>>>> therefore have:
> >>>>>>     <cpu mode='custom' match='exact' check='full'>
> >>>>>>       <model fallback='forbid'>Haswell-noTSX-IBRS</model>
> >>>>>>       <vendor>Intel</vendor>
> >>>>>>       <feature policy='require' name='vmx'/>
> >>>>>>       <feature policy='disable' name='pdcm'/>
> >>>>>>        ...
> >>>>>>
> >>>>>> But that was fine in the past, and the behavior started to break
> >>>>>> save/restore or migrations just now with the new qemu 10.1-rc.
> >>>>>>
> >>>>>> # Next steps
> >>>>>>
> >>>>>> I'm soon overwhelmed by meetings for the rest of the day, but would be
> >>>>>> curious if one has a suggestion about what to look at next for
> >>>>>> debugging or a theory about what might go wrong. If nothing else comes
> >>>>>> up I'll try to set up a bisect run tomorrow.
> >>>>>
> >>>>> Yeah, git bisect is what I'd start with.
> >>>>
> >>>> Bisect complete, identified this commit
> >>>>
> >>>> commit 00268e00027459abede448662f8794d78eb4b0a4
> >>>> Author: Xiaoyao Li <xiaoyao...@intel.com>
> >>>> Date:   Tue Mar 4 00:24:50 2025 -0500
> >>>>
> >>>>       i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> >>>>
> >>>>       When user requests PDCM explicitly via "+pdcm" without PMU 
> >>>> enabled, emit
> >>>>       a warning to inform the user.
> >>>>
> >>>>       Signed-off-by: Xiaoyao Li <xiaoyao...@intel.com>
> >>>>       Reviewed-by: Zhao Liu <zhao1....@intel.com>
> >>>>       Link: 
> >>>> https://lore.kernel.org/r/20250304052450.465445-3-xiaoyao...@intel.com
> >>>>       Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >>>>
> >>>>    target/i386/cpu.c | 3 +++
> >>>>    1 file changed, 3 insertions(+)
> >>>>
> >>>>
> >>>>
> >>>> Which is odd as it should only add a warning right?
> >>>
> >>> No, that commit message is misleading.
> >>>
> >>> IIUC mark_unavailable_features() actively blocks usage of the feature,
> >>> so it is a functional change, not merely emitting a warning.
> >>>
> >>> It makes me wonder if that commit was actually intended to block the
> >>> feature or not, vs merely warning ?  CC'ing those involved in the
> >>> commit.
> >>
> >> The intention was to print a warning to tell users PDCM cannot be
> >> enabled if pmu is not enabled. However, mark_unavailable_features() does
> >> have the effect of setting the bit in cpu->filtered_features[].
> >>
> >> But the feature is masked off anyway
> >
> > Right - it was disabled right from the beginning.
> > As I reported libvirt detected it as not available and constructed the
> > CPU as with it disabled.
> > Which translated it into -cpu ...,pdcm=off,...
> >
> > The new and bad aspect we need to overcome is that in these conditions
> > this now somehow breaks save/restore and migration operations.
>
> The commit 00268e0002 makes a difference only for the case "-cpu
> xxx,pdcm=on" without "pmu=on", and it emits a warning and sets the PDCM
> in cpu->filtered_features[].

But this is `pdcm=off`, as I said above, and yet whether that commit is
present or not decides whether the mentioned migration and save/restore
operations break.
Since you mentioned pmu: it does not appear in the qemu cmdline
arguments that libvirt used, and the base model is Haswell-noTSX-IBRS.

> So libvirt must first request with "-cpu xxx,pdcm=on" without "pmu=on"
> and gets the result that PDCM is filtered (set in cpu->filtered_features[]).
>
> This indeed introduces a behavior change: before the commit, "-cpu
> xxx,pdcm=on" without "pmu=on" got neither a warning nor PDCM set in
> cpu->filtered_features[]; PDCM was simply not set in the guest's CPUID.
>
> I couldn't understand how the warning, or PDCM being set in
> cpu->filtered_features[], breaks save/restore and migration.
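
For reference, here is a toy model of the before/after behavior as it is
described above. It is only a sketch and not the QEMU code: struct toy_cpu,
its fields and the two adjust_pdcm_*() helpers are made up for illustration,
and only the CPUID_EXT_PDCM bit position (CPUID.01H:ECX bit 15) is taken from
the architecture. In particular, the user_requested field just stands for
"bits the model treats as explicitly requested". How QEMU fills the real
equivalent for pdcm=off is not covered here, so this does not by itself
explain the pdcm=off failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 15 (PDCM). */
#define CPUID_EXT_PDCM (1U << 15)

/* Hypothetical, stripped-down stand-in for the relevant CPU state. */
struct toy_cpu {
    bool enable_pmu;
    uint32_t user_requested;    /* bits treated as explicitly requested */
    uint32_t features;          /* bits that end up in the guest CPUID */
    uint32_t filtered_features; /* bits reported as requested but unavailable */
};

/* Before the commit, as described: the bit is silently cleared. */
static void adjust_pdcm_before(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    if (!cpu->enable_pmu) {
        cpu->features &= ~CPUID_EXT_PDCM;
    }
}

/*
 * After the commit, as described: the bit is still cleared, but an
 * explicit request is additionally recorded in filtered_features.
 */
static void adjust_pdcm_after(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    if (!cpu->enable_pmu) {
        cpu->filtered_features |= cpu->user_requested & CPUID_EXT_PDCM;
        cpu->features &= ~CPUID_EXT_PDCM;
    }
}
```

In both variants the guest CPUID ends up without PDCM. The only difference in
this model is whether the bit shows up in filtered_features, which is the
state a management layer could observe when it checks the resulting CPU.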
>
> > As a cross-check I reverted just and only 00268e0002 on top of
> > 10.1-rc2 and these use cases work again.
> >
> >> even without the
> >> mark_unavailable_features():
> >>
> >>       env->features[FEAT_1_ECX] &= ~CPUID_EXT_PDCM;
> >>
> >> So is it that PDCM is set in cpu->filtered_features[] causing the problem?
> >>
> >>> With regards,
> >>> Daniel
> >>
> >
> >
>


-- 
Christian Ehrhardt
Director of Engineering, Ubuntu Server
Canonical Ltd
