On Thu, Aug 7, 2025 at 10:09 AM Xiaoyao Li <xiaoyao...@intel.com> wrote:
>
> On 8/7/2025 2:37 PM, Christian Ehrhardt wrote:
> > On Thu, Aug 7, 2025 at 5:38 AM Xiaoyao Li <xiaoyao...@intel.com> wrote:
> >>
> >> On 8/7/2025 3:18 AM, Daniel P. Berrangé wrote:
> >>> On Wed, Aug 06, 2025 at 07:57:34PM +0200, Christian Ehrhardt wrote:
> >>>> On Wed, Aug 6, 2025 at 2:00 PM Daniel P. Berrangé <berra...@redhat.com>
> >>>> wrote:
> >>>>>
> >>>>> On Wed, Aug 06, 2025 at 01:52:17PM +0200, Christian Ehrhardt wrote:
> >>>>>> Hi,
> >>>>>> I was unsure if this would be better sent to libvirt or qemu - the
> >>>>>> issue is somewhere between libvirt modelling CPUs and qemu 10.1
> >>>>>> behaving differently. I did not want to double post and gladly most of
> >>>>>> the people are on both lists - since the switch in/out of the problem
> >>>>>> is qemu 10.0 <-> 10.1 let me start here. I beg your pardon for not yet
> >>>>>> having all the answers, I'm sure I could find more with debugging, but
> >>>>>> I also wanted to report early for your awareness while we are still in
> >>>>>> the RC phase.
> >>>>>>
> >>>>>>
> >>>>>> # Problem
> >>>>>>
> >>>>>> What I found when testing migrations in Ubuntu with qemu 10.1-rc1 was:
> >>>>>> error: operation failed: guest CPU doesn't match specification:
> >>>>>> missing features: pdcm
> >>>>>>
> >>>>>> This is behaving the same with libvirt 11.4 or the more recent 11.6.
> >>>>>> But switching back to qemu 10.0 confirmed that this behavior is new
> >>>>>> with qemu 10.1-rc.
> >>>>>
> >>>>>
> >>>>>> Without yet having any hard evidence against them I found a few pdcm
> >>>>>> related commits between 10.0 and 10.1-rc1:
> >>>>>> 7ff24fb65 i386/tdx: Don't mask off CPUID_EXT_PDCM
> >>>>>> 00268e000 i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> >>>>>> e68ec2980 i386/cpu: Move adjustment of CPUID_EXT_PDCM before
> >>>>>> feature_dependencies[] check
> >>>>>> 0ba06e46d i386/tdx: Add TDX fixed1 bits to supported CPUIDs
> >>>>>>
> >>>>>>
> >>>>>> # Caveat
> >>>>>>
> >>>>>> My test environment is in LXD system containers, that gives me issues
> >>>>>> in the power management detection
> >>>>>> libvirtd[406]: error from service: GDBus.Error:System.Error.EROFS:
> >>>>>> Read-only file system
> >>>>>> libvirtd[406]: Failed to get host power management capabilities
> >>>>>
> >>>>> That's harmless.
> >>>>
> >>>> Yeah, it always was for me - thanks for confirming.
> >>>>
> >>>>>> And the resulting host-model on a rather old test server will
> >>>>>> therefore have:
> >>>>>> <cpu mode='custom' match='exact' check='full'>
> >>>>>>   <model fallback='forbid'>Haswell-noTSX-IBRS</model>
> >>>>>>   <vendor>Intel</vendor>
> >>>>>>   <feature policy='require' name='vmx'/>
> >>>>>>   <feature policy='disable' name='pdcm'/>
> >>>>>>   ...
> >>>>>>
> >>>>>> But that was fine in the past, and the behavior started to break
> >>>>>> save/restore or migrations just now with the new qemu 10.1-rc.
> >>>>>>
> >>>>>> # Next steps
> >>>>>>
> >>>>>> I'm soon overwhelmed by meetings for the rest of the day, but would be
> >>>>>> curious if one has a suggestion about what to look at next for
> >>>>>> debugging or a theory about what might go wrong. If nothing else comes
> >>>>>> up I'll try to set up a bisect run tomorrow.
> >>>>>
> >>>>> Yeah, git bisect is what I'd start with.
> >>>>
> >>>> Bisect complete, identified this commit
> >>>>
> >>>> commit 00268e00027459abede448662f8794d78eb4b0a4
> >>>> Author: Xiaoyao Li <xiaoyao...@intel.com>
> >>>> Date: Tue Mar 4 00:24:50 2025 -0500
> >>>>
> >>>> i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> >>>>
> >>>> When user requests PDCM explicitly via "+pdcm" without PMU
> >>>> enabled, emit a warning to inform the user.
> >>>>
> >>>> Signed-off-by: Xiaoyao Li <xiaoyao...@intel.com>
> >>>> Reviewed-by: Zhao Liu <zhao1....@intel.com>
> >>>> Link:
> >>>> https://lore.kernel.org/r/20250304052450.465445-3-xiaoyao...@intel.com
> >>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >>>>
> >>>> target/i386/cpu.c | 3 +++
> >>>> 1 file changed, 3 insertions(+)
> >>>>
> >>>>
> >>>>
> >>>> Which is odd as it should only add a warning right?
> >>>
> >>> No, that commit message is misleading.
> >>>
> >>> IIUC mark_unavailable_features() actively blocks usage of the feature,
> >>> so it is a functional change, not merely a emitting warning.
> >>>
> >>> It makes me wonder if that commit was actually intended to block the
> >>> feature or not, vs merely warning ? CC'ing those involved in the
> >>> commit.
> >>
> >> The intention was to print a warning to tell users PDCM cannot be
> >> enabled if pmu is not enabled. While mark_unavailable_features() does
> >> has the effect of setting the bit in cpu->filtered_features[].
> >>
> >> But the feature is masked off anyway
> >
> > Right - it was disabled right from the beginning.
> > As I reported libvirt detected it as not available and constructed the
> > CPU as with it disabled.
> > Which translated it into -cpu ...,pdcm=off,...
> >
> > The new and bad aspect we need to overcome is that in these conditions
> > this now somehow breaks save/restore and migration operations.
>
> The commit 00268e0002 makes a difference only for the case "-cpu
> xxx,pdcm=on" without "pmu=on", and it emits a warning and sets the PDCM
> in cpu->filtered_features[].
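
To make the two behaviours being compared concrete, here is a minimal sketch in C of how I read the description above. It is not the QEMU code: ModelCPU, adjust_pdcm_before(), adjust_pdcm_after() and model_self_check() are made-up names, and treating "PDCM was requested" as "the bit is still set in the feature word" is my assumption, since the thread does not show how the request is detected. Only the masking line and the CPUID_EXT_PDCM name come from this thread; bit 15 of CPUID.01H:ECX is the architectural position of PDCM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* PDCM is bit 15 of CPUID.01H:ECX; the macro name is the one quoted above. */
#define CPUID_EXT_PDCM (1U << 15)

/* Hypothetical stand-in for the relevant CPU state, not QEMU's real struct. */
typedef struct {
    bool enable_pmu;              /* models "pmu=on" / "pmu=off" */
    uint32_t feat_1_ecx;          /* models env->features[FEAT_1_ECX] */
    uint32_t filtered_feat_1_ecx; /* models cpu->filtered_features[FEAT_1_ECX] */
} ModelCPU;

/* Before the commit, as described: PDCM is silently masked off without PMU. */
void adjust_pdcm_before(ModelCPU *cpu)
{
    if (!cpu->enable_pmu) {
        cpu->feat_1_ecx &= ~CPUID_EXT_PDCM;
    }
}

/*
 * After the commit, as described: a requested PDCM without PMU also produces
 * a warning and is recorded as filtered. The "requested" test used here is an
 * assumption made for this sketch.
 */
void adjust_pdcm_after(ModelCPU *cpu)
{
    if (!cpu->enable_pmu) {
        if (cpu->feat_1_ecx & CPUID_EXT_PDCM) {
            fprintf(stderr, "warning: pdcm requested but PMU is disabled\n");
            cpu->filtered_feat_1_ecx |= CPUID_EXT_PDCM;
        }
        cpu->feat_1_ecx &= ~CPUID_EXT_PDCM;
    }
}

/* Compare both variants for the "pdcm=on without pmu=on" case. */
void model_self_check(void)
{
    ModelCPU before = { .enable_pmu = false, .feat_1_ecx = CPUID_EXT_PDCM };
    ModelCPU after = before;

    adjust_pdcm_before(&before);
    adjust_pdcm_after(&after);

    /* The guest-visible bit is cleared in both variants ... */
    assert(before.feat_1_ecx == after.feat_1_ecx);
    /* ... and only the filtered bookkeeping differs. */
    assert(before.filtered_feat_1_ecx == 0);
    assert(after.filtered_feat_1_ecx == CPUID_EXT_PDCM);
}
```

Calling model_self_check() from a small main() exercises the "pdcm=on" case. In this model a CPU that starts with the bit already clear, which is how I would expect `pdcm=off` to look, ends up identical in both variants, and that is why the breakage is surprising.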
But this is `pdcm=off` as I said above, and yet having or not having
that change decides whether the mentioned migration and save/restore
operations break.
And since you mentioned pmu: it does not appear in the qemu cmdline
arguments that libvirt used, and the base type is Haswell-noTSX-IBRS.

> So libvirt must first request with "-cpu xxx,pdcm=on" without "pmu=on"
> and gets the result that PDCM is filtered (set in cpu->filtered_features[]).
>
> This indeed introduces the behavior change that before the commit, "-cpu
> xxx,pdcm=on" without "pmu=on" doesn't get warning nor PDCM is set in
> cpu->filtered_features[], but PDCM is just not set in guest's CPUID.
>
> I couldn't understand how the warning or PDCM is set in
> cpu->filtered_features[] breaks save/restore and migration.
>
> > As a cross-check I reverted just and only 00268e0002 on top of
> > 10.1-rc2 and these use cases work again.
> >
> >> even without the
> >> mark_unavailable_features():
> >>
> >> env->features[FEAT_1_ECX] &= ~CPUID_EXT_PDCM;
> >>
> >> So is it that PDCM is set in cpu->filtered_features[] causing the problem?
> >>
> >>> With regards,
> >>> Daniel
> >>
> >
>

-- 
Christian Ehrhardt
Director of Engineering, Ubuntu Server
Canonical Ltd