Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

David Woodhouse Tue, 19 May 2026 00:56:15 -0700

On Mon, 2026-05-18 at 17:57 -0700, Dongli Zhang wrote:
> On 2026-05-18 1:48 AM, David Woodhouse wrote:
> > ...
> 
> I have fixed the Thunderbird configuration. Does it look better to you?


The date is certainly better, thank you. But although I *was* up late
that night frowning at clocks, I didn't think I was up *quite* as late
(almost 2am) as it suggests.

But I suspect that getting *that* right is beyond the limit of
Thunderbird's configurability.

Thanks :)

> I really appreciate guidelines like the ones below.
> 
> https://lore.kernel.org/all/[email protected]
> 
> Assuming I am a user of the new API, I feel confused about whether the goal is
> to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
> meant to supplement the former.

The issue is that KVM_SET_CLOCK_GUEST can only be used in 'masterclock'
mode, when the TSC is reliable and the guest TSCs are all in sync.

Which ought to be *all* of the time, on modern hardware and sane
configurations. And in this series, I don't even let the *guest* screw
that over by setting different TSC offsets on different vCPUs any more
(we stay in masterclock mode in that case now). But the VMM can cause
its guest to come out of masterclock mode, by setting different TSC
*speeds* on different vCPUs.

So there remain some pathological cases where the kvmclock actually
still has a justification to exist, and those are the cases where it
needs to be set in its own right as a function of host time
(KVM_SET_CLOCK), not purely as a function of the guest TSC
(KVM_SET_CLOCK_GUEST).


> 
> If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I
> would appreciate it if the API could carry more data in addition to struct
> pvclock_vcpu_time_info.
> 
> +#define KVM_SET_CLOCK_GUEST    _IOW(KVMIO, 0xd6, struct 
> pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST    _IOR(KVMIO, 0xd7, struct 
> pvclock_vcpu_time_info)
> 
> 
> In the future, if we need to carry additional data, we could simply reuse the
> padding fields instead of introducing another KVM_SET_CLOCK_GUEST2.
> 
> The following is an example of how additional data could be carried.
> 
> KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c68dc1b577eabd5605c6c7c08f3e07ae18d30d5d

I'm not very keen on the way that KVM_[GS]ET_CLOCK threw extra time
references in over the years without any of them ever actually making
*sense*, which makes me reluctant. 

KVM_[GS]ET_CLOCK_GUEST on the other hand does *one* thing: it exports
the relationship between the guest TSC and kvmclock. We don't *need* to
litter it with time values from other clocks willy-nilly.

But yes, you are right in principle.

And in fact, the other part of this conversation has really drawn my
attention to the ugliness of the "try KVM_SET_CLOCK_GUEST and if that
doesn't take then fall back to KVM_SET_CLOCK" which we are pushing onto
userspace.

Yes, I covered it in the guidelines: but we should always abide by the
mantra, "if it *needs* documenting, fix it first".

So perhaps we could build a variant which does both at once. It
provides the guest clock as a function of guest TSC for the sane
masterclock case, but *also* includes CLOCK_TAI at the same moment, in
case it needs to fall back to that.

Although... the actual reference time part of the existing pvclock
might be significantly in the past, especially with all the
MASTERCLOCK_UPDATE elimination that we've been doing. And we'd have to
regenerate it to get a simultaneous realtime reading, wouldn't we?
> > 

> > > Do we need a KVM_CHECK_EXTENSION capability for this? If userspace wants 
> > > to
> > > support the new API, should it detect availability via 
> > > KVM_CHECK_EXTENSION, or
> > > simply try the ioctl and handle failure?
> > 
> > That might be conventional, I suppose. But I suspect Jack's thinking
> > was that userspace is going to have to *try* it anyway, and still might
> > have to fall back to what KVM_SET_CLOCK can manage, so userspace
> > probably wouldn't even bother to check that capability; it doesn't
> > matter.
> > 
> > Since then, we've added some more attributes in this series though, and
> > it probably is worth adding a cap which advertises them *all*?
> > Something like KVM_CAP_CLOCK_PRECISION_API?
> 
> From an API user's perspective, userspace may need to distinguish between an 
> API
> failure and the API not being available.

That's -ENOTTY vs. -EINVAL for the ioctl, isn't it? And isn't there
something similarly unambiguous for the attributes?

But I have no objection to adding a capability. The lack of it was more
oversight than intent.


> > > 
> > > From my perspective, I am also curious how we should reason about this in 
> > > other
> > > scenarios in the future. Specifically, when do we need to process
> > > KVM_REQ_MASTERCLOCK_UPDATE before KVM_REQ_CLOCK_UPDATE, and when is it
> > > acceptable not to? I noticed that kvm_cpuid() already processes only
> > > KVM_REQ_CLOCK_UPDATE.
> > 
> > The way I've been thinking about it — and I'm only two cups of coffee
> > into Monday so take those words literally and don't think of them as
> > British understatement of something I believe is absolute truth — is
> > that MASTERCLOCK_UPDATE is updating the actual clock for the whole VM,
> > while CLOCK_UPDATE is about *putting* that information into the per-
> > vCPU pvclock structures.
> > 
> > So after a MASTERCLOCK_UPDATE, we need to do a CLOCK_UPDATE on all
> > vCPUs to disseminate the result. Which means that if CLOCK_UPDATE is
> > already pending before a MASTERCLOCK_UPDATE, it's probably redundant
> > and might as well be cleared because it's only going to get set *again*
> > in kvm_end_pvclock_update()? 
> 
> Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
> pending CLOCK_UPDATE.
> 
> In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before 
> saving
> pvclock_vcpu_time_info?
> 

I'm not sure I understand that scenario. 

MASTERCLOCK_UPDATE means we have to actually recalculate the master
clock (which really *should* be rare, now!). And then any time we do
that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
the new information. Which is why kvm_end_pvclock_update() does exactly
that.

So your "MASTERCLOCK_UPDATE is pending and there is no pending
CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
pending, then there *will* be a CLOCK_UPDATE pending.


> > > 
> > > Would it be helpful to validate that the delta is within a reasonable 
> > > range,
> > > e.g. that the drift can never be more than five minutes (forward or 
> > > backward)?
> > 
> > If a guest has been running for months on a previous host and is
> > migrated to a new host, don't we expect that the KVM clock of the new
> > VM on the new host is tweaked from its default near-zero after
> > creation, to some large amount?
> > 
> 
> Regarding live migration, my own investigation does not show a proportional
> relationship between VM uptime and the amount of drift.

You're comparing the VM on the source host, with the VM on the
destination post-migration.

Perhaps I misunderstood, but I thought your suggested validation of a
'reasonable range' would also apply when adjusting the kvmclock of the
nascent VM on the destination host, from "newly created" to "has been
running for months" while migrating the state of the actual guest onto
a clean new slate.

> Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
> amount of drift does not depend on how long the VM has been running before 
> live
> migration.
> 
> Instead, it depends on the delta between when we call MSR_IA32_TSC and
> KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.
> 
> The guest TSC stops at P1 and resumes at P3.
> The kvmclock stops at P2 and resumes at P4.
> 
> We expect P1 == P2 and P3 == P4.
> 
> On source host.
> 
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1

Here's where it all starts going wrong. Line 1.

Any API which lets you get a single time value in isolation, and thus
which is already out of date by the time the system call even returns,
is fundamentally unsuitable for migration.

> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
> ... ...
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_GET_CLOCK                               ===> P2
> 
> On target host.
> 
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2

At this point, the nasty hack in the kernel steps in, realises that the
value you're setting on vCPU 2 is within a second or so of the value
you had previously set on vCPU 1, and snaps it back to be precisely the
same. To work around the fundamental brokenness of this method.

> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
> ... ...
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_SET_CLOCK                               ====> P4
> 
> 
> Here is my equiation to predict the drift.

I'm sure you're right, but I didn't get that far when looking at this.
I'd already thrown up in my mouth a little bit by line one.

Here's my equation to predict the drift of a live update done correctly
on the same host using the method I've now put in the documentation:

0.

:)

smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

Reply via email to