On 2026-05-19 3:43 PM, David Woodhouse wrote:
> On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
>> I think I now understand why I feel like I am always asking weird questions. I
>> have been thinking about how to account for downtime, so I see
>> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
> 
> I do not believe in "downtime". There is no such thing.
> There is only "steal time".

Or "leap seconds" as used in the document?

https://lore.kernel.org/all/[email protected]


> 
> If I recall correctly what we described in
> https://lore.kernel.org/all/[email protected]/
> I don't think we actually needed KVM_SET_CLOCK at all, did we?

Here is a partial copy of the content from that link.

The 2nd step for the destination VMM is to invoke the KVM_SET_CLOCK ioctl.

---
 From the destination VMM process:

-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
-   kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+   set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+   kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.

-   KVM will advance the VM's kvmclock to account for elapsed time since
-   recording the clock values.  Note that this will cause problems in
+   KVM will restore the VM's kvmclock, accounting for elapsed time since
+   the clock values were recorded.  Note that this will cause problems in
    the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
    between the source and destination, and a reasonably short time passes
-   between the source pausing the VMs and the destination executing
-   steps 4-7.
+   between the source pausing the VMs and the destination resuming them.
+   Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+   CLOCK_TAI, leap seconds during the migration may also introduce errors.
--
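
To make the elapsed-time accounting in the KVM_SET_CLOCK step concrete, here
is a minimal sketch of the arithmetic as I read it. This is not the kernel
code: the function name and parameters are hypothetical, and the "never step
backward" clamp is my assumption about the intended behavior when
CLOCK_REALTIME on the destination is behind the recorded value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only; all names are hypothetical.
 *
 * guest_src: kvmclock nanoseconds recorded on the source
 * host_src:  CLOCK_REALTIME nanoseconds recorded at the same moment
 * host_dst:  CLOCK_REALTIME nanoseconds on the destination at restore time
 */
static uint64_t restored_kvmclock_ns(uint64_t guest_src, uint64_t host_src,
                                     uint64_t host_dst)
{
    /* Assumption: if realtime appears to have gone backward, do not advance. */
    if (host_dst <= host_src)
        return guest_src;

    /* Advance kvmclock by the wall-clock time that elapsed in between. */
    return guest_src + (host_dst - host_src);
}
```

With this model, any CLOCK_REALTIME disagreement between the two hosts (or a
leap second in between) lands directly in the restored kvmclock value.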


>>>
>>> So your "MASTERCLOCK_UPDATE is pending and there is no pending
>>> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
>>> pending, then there *will* be a CLOCK_UPDATE pending.
>>
>> Suppose the VM is stopped and the master clock is active.
> 
> I don't know what it means for a VM to be 'stopped'. Do you mean that
> all vCPUs happen to be experiencing steal time at the present moment?

Taking QEMU as an example, all vCPU threads remain asleep in host userspace
without having a chance to invoke KVM_RUN. As a result, none of the vCPUs can
enter KVM kernel mode to process any pending requests.

This is the state before QEMU resumes from live migration or live update.

(qemu) stop

(qemu) info status
VM status: paused


According to my understanding, older KVM versions even required the userspace
VMM to keep vCPUs in userspace to avoid racing with KVM_RUN.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/x86.c?h=v5.15#n6090

        case KVM_SET_CLOCK: {
... ...
                /*
                 * TODO: userspace has to take care of races with VCPU_RUN, so
                 * kvm_gen_update_masterclock() can be cut down to locked
                 * pvclock_update_vm_gtod_copy().
                 */

> 
>> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
>> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
>> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
>> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
>> KVM_REQ_CLOCK_UPDATE.
> 
> You say 'Unless'... do you mean 'Until'?

Until.

> 
>> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>>
>> I do not think this scenario is important. I am just curious about the expected
>> way to implement similar code in the future :)
> 
> I think that's working correctly. Until the master clock has *actually*
> been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
> disseminate the new information to its own pvclock?

Thank you very much for helping confirm this.
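
For my own notes, the ordering discussed above can be written as a toy model.
This is only a sketch: the toy_* functions, the REQ_* constants and the
requests[] array are hypothetical stand-ins, not the KVM implementation. It
only shows that CLOCK_UPDATE becomes pending once some vCPU actually processes
MASTERCLOCK_UPDATE.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VCPUS 4

/* Hypothetical stand-ins for the two KVM requests discussed above. */
#define REQ_MASTERCLOCK_UPDATE (1 << 0)
#define REQ_CLOCK_UPDATE       (1 << 1)

static unsigned int requests[NR_VCPUS];

/* Models the clocksource-change notification: flag every vCPU. */
static void toy_gtod_update(void)
{
    for (int i = 0; i < NR_VCPUS; i++)
        requests[i] |= REQ_MASTERCLOCK_UPDATE;
}

/* Models one vCPU entering KVM_RUN and servicing its own requests. */
static void toy_vcpu_run(int cpu)
{
    if (requests[cpu] & REQ_MASTERCLOCK_UPDATE) {
        requests[cpu] &= ~REQ_MASTERCLOCK_UPDATE;
        /* Master clock updated: ask every vCPU to refresh its pvclock. */
        for (int i = 0; i < NR_VCPUS; i++)
            requests[i] |= REQ_CLOCK_UPDATE;
    }
    if (requests[cpu] & REQ_CLOCK_UPDATE)
        requests[cpu] &= ~REQ_CLOCK_UPDATE;
}

/* True if any vCPU still has the given request pending. */
static bool toy_any_pending(unsigned int req)
{
    for (int i = 0; i < NR_VCPUS; i++)
        if (requests[i] & req)
            return true;
    return false;
}
```

While all vCPUs stay in userspace, toy_vcpu_run() is never called, so only
REQ_MASTERCLOCK_UPDATE is pending after toy_gtod_update().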


> 
>> For the live migration scenario, the current QEMU implementation not only fails
>> to account for downtime, but also has a drift issue. That is what I would like
>> to address in QEMU.
> 
> Again, restore the gTSC as accurately as possible. Probably by working
> out for *yourself* the relationships of the source and destination host
> TSCs to real time, and then reconstituting on the destination using TSC
> offset just as for live migration.
> 
> And then use KVM_SET_CLOCK_GUEST too.
> 
> That's what I attempted to document in
> https://lore.kernel.org/all/[email protected]/
> and should probably revive.

I would really appreciate it if this document could be revived. I don't see it
in your most recent v4 PATCH 7. It is very helpful as a guideline for how
userspace VMMs should take advantage of these APIs.
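
In the meantime, this is how I currently read the "restore the gTSC" advice,
as a rough sketch only. The helper names are hypothetical, and it assumes that
CLOCK_REALTIME is synchronized between the hosts and that no TSC scaling is
applied on the destination. Please correct me if the arithmetic is not what
you have in mind.

```c
#include <assert.h>
#include <stdint.h>

/* Guest TSC ticks that elapse over elapsed_ns at freq_khz (ticks per ms). */
static uint64_t ns_to_guest_ticks(uint64_t elapsed_ns, uint64_t freq_khz)
{
    /* ticks = ns * kHz / 1e6, split to avoid overflowing 64 bits. */
    return (elapsed_ns / 1000000u) * freq_khz +
           (elapsed_ns % 1000000u) * freq_khz / 1000000u;
}

/* Guest TSC value expected on the destination at realtime_dst_ns. */
static uint64_t guest_tsc_at_dest(uint64_t gtsc_src, uint64_t realtime_src_ns,
                                  uint64_t realtime_dst_ns, uint64_t freq_khz)
{
    /* Assumption: synchronized realtime, so the destination is not behind. */
    assert(realtime_dst_ns >= realtime_src_ns);
    return gtsc_src +
           ns_to_guest_ticks(realtime_dst_ns - realtime_src_ns, freq_khz);
}

/* Offset such that host_tsc_dst + offset equals the target guest TSC. */
static uint64_t tsc_offset_for(uint64_t guest_tsc_target, uint64_t host_tsc_dst)
{
    /* Unsigned subtraction wraps modulo 2^64, which is intended here. */
    return guest_tsc_target - host_tsc_dst;
}
```

For example, with a 3 GHz guest TSC (freq_khz = 3000000) and 1 ms between the
two readings, the guest TSC should have advanced by 3000000 ticks.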

Thank you very much!

Dongli Zhang

