On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
> I think I now understand why I feel like I am always asking weird
> questions. I have been thinking about how to account for downtime, so
> I see KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.

I do not believe in "downtime". There is no such thing. There is only
"steal time". A CPU may be off in the weeds (a vCPU suffering steal
time, or even a pCPU in SMM which is effectively the same thing) but
time doesn't stop, and neither does the TSC.

> Suppose we are not going to account for any downtime. With
> KVM_SET_CLOCK_GUEST:
>
> 1. The masterclock is active, so gTSC is synchronized across vCPUs.
> All vCPUs share the same kvm_read_l1_tsc(v, ka->master_cycle_now).

Strictly, by the time we get to the end of my series, masterclock is
active *because* all the vCPUs are running at the same TSC rate (even
if the guest set them to different offsets). But OK.

> 2. Migrate the gTSC to the target VM however people want (either
> absolute value or offset value). (Optional) Account for downtime in
> gTSC however people want, even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME,
> which you may not like.
>
> 3. Adjust kvm-clock (that is, ka->kvmclock_offset) with
> KVM_SET_CLOCK_GUEST.
>
> That is why you think KVM_SET_CLOCK is no longer required if we have
> KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
> KVM_CLOCK_REALTIME.

If I recall correctly what we described in
https://lore.kernel.org/all/[email protected]/
I don't think we actually needed KVM_SET_CLOCK at all, did we?

We *abuse* KVM_GET_CLOCK to give us a tuple of {realtime, host TSC}
because there's actually no other way for *userspace* to get that. We
don't actually *care* about the KVM clock part. We use the {realtime,
host TSC} pair to reconstitute the guest TSC values to correctly
reflect the passing of time while the guest was in the ether.

> If it isn't required to account for any downtime for gTSC or if there
> is another way to do so, only KVM_SET_CLOCK_GUEST is enough.

Right. If you only want the guest to come back with the *same* values
in its TSC as before the migration, as if the TSC was *paused* during
the migration, then you can just restore those values and use
KVM_SET_CLOCK_GUEST.
Assuming you are on modern hardware and have set all vCPUs to the same
rate (and are using this series so the *guest* can't break masterclock
for you, and you can trust the KVM_SET_CLOCK_GUEST will work).

> > > Another scenario is when only MASTERCLOCK_UPDATE is pending and
> > > there is no pending CLOCK_UPDATE.
> > >
> > > In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE
> > > before saving pvclock_vcpu_time_info?
> >
> > I'm not sure I understand that scenario.
> >
> > MASTERCLOCK_UPDATE means we have to actually recalculate the master
> > clock (which really *should* be rare, now!). And then any time we do
> > that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> > the new information. Which is why kvm_end_pvclock_update() does exactly
> > that.
> >
> > So your "MASTERCLOCK_UPDATE is pending and there is no pending
> > CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> > pending, then there *will* be a CLOCK_UPDATE pending.
>
> Suppose the VM is stopped and the master clock is active.

I don't know what it means for a VM to be 'stopped'. Do you mean that
all vCPUs happen to be experiencing steal time at the present moment?

> Suddenly, we change the host clocksource from TSC to HPET.
> pvclock_gtod_notify() may call pvclock_gtod_update_fn() to set a
> pending KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Unless the pending
> KVM_REQ_MASTERCLOCK_UPDATE is processed by kvm_update_masterclock(),
> kvm_end_pvclock_update() will not set a pending KVM_REQ_CLOCK_UPDATE.

You say 'Unless'... do you mean 'Until'?

> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE
> is pending.
>
> I do not think this scenario is important. I am just curious about the
> expected way to implement similar code in the future :)

I think that's working correctly.
Until the master clock has *actually* been updated, there's no point in
setting CLOCK_UPDATE for each vCPU to disseminate the new information
to its own pvclock?

> > > > > Would it be helpful to validate that the delta is within a
> > > > > reasonable range, e.g. that the drift can never be more than
> > > > > five minutes (forward or backward)?
> > > >
> > > > If a guest has been running for months on a previous host and is
> > > > migrated to a new host, don't we expect that the KVM clock of the
> > > > new VM on the new host is tweaked from its default near-zero after
> > > > creation, to some large amount?
> > >
> > > Regarding live migration, my own investigation does not show a
> > > proportional relationship between VM uptime and the amount of drift.
> >
> > You're comparing the VM on the source host, with the VM on the
> > destination post-migration.
>
> Apologies for making it confusing. I was just trying to explain why I
> think the kvm-clock drift will not be large.

Sure, but I don't care. If we have a sane API, the drift should be
zero :)

> We previously discussed the vCPU hotplug and kvm-clock drift issue.
> The longer the time interval between two vCPU hotplug events, the
> larger the drift.
>
> For live migration (with QEMU), I provided the equation to show that
> the drift will not be large, because it is determined by something
> else rather than by how long the VM has been running on the source
> server.
>
> For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs
> to a guest that has been running for three months, the drift will be
> relatively larger.
>
> For QEMU live migration, migrating a guest VM that has been running on
> the source host for *three months* versus one that has been running
> for *one day* will not cause much difference in kvm-clock drift.

Right.

> For the ideal live update case (on the same host), there may be no
> need to adjust gTSC so that it keeps incrementing. In that case,
> KVM_SET_CLOCK_GUEST can be used to adjust kvm-clock based on gTSC.

Right. You restore the gTSC using its *offset* from the host TSC which
hasn't stopped counting on the same host. Then use KVM_SET_CLOCK_GUEST
to restore the kvmclock in terms of the gTSC. And you have an
absolutely cycle-perfect migration.

> For the live migration scenario, the current QEMU implementation not
> only fails to account for downtime, but also has a drift issue. That
> is what I would like to address in QEMU.

Again, restore the gTSC as accurately as possible. Probably by working
out for *yourself* the relationships of the source and destination host
TSCs to real time, and then reconstituting on the destination using TSC
offset just as for live migration. And then use KVM_SET_CLOCK_GUEST
too.

That's what I attempted to document in
https://lore.kernel.org/all/[email protected]/
and should probably revive.
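
To make the arithmetic concrete, here is a rough userspace-side sketch
of that reconstitution. It is illustrative only and not the actual KVM
or QEMU code: the names ClockSample, guest_tsc_at and dest_tsc_offset
are made up for this example. It assumes the guest TSC is simply host
TSC plus an offset (no TSC scaling), that the guest TSC frequency is
known, and that the realtime clocks on the two hosts agree with each
other.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass(frozen=True)
class ClockSample:
    """A {realtime, host TSC} pair sampled at the same instant on one host.

    Hypothetical container for this sketch; not a kernel structure.
    """
    realtime_ns: int
    host_tsc: int


def guest_tsc_at(sample: ClockSample, tsc_offset: int) -> int:
    """Guest TSC at the sampled instant, assuming guest = host + offset."""
    return sample.host_tsc + tsc_offset


def dest_tsc_offset(src: ClockSample, src_offset: int,
                    dst: ClockSample, guest_tsc_hz: int) -> int:
    """Offset to apply on the destination so the guest TSC keeps counting
    across the migration as if it had never stopped."""
    # What the guest TSC read when the source sample was taken.
    src_guest_tsc = guest_tsc_at(src, src_offset)
    # Real time that passed between the two samples (assumes the hosts'
    # realtime clocks are synchronised with each other).
    elapsed_ns = dst.realtime_ns - src.realtime_ns
    # Guest ticks that should have elapsed over that interval.
    elapsed_ticks = elapsed_ns * guest_tsc_hz // NSEC_PER_SEC
    # What the guest TSC should read at the destination sample.
    dst_guest_tsc = src_guest_tsc + elapsed_ticks
    # Signed result; a real caller would reduce it modulo 2**64.
    return dst_guest_tsc - dst.host_tsc
```

In the same-host live update case the host TSC has itself advanced by
elapsed_ticks, so the computed offset comes out equal to the original
one, which is just the "restore the offset" case described above. On a
different host the offset absorbs both the time in flight and the
difference between the two host TSCs. Either way the kvmclock would
then be restored in terms of the gTSC with KVM_SET_CLOCK_GUEST.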

