Sean Christopherson <[email protected]> writes:

> On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
>> Hi,
>>
>> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
>> running KVM inside a Hyper-V VM (nested virtualization). I tracked
>> it down to an unsigned wraparound in __get_kvmclock() and have
>> bpftrace data showing the exact failure.
>>
>> Setup:
>> - Intel i7-11800H laptop running Windows with Hyper-V
>> - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
>> - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
>> - KVM running inside L1, hosting L2 guests
>>
>> Root cause:
>>
>> __get_kvmclock() does:
>>
>>     hv_clock.tsc_timestamp = ka->master_cycle_now;
>>     hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>>     ...
>>     data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>
>> and __pvclock_read_cycles() does:
>>
>>     delta = tsc - src->tsc_timestamp;   /* unsigned */
>>
>> master_cycle_now is a raw RDTSC captured by
>> pvclock_update_vm_gtod_copy(). host_tsc is a raw RDTSC read by
>> __get_kvmclock() on the current CPU. Both go through the vgettsc()
>> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
>> cross-CPU-consistent reference counter via scale/offset, but stores
>> the *raw* RDTSC in tsc_timestamp as a side effect.
>>
>> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
>> The hypervisor corrects them only through the TSC page scale/offset.
>> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
>> later runs on CPU 1 where the raw TSC is lower, the unsigned
>> subtraction wraps.
>>
>> I wrote a bpftrace tracer (included below) to instrument both
>> functions and captured two corruption events:
>>
>> Event 1:
>>
>>   [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
>>     mcn=598992030530137 mkn=259977082393200
>>
>>   [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
>>     clock=8006399342167092479 host_tsc=598991848289183
>>     master_cycle_now=598992030530137
>>     system_time(mkn+off)=5175860260
>>     TSC DEFICIT: 182240954 cycles
>>
>> master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
>> CPU 1's raw RDTSC was 182M cycles lower.
>>
>>   598991848289183 - 598992030530137 = 18446744073527310662 (u64)
>>
>> Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
>> Correct system_time: 5,175,860,260 ns (~5.2 seconds)
>>
>> Event 2:
>>
>>   [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
>>     mcn=599040238416510
>>
>>   [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
>>     clock=8006399342464295526 host_tsc=599040211994220
>>     master_cycle_now=599040238416510
>>     TSC DEFICIT: 26422290 cycles
>>
>> Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
>>
>> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
>> vs stale master_cycle_now passed to __pvclock_read_cycles().
>>
>> The simplest fix I can think of is guarding the __pvclock_read_cycles
>> call in __get_kvmclock():
>>
>>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>     else
>>         data->clock = hv_clock.system_time;
>
> That might kinda sorta work for one KVM-as-the-host path, but it's not a
> proper fix. The actual guest-side (L2) reads in
> __pvclock_clocksource_read() will also be broken, because
> PVCLOCK_TSC_STABLE_BIT will be set.
>
> I don't see how this scenario can possibly work, KVM is effectively
> mixing two time domains. The stable timestamp from the TSC page is
> (obviously) *derived* from the raw, *unstable* TSC, but they are two
> distinct domains.
>
> What really confuses me is why we thought this would work for Hyper-V
> but not for kvmclock (i.e. KVM-on-KVM). Hyper-V's TSC page and kvmclock
> are the exact same concept, but vgettsc() only special cases
> VDSO_CLOCKMODE_HVCLOCK, not VDSO_CLOCKMODE_PVCLOCK.
>
> Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable
> clocksource to guests when running nested on Hyper-V")?
>
> Vitaly, what am I missing?

It's probably me who's missing something :-) but my understanding is
that we can't be using the TSC page clocksource with unsynchronized TSCs
in L1 at all, as the TSC page (unlike kvmclock) is always partition-wide
and thus can't lead to a sane result if raw TSC readings diverge. The
idea of b0c39dc68e3b was that in Hyper-V guests *with a stable,
synchronized TSC* we may still be using the Hyper-V TSC page clocksource,
and thus we can pass it to L2.

-- 
Vitaly
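
For reference, the wraparound arithmetic from the trace can be replayed in
userspace. This is only a sketch: tsc_delta(), tsc_delta_guarded() and
check_event1() are hypothetical names, not kernel functions, and they model
just the subtraction quoted from __pvclock_read_cycles(), not the mul/shift
scaling that follows it. The guarded variant mirrors the check proposed
above and, as noted in the reply, would not help the guest-side (L2) reads.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function: models only the unsigned
 * subtraction quoted from __pvclock_read_cycles().
 */
static uint64_t tsc_delta(uint64_t tsc, uint64_t tsc_timestamp)
{
	return tsc - tsc_timestamp;	/* wraps modulo 2^64 if tsc < tsc_timestamp */
}

/*
 * Hypothetical helper: the same subtraction with the guard from the
 * proposed patch, yielding a zero delta when the fresh read is behind.
 */
static uint64_t tsc_delta_guarded(uint64_t tsc, uint64_t tsc_timestamp)
{
	return tsc >= tsc_timestamp ? tsc - tsc_timestamp : 0;
}

/* Replays the Event 1 values from the trace. */
static void check_event1(void)
{
	const uint64_t master_cycle_now = 598992030530137ULL;	/* read on CPU 0 */
	const uint64_t host_tsc = 598991848289183ULL;		/* read on CPU 1 */

	/* CPU 1 is behind CPU 0 by the reported deficit. */
	assert(master_cycle_now - host_tsc == 182240954ULL);
	/* The unguarded u64 subtraction wraps to 2^64 - deficit. */
	assert(tsc_delta(host_tsc, master_cycle_now) == 18446744073527310662ULL);
	/* With the guard, the delta is treated as zero. */
	assert(tsc_delta_guarded(host_tsc, master_cycle_now) == 0);
}
```

Calling check_event1() from a test harness passes silently when the numbers
match the trace; the Event 2 values can be checked the same way.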

