David Woodhouse <[email protected]> writes:
...
>
> Then in 2018, Vitaly Kuznetsov added Hyper-V TSC page support in
> commit b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests when
> running nested on Hyper-V"), which extended vgettsc() to handle the
> HVCLOCK case.
>
> I'd quite like to kill it all with fire and make KVM use
> ktime_get_snapshot() instead.
The main motivation is reducing the complexity of KVM's timekeeping code
I guess?
>
> However, to correlate with the TSC provided to guests, KVM needs the
> underlying host TSC counter value, *not* the cycles count from the
> hyperv_clocksource_tsc_page clocksource which is scaled to 10MHz.
>
> If we wanted to support master clock mode while nesting under KVM and
> bizarrely using the kvmclock for system timing, we'd have the same
> problem with the kvmclock clocksource, which similarly scales to 1GHz.
>
> One option is to say "Don't Do That Then™": if you want to provide a
> masterclock kvmclock to guests then *don't* use the silly pvclocks for
> your own kernel's timekeeping, use the damn TSC. Because if the TSC
> *isn't* reliable then you can't do masterclock mode for your guests
> anyway.
The statement "TSC isn't reliable" deserves a book of its own :-)
Historically, we've seen all sorts of issues with it, but by the time of
b0c39dc68e3b, they were mostly gone. The real problem the Hyper-V/Azure
folks were solving back then was that while the TSC *was* reliable
(synchronized across CPUs, not jumping backwards, stable frequency,
...), tons of hardware out there (Azure is quite big) did not support
TSC scaling. VMs on Azure don't migrate very often, but they do migrate
when hardware maintenance is needed. Migrating to a host with a
different TSC frequency would've been a problem, so the Hyper-V TSC page
was introduced. Note: it is a *single* page for all CPUs, so the
clocksource was never intended to be used in a situation where TSCs are
unsynchronized across CPUs.
To deal with migrations, the Hyper-V folks came up with a mechanism
called 'reenlightenment notifications', and we support it in KVM. It's
not really great, as we need to stop all the nested VMs, but it does the
job: we can re-compute guest PV clocksources (kvmclock, TSC page,
... Xen?) and live happily ever after.
>
> Perhaps that should have been the response when commit b0c39dc68e3b was
> submitted, but I guess we're stuck supporting that mode now.
Times are changing, and it is becoming increasingly difficult to find
x86 hardware without TSC scaling support. Linux guests on Hyper-V now
prefer TSC if possible (HV_ACCESS_TSC_INVARIANT; see, e.g., commit
4c78738ead4e), so I expect that in a few years, there will be no need
for the Hyper-V TSC page clocksource or the reenlightenment logic
anyway.
> But I really do want to kill the KVM hacks and use ktime_get_snapshot().
>
> Reverse-engineering the original TSC reading from the clocksource
> counter value doesn't look sane, without a loss of precision and/or
> 128-bit division.
>
> One simple option that occurs to me would be to add a 'cycles_raw'
> value to the system_time_snapshot, for PV clocksources like hyperv and
> kvmclock to populate with the original TSC reading.
Personally, I don't see this as such an ugly hack.
>
> That might actually let us clean up some of the PTP code that currently
> has to deal with TSC vs. kvmclock in counter snapshots too. I think I
> could kill the use of get_cycles() in vmclock for the kvmclock case,
> which might make Thomas happy...
>
> Any better ideas?
--
Vitaly