kvmclock: Use TSC for sched_clock if it's constant and non-stop

Michael Kelley Wed, 12 Feb 2025 08:44:53 -0800

From: Sean Christopherson <sea...@google.com> Sent: Monday, February 10, 2025 
8:22 AM
> 
> On Sat, Feb 08, 2025, Michael Kelley wrote:
> > From: Sean Christopherson <sea...@google.com> Sent: Friday, February 7, 
> > 2025 9:23 AM
> > >
> > > Dropping a few people/lists whose emails are bouncing.
> > >
> > > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > > >  #ifdef CONFIG_X86_LOCAL_APIC
> > > >         x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > > >  #endif
> > > > +       /*
> > > > +        * Save/restore "sched" clock state even if kvmclock isn't 
> > > > being used
> > > > +        * for sched_clock, as kvmclock is still used for wallclock and 
> > > > relies
> > > > +        * on these hooks to re-enable kvmclock after suspend+resume.
> > >
> > > This is wrong, wallclock is a different MSR entirely.
> > >
> > > > +        */
> > > >         x86_platform.save_sched_clock_state = 
> > > > kvm_save_sched_clock_state;
> > > >         x86_platform.restore_sched_clock_state = 
> > > > kvm_restore_sched_clock_state;
> > >
> > > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't 
> > > being
> > > used as sched_clock, because when TSC is reset on suspend/hiberation, not 
> > > doing
> > > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> > >
> > > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't 
> > > use a PV
> > > sched_clock in CoCo guests with trusted TSC" because pulling the rug out 
> > > from
> > > under kvmclock leads to the same problem.
> > >
> > > The whole PV sched_clock scheme is a disaster.
> > >
> > > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC 
> > > callbacks,
> > > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock 
> > > for
> > > sched_clock.  And the code is all kinds of funky, because it tries to 
> > > keep the
> > > x86 code isolated from the generic HV clock code, but (a) there's already 
> > > x86 PV
> > > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting 
> > > the code
> > > means that Hyper-V overides the sched_clock save/restore hooks even when
> > > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
> >
> > Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
> > the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence 
> > so that
> > it's not built if targeting some other architecture. Or do you see 
> > something else
> > that is x86-specific?
> >
> > Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> > So the #else clause (where PARAVIRT=n) in that #ifdef sequence could 
> > arguably
> > have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks 
> > if
> > PARAVIRT is forced to "n". So I don't think there's a current problem with 
> > the
> > sched_clock save/restore hooks. i
> 
> Oh, there are no build issues, and all of the x86 bits are nicely cordoned 
> off.
> My complaint is essentially that they're _too_ isolated; putting the 
> sched_clock
> save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but 
> IMO
> it does more harm than good because the split makes it difficult to connect 
> the
> dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.
> 
> > But I would be good with some restructuring so that setting the sched clock
> > save/restore hooks is more closely tied to the sched clock choice,
> 
> Yeah, this is the intent of my ranting.  After the dust settles, the code can
> look like this.


I'm good with what you are proposing. And if you want, there's no real need
for hv_ref_counter_at_suspend and hv_save/restore_sched_clock_state()
to be in the #ifdef sequence since the code has no architecture dependencies.
Sure, the only current caller is x86-specific, but the functionality is generic
and might useful on some other architecture in the future. That was the
essence of my comment on the original patch that added this code [1].
The patch author took it a step further and moved all the code into an
x86-specific module, which I was OK with at the time. But your moving
it back is probably better.

[1] 
https://lore.kernel.org/all/sn6pr02mb4157141dd58fd6eae96c7cabd4...@sn6pr02mb4157.namprd02.prod.outlook.com/

> 
> ---
> #ifdef CONFIG_GENERIC_SCHED_CLOCK
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> {
>       /*
>        * We're on an architecture with generic sched clock (not x86/x64).
>        * The Hyper-V sched clock read function returns nanoseconds, not
>        * the normal 100ns units of the Hyper-V synthetic clock.
>        */
>       sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
> }
> #elif defined CONFIG_PARAVIRT
> static u64 hv_ref_counter_at_suspend;
> /*
>  * Hyper-V clock counter resets during hibernation. Save and restore clock
>  * offset during suspend/resume, while also considering the time passed
>  * before suspend. This is to make sure that sched_clock using hv tsc page
>  * based clocksource, proceeds from where it left off during suspend and
>  * it shows correct time for the timestamps of kernel messages after resume.
>  */
> static void hv_save_sched_clock_state(void)
> {
>       hv_ref_counter_at_suspend = hv_read_reference_counter();
> }
> 
> static void hv_restore_sched_clock_state(void)
> {
>       /*
>        * Adjust the offsets used by hv tsc clocksource to
>        * account for the time spent before hibernation.
>        * adjusted value = reference counter (time) at suspend
>        *                - reference counter (time) now.
>        */
>       hv_sched_clock_offset -= (hv_ref_counter_at_suspend -
> hv_read_reference_counter());
> }
> 
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> {
>       /* We're on x86/x64 *and* using PV ops */
>       paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
>                                hv_restore_sched_clock_state);
> }
> #else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
> static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
> #endif /* CONFIG_GENERIC_SCHED_CLOCK */
> ---
> 
> > as long as the architecture independence of hyperv_timer.c is preserved.
> 
> LOL, ah yes, the architecture independence of MSRs and TSC :-)

This is a digression from what you are trying to accomplish, but the function
and symbols names reflect history and are misleading. The terms "TSC" and
"MSR" are used, but arch-specific wrapper functions map "TSC" to the arm64
architectural system counter, for example. And the MSR references are all to
Hyper-V synthetic MSRs, which are mapped to the arm64 equivalent
registers and are accessed via explicit hypercalls instead of rdmsr()/wrmsr(). I
wish we had better terminology to use in the generic code.

> 
> Teasing aside, the code is firmly x86-only at the moment.  It's selectable 
> only
> by x86:
> 
>   config HYPERV_TIMER
>       def_bool HYPERV && X86
> 
> and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
> sched_clock()") there are references to symbols/functions that are provided 
> only
> by x86.
> 
> I assume arm64 support is a WIP, but keeping the upstream code arch 
> independent
> isn't very realistic if the code can't be at least compile-tested.  To help
> drive-by contributors like myself, maybe select HYPER_TIMER on arm64 for
> COMPILE_TEST=y builds?

Ah yes. I think I missed that commit e39acc37db34 added hv_raw_get_register(),
which doesn't have an arm64 equivalent. I had not previously known about
COMPILE_TEST=y. Thanks for pointing that out, and I'll check into using it.

Michael

> 
>   config HYPERV_TIMER
>       def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))
> 
> I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
> that is explicitly x86-only, but something along those lines would help people
> like me understand the goal/intent, and in theory would also help y'all 
> maintain
> the code by detecting breakage.

RE: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop

Reply via email to