On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski <l...@amacapital.net> wrote:
>
>
>
> > On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevi...@redhat.com> wrote:
> >
> > On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> >>> On 08/12/20 22:20, Thomas Gleixner wrote:
> >>> So now life migration comes a long time after timekeeping had set the
> >>> limits and just because it's virt it expects that everything works and it
> >>> just can ignore these limits.
> >>>
> >>> TBH. That's not any different than SMM or hard/firmware taking the
> >>> machine out for lunch. It's exactly the same: It's broken.
> >>
> >> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.
> >>
> >> Sure, there's the case of snapshotting the VM over the weekend.  My
> >> favorite solution would be to just put it in S3 before doing that.  *Do
> >> what bare metal does* and you can't go that wrong.
> >
> > Note though that qemu has a couple of issues with s3, and it is disabled
> > by default in libvirt.
> > I would be very happy to work on improving this if there is a need for that.
>
> There’s also the case where someone has a VM running on a laptop and someone 
> closes the lid. The host QEMU might not have a chance to convince the guest 
> to enter S3.
>
> >
> >
> >>
> >> In general it's userspace policy whether to keep the TSC value the same
> >> across live migration.  There's pros and cons to both approaches, so KVM
> >> should provide the functionality to keep the TSC running (which the
> >> guest will see as a very long, but not extreme SMI), and this is what
> >> this series does.  Maxim will change it to operate per-VM.  Thanks
> >> Thomas, Oliver and everyone else for the input.

So, to be clear, this per-VM ioctl will work something like the following:

static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc)
{
        return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset;
}

case KVM_GET_TSC_BASE:
        struct kvm_tsc_base base = {
                .flags = KVM_TSC_BASE_TIMESTAMP_VALID;
        };
        u64 host_tsc;

        kvm_get_walltime(&base.nsec, &host_tsc);
        base.tsc = kvm_read_tsc_base(kvm, host_tsc);

        copy_to_user(...);

[...]

case KVM_SET_TSC_BASE:
        struct kvm_tsc_base base;
        u64 host_tsc, nsec;
        s64 delta = 0;

        copy_from_user(...);

        kvm_get_walltime(&nsec, &host_tsc);
        delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc);

        if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) {
                u64 delta_nsec = nsec - base.nsec;

                if (delta_nsec > 0)
                        delta += nsec_to_cycles(kvm, diff);
                else
                        delta -= nsec_to_cycles(kvm, -diff);
        }

        kvm->host_tsc_offset += delta;
        /* plumb host_tsc_offset through to each vcpu */

However, I don't believe we can assume the guest's TSCs to be synchronized,
even if sane guests will never touch them. In this case, I think a per-vCPU
ioctl is still warranted, allowing userspace to get at the guest CPU adjust
component of Thomas' equation below (paraphrased):

        TSC guest CPU = host tsc base + guest base offset + guest CPU adjust

Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with
KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be
problematic for a couple reasons. First, depending on the guest's CPUID the
TSC_ADJUST MSR may not even be available, meaning that the guest could've used
IA32_TSC to adjust the TSC (eww). Second, userspace replaying writes to IA32_TSC
(in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_
unlikely to work given all the magic handling that KVM does for
writes to it.

Is this roughly where we are or have I entirely missed the mark? :-)

--
Thanks,
Oliver

> >
> > I agree with that.
> >
> > I still think though that we should have a discussion on feasibility
> > of making the kernel time code deal with large *forward* tsc jumps
> > without crashing.
> >
> > If that is indeed hard to do, or will cause performance issues,
> > then I agree that we might indeed inform the guest of time jumps instead.
> >
>
> Tglx, even without fancy shared host/guest timekeeping, count the guest 
> kernel manage to update its timekeeping if the host sent the guest an 
> interrupt or NMI on all CPUs synchronously on resume?
>
> Alternatively, if we had the explicit “max TSC value that makes sense right 
> now” in the timekeeping data, the guest would reliably notice the large jump 
> and could at least do something intelligent about it instead of overflowing 
> its internal calculation.

Reply via email to