On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski <l...@amacapital.net> wrote: > > > > > On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevi...@redhat.com> wrote: > > > > On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote: > >>> On 08/12/20 22:20, Thomas Gleixner wrote: > >>> So now life migration comes a long time after timekeeping had set the > >>> limits and just because it's virt it expects that everything works and it > >>> just can ignore these limits. > >>> > >>> TBH. That's not any different than SMM or hard/firmware taking the > >>> machine out for lunch. It's exactly the same: It's broken. > >> > >> I agree. If *live* migration stops the VM for 200 seconds, it's broken. > >> > >> Sure, there's the case of snapshotting the VM over the weekend. My > >> favorite solution would be to just put it in S3 before doing that. *Do > >> what bare metal does* and you can't go that wrong. > > > > Note though that qemu has a couple of issues with s3, and it is disabled > > by default in libvirt. > > I would be very happy to work on improving this if there is a need for that. > > There’s also the case where someone has a VM running on a laptop and someone > closes the lid. The host QEMU might not have a chance to convince the guest > to enter S3. > > > > > > >> > >> In general it's userspace policy whether to keep the TSC value the same > >> across live migration. There's pros and cons to both approaches, so KVM > >> should provide the functionality to keep the TSC running (which the > >> guest will see as a very long, but not extreme SMI), and this is what > >> this series does. Maxim will change it to operate per-VM. Thanks > >> Thomas, Oliver and everyone else for the input.
So, to be clear, this per-VM ioctl will work something like the following: static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc) { return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset; } case KVM_GET_TSC_BASE: struct kvm_tsc_base base = { .flags = KVM_TSC_BASE_TIMESTAMP_VALID; }; u64 host_tsc; kvm_get_walltime(&base.nsec, &host_tsc); base.tsc = kvm_read_tsc_base(kvm, host_tsc); copy_to_user(...); [...] case KVM_SET_TSC_BASE: struct kvm_tsc_base base; u64 host_tsc, nsec; s64 delta = 0; copy_from_user(...); kvm_get_walltime(&nsec, &host_tsc); delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc); if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) { u64 delta_nsec = nsec - base.nsec; if (delta_nsec > 0) delta += nsec_to_cycles(kvm, diff); else delta -= nsec_to_cycles(kvm, -diff); } kvm->host_tsc_offset += delta; /* plumb host_tsc_offset through to each vcpu */ However, I don't believe we can assume the guest's TSCs to be synchronized, even if sane guests will never touch them. In this case, I think a per-vCPU ioctl is still warranted, allowing userspace to get at the guest CPU adjust component of Thomas' equation below (paraphrased): TSC guest CPU = host tsc base + guest base offset + guest CPU adjust Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be problematic for a couple reasons. First, depending on the guest's CPUID the TSC_ADJUST MSR may not even be available, meaning that the guest could've used IA32_TSC to adjust the TSC (eww). Second, userspace replaying writes to IA32_TSC (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_ unlikely to work given all the magic handling that KVM does for writes to it. Is this roughly where we are or have I entirely missed the mark? :-) -- Thanks, Oliver > > > > I agree with that. > > > > I still think though that we should have a discussion on feasibility > > of making the kernel time code deal with large *forward* tsc jumps > > without crashing. > > > > If that is indeed hard to do, or will cause performance issues, > > then I agree that we might indeed inform the guest of time jumps instead. > > > > Tglx, even without fancy shared host/guest timekeeping, count the guest > kernel manage to update its timekeeping if the host sent the guest an > interrupt or NMI on all CPUs synchronously on resume? > > Alternatively, if we had the explicit “max TSC value that makes sense right > now” in the timekeeping data, the guest would reliably notice the large jump > and could at least do something intelligent about it instead of overflowing > its internal calculation.