The vsyscall based timekeeping interfaces for userspace provide the shortest
possible reader side blocking (readers are only blocked across the update of
the vsyscall gtod data structure), but the kernel side timekeeping readers
are blocked across the full code sequence of the update_wall_time() magic,
which can be rather "long" due to NTP, corner cases, etc.
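To illustrate why the vsyscall reader side blocks so briefly: the vdso readers use the classic seqcount retry pattern, where a writer bumps a sequence counter to an odd value before touching the data and to an even value afterwards, so readers only spin for the duration of the data copy. The following is a minimal userspace sketch of that pattern under simplifying assumptions; the names (`seq`, `time_ns`, `read_time`, `write_time`) are illustrative and not the kernel's, and the real kernel code uses its own barrier primitives rather than C11 atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative stand-ins for the vsyscall gtod data, not kernel names. */
static _Atomic unsigned int seq; /* odd while a writer is updating */
static uint64_t time_ns;         /* the protected payload */

static uint64_t read_time(void)
{
	unsigned int s;
	uint64_t t;

	do {
		/* Writer in progress (odd count): spin until it finishes. */
		while ((s = atomic_load_explicit(&seq,
						 memory_order_acquire)) & 1)
			;
		t = time_ns;
		/* Retry if a writer raced with us while we copied. */
	} while (atomic_load_explicit(&seq, memory_order_acquire) != s);
	return t;
}

static void write_time(uint64_t t)
{
	/* Make the count odd: readers now spin or will retry. */
	atomic_store_explicit(&seq, seq + 1, memory_order_relaxed);
	time_ns = t;
	/* Make the count even again: readers may proceed. */
	atomic_store_explicit(&seq, seq + 1, memory_order_release);
}
```

The point of the pattern is that readers are only held off for the duration of the two stores and the payload copy, which is exactly the "shortest possible reader side blocking" property described above.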
Eric did some work a few years ago to disentangle the seqcount write hold
from the spinlock which serializes the potential updaters of the kernel
internal timekeeper data. I couldn't be bothered to reread the old mail
thread and figure out why this got turned down, but I remember that there
were objections due to the potential inconsistency between calculation,
update and observation.

In hindsight that's nonsense, because even back at that time we did the
vsyscall update at the very last moment and unsynchronized to the in-kernel
data update. While we never got any complaints about that, there is a real
issue versus virtualization:

    VCPU0                                   VCPU1

    update_wall_time()
      write_seqlock_irqsave(&tk->lock, flags);
      ....

    Host schedules out VCPU0

    Arbitrary delay

    Host schedules in VCPU0
                                            __vdso_clock_gettime()#1
      update_vsyscall();
                                            __vdso_clock_gettime()#2

Depending on the length of the delay which kept VCPU0 away from executing,
and depending on the direction of the NTP update of the timekeeping
variables, __vdso_clock_gettime()#2 can observe time going backwards.

You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1 to
physical core 1. Now remove all load from physical core 1 except VCPU1,
put massive load on physical core 0 and make sure that the NTP adjustment
lowers the mult factor. It's extremely hard to reproduce, but it's
possible.

So this patch series is going to expose the same issue to the kernel side
timekeeping. I'm not too worried about that, because

 - it's extremely hard to trigger

 - we are aware of the issue vs. vsyscalls already

 - making the kernel behave the same way as the vsyscall does not make
   things worse

 - John Stultz already has an idea how to fix it. See
   https://lkml.org/lkml/2013/2/19/569

Though that's not in the scope of this patch series, I want to make sure
that it's documented.

Now the obvious question whether this is worth the trouble can be answered
easily:
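The backwards-time observation can be shown with a toy model of the conversion the readers do: time = base_ns + (cycles - cycle_last) * mult. The writer snapshots the clocksource and computes new (cycle_last, base_ns, mult) values, but if the publish via update_vsyscall() is delayed (the scheduled-out VCPU0 above) while NTP has lowered mult, a read just before the publish can return a larger value than a read just after it. This is a deliberately simplified sketch: the struct and field names are illustrative, and the shift factor of the real cycles-to-ns conversion is omitted.

```c
#include <stdint.h>

/* Toy model of the per-update conversion data; illustrative names only. */
struct gtod_data {
	uint64_t cycle_last; /* clocksource value at last update */
	uint64_t base_ns;    /* nanoseconds accumulated at cycle_last */
	uint32_t mult;       /* cycles -> ns multiplier (shift omitted) */
};

static uint64_t read_ns(const struct gtod_data *d, uint64_t cycles)
{
	return d->base_ns + (cycles - d->cycle_last) * d->mult;
}
```

With made-up numbers: old data {cycle_last=0, base_ns=1000, mult=10}; the writer snapshots at cycle 500 and computes new data {cycle_last=500, base_ns=6000, mult=9}, but only publishes it at cycle ~1450. A read at cycle 1400 against the stale old data yields 1000 + 1400*10 = 15000 ns, while a later read at cycle 1450 against the new data yields 6000 + 950*9 = 14550 ns: time appears to jump backwards, even though the clocksource cycles advanced. The longer the publish delay and the larger the mult reduction, the bigger the jump.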
Preempt-RT users and HPC folks have complained about the long write hold
time of the timekeeping seqcount for years, and a quick test on a
preempt-RT enabled kernel shows that this series lowers the maximum
latency on the non-timekeeping cores from 8 to 4 microseconds. That's a
whopping factor of 2. Definitely worth the trouble!

Thanks,

	tglx

---
 include/linux/jiffies.h             |    1
 include/linux/timekeeper_internal.h |    4
 kernel/time/tick-internal.h         |    2
 kernel/time/timekeeping.c           |  176 +++++++++++++++++++++---------------
 4 files changed, 107 insertions(+), 76 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/