On Mon, Sep 17, 2012 at 5:20 PM, John Stultz <john.stu...@linaro.org> wrote: > On 09/17/2012 04:49 PM, Andy Lutomirski wrote: >> >> On Mon, Sep 17, 2012 at 3:04 PM, John Stultz <john.stu...@linaro.org> >> wrote:
>>> >>> Anyway, I'd greatly appreciate any thoughts or feedback on this >>> approach. >> >> I haven't looked in any great detail, but the approach looks sensible >> and should slow down the vsyscall code. > > > Did you mean "shouldn't"? Or are you concerned about something > specifically? > > I know you've done quite a bit of tuning on the x86 vdso side, and I don't > want to wreck that, so I'd appreciate any specific thoughts you have if you > get a chance to look at it. I meant "shouldn't". The thing I use for testing is: https://gitorious.org/linux-test-utils/linux-clock-tests > > >> That being said, as long as you're playing with this, here are a >> couple thoughts: >> >> 1. The TSC-reading code does this: >> >> ret = (cycle_t)vget_cycles(); >> >> last = VVAR(vsyscall_gtod_data).clock.cycle_last; >> >> if (likely(ret >= last)) >> return ret; >> >> I haven't specifically benchmarked the cost of that branch, but I >> suspect it's a fairly large fraction of the total cost of >> vclock_gettime. IIUC, the point is that there might be a few cycles >> worth of clock skew even on systems with otherwise usable TSCs, and we >> don't want a different CPU to return complete garbage if the cycle >> count is just below cycle_last. >> >> A different formulation would avoid the problem: set cycle_last to, >> say, 100ms *before* the time of the last update_vsyscall, and adjust >> the wall_time, etc variables accordingly. That way a few cycles (or >> anything up to 100ms) or skew won't cause an overflow. Then you could >> kill that branch. > > Interesting. So I want to keep the scope of this patch set in check, so I'd > probably hold off on something like this till later, but this might not be > too complicated to do in the update_wall_time() function, basically delaying > the accumulation by some amount of time Although this would have odd > effects on things like filesystem timestamps, which provide "the time at the > last tick", which would then be "the time at the last tick + Xms". So it > probably needs more careful consideration. Fair enough. One way to do this cleanly might be to have helpers that effectively read CLOCK_REALTIME_COARSE and use those instead of xtime. > > > > >> 2. There's nothing vsyscall-specific about the code in >> vclock_gettime.c. In fact, the VVAR macro should work just fine in >> kernel code. If you moved all this code into a header, then in-kernel >> uses could use it, and maybe even other arches could use it. Last >> time I checked, it seemed like vclock_gettime was considerably faster >> than whatever the in-kernel equivalent did. > > I like the idea of unifying the implementations, but I'd want to know more > about why vclock_gettime was faster then the in-kernel getnstimeofday(), > since it might be due to the more limited locking (we only update vsyscall > data under the vsyscall lock, where as the timekeeper lock is held for the > entire execution of update_wall_time()), or some of the optimizations in the > vsyscall code is focused on providing timespecs to userland, where as > in-kernel we also have to provide ktime_ts. > > If you have any details here, I'd love to hear more. There is some work I > have planned to address some of these differences, but I'm not sure when > I'll eventually get to it. > I suspect it's because everyone who's benchmarked and tuned the "what time is it?" code has focused on the vdso version, so the in-kernel version hasn't gotten much love. The vclock_gettime code has basically no abstractions and uses only direct (i.e. explicitly devirtualized) calls. ktime_get, OTOH, uses clock->read, which, in the best case, is read_tsc. read_tsc (in arch/x86/kernel/tsc.c) is missing the don't-use-cmov optimization. The vclock_gettime code also produces a struct timespec without div/mod, which is probably impossible for any code that goes through ktime. In the other direction, AFAICS the in-kernel code is completely missing the rdtsc_barrier call, which is empirically rather necessary. It's pretty easy to detect clock skew without it. > > >> 3. The mul_u64_u32_shr function [1] might show up soon, and it would >> potentially allow much longer intervals between timekeeping updates. >> I'm not sure whether the sub-ns formuation would still work, though (I >> suspect it would with some care). >> > Yea, we already have a number of calculations to try to maximize the > interval while still allowing for fine tuned adjustments. Even so, we are > somewhat bound by the need to provide tick-granular timestamps for > filesystem code, etc. So it may be limited to extending idle times out > until folks are ok with either going with most expensive clock-granular > timestamps for filesystem code, or loosing the granularity needs somewhat. Maybe this could depend on which clocksource is in use. On recent Intel boxen, the entire vclock_gettime call takes under 20 ns. When called from the kernel, it could be even faster if static keys (or whatever they're called these days) were used. FWIW, I don't understand the NTP code at all. I'm pretty sure I know what a PLL is, and I don't understand how that has anything to do with NTP (although it makes sense in a PPS context). I've always wondered why the kernel's idea of time was any more complicated than "here's the clock frequency; in addition, slew the time by X ns (or sub-ns) over the next Y ns" -- some driver or even userspace code could program that and the kernel could wake up when it needed to for the update. (To make this seamless, vclock_gettime would either need a branch or the kernel would need to wake up a little early and update the vsyscall data.) --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/