On Sat, Jun 20, 2020 at 10:02:19PM +0200, Mark Kettenis wrote:
> RDTSC is not a serializing instruction; to make sure we get the TSC
> value corresponding to the position of RDTSC in the instruction stream
> we need a barrier. Linux uses LFENCE on machines where it is
> available. FreeBSD seems to prefer MFENCE for AMD CPUs but uses
> LFENCE for Intel CPUs. For now my thinking is that what's good enough
> for Linux should be good enough for us. And on amd64 LFENCE is always
> available.
>
> This diff reduces the scatter in the skew values. Before, I had
> occasional outliers of more than 200 cycles. Now the maximum values I
> see are around 60 cycles.
>
> I didn't change the rdtsc() call that reads the timecounter. But
> maybe that one should change as well? Bit of a tradeoff between
> performance and accuracy, I think.
>
> This also changes the skew print message (stolen from what Theo put in
> snaps). Printing the CPU number makes it easier to get statistics for
> a specific CPU. The diff also enables the debug message. Maybe it
> should be committed this way and then disabled again later so that we
> can get some statistics?
>
> comments? ok?

If you change the name to rdtsc_ordered(), OK.

By the way, if you want to continue in this direction you can look into
adding support for the TSC_ADJUST MSR to synchronize TSC across CPUs,
as described in Section 17.17.3 of the Intel manual.
> Index: arch/amd64/amd64/tsc.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 tsc.c
> --- arch/amd64/amd64/tsc.c 6 Apr 2020 00:01:08 -0000 1.16
> +++ arch/amd64/amd64/tsc.c 20 Jun 2020 20:01:46 -0000
> @@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
> int i;
>
> for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
> - tsc1 = rdtsc();
> + tsc1 = rdtsc_lfence();
> n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
> - tsc2 = rdtsc();
> + tsc2 = rdtsc_lfence();
>
> if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
> *count = n;
> @@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
> void
> tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
> {
> +#define TSC_DEBUG
> #ifdef TSC_DEBUG
> - printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
> + printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
> (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
> #endif
>
> @@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
>
> /* Flag it and read our TSC. */
> atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> - bptsc = (rdtsc() >> 1);
> + bptsc = (rdtsc_lfence() >> 1);
>
> /* Wait for remote to complete, and read ours again. */
> while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
> membar_consumer();
> - bptsc += (rdtsc() >> 1);
> + bptsc += (rdtsc_lfence() >> 1);
>
> /* Wait for the results to come in. */
> while (tsc_sync_cpu == ci)
> @@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
> /* Wait for go-ahead from primary. */
> while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
> membar_consumer();
> - tsc = (rdtsc() >> 1);
> + tsc = (rdtsc_lfence() >> 1);
>
> /* Instruct primary to read its counter. */
> atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> - tsc += (rdtsc() >> 1);
> + tsc += (rdtsc_lfence() >> 1);
>
> /* Post result. Ensure the whole value goes out atomically. */
> (void)atomic_swap_64(&tsc_sync_val, tsc);
> Index: arch/amd64/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.34
> diff -u -p -r1.34 cpufunc.h
> --- arch/amd64/include/cpufunc.h 28 Jun 2019 21:54:05 -0000 1.34
> +++ arch/amd64/include/cpufunc.h 20 Jun 2020 20:01:46 -0000
> @@ -292,6 +292,15 @@ rdtsc(void)
> }
>
> static __inline u_int64_t
> +rdtsc_lfence(void)
> +{
> + uint32_t hi, lo;
> +
> + __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> + return (((uint64_t)hi << 32) | (uint64_t) lo);
> +}
> +
> +static __inline u_int64_t
> rdpmc(u_int pmc)
> {
> uint32_t hi, lo;