On Thu, 26 Jul 2012, Konstantin Belousov wrote:
On Thu, Jul 26, 2012 at 05:35:23PM +1000, Bruce Evans wrote:
In fact, there is always a fully documented serializing instruction
for syscalls, except maybe in FreeBSD-1 compat code on i386, at
least on Athlon64.  i386 syscalls use int 0x80 (except in FreeBSD-1
compat code, which uses lcalls), and the iret necessary to return from
these is serializing, at least on Athlon64.  amd64 syscalls use
syscall/sysret.  sysret isn't serializing (like far returns), at least
on Athlon64, but at least in FreeBSD, the syscall implementation uses
at least 2 swapgs's (one on entry and one just before the sysret), and
swapgs is serializing, at least on Athlon64.
Yes, SWAPGS is not documented as serializing on Intels.

Isn't that too incompatible?

I reviewed the whole syscall sequence for e.g. gettimeofday(2), and
there is no serialization point on the fast path.  E.g. an AST would
add locking and thus serialization, as would a return by IRET, but the
fast path on amd64 has no such things.
This function was moved around from time to time and now it sits here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/x86/vdso/vclock_gettime.c
It still carries one barrier before rdtsc. Please see the comments.
For safety, you probably need to use the slowest (cpuid) method. Linux
seems to be just using fences that are observed to work.
No, the vendor documentation explicitly mentions the recommended
barriers: LFENCE for Intel and MFENCE for AMD.  My patch just follows
what the documentation suggests.
But you say later that CPUID is needed (instead of just a locked
instruction?).  The original Athlon64 manual doesn't seem to mention
MFENCE for RDTSC.  Maybe later manuals clarify that MFENCE works on old
CPUs too.
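For concreteness, a minimal userland sketch of the fenced read being
discussed, assuming GCC/Clang-style inline asm on x86; which barrier to
use (LFENCE, MFENCE or CPUID) is exactly what is debated above, so the
LFENCE here is illustrative only:

	#include <stdint.h>

	static inline uint64_t
	rdtsc_fenced(void)
	{
		uint32_t lo, hi;

		/* LFENCE as recommended for Intel; AMD documents MFENCE. */
		__asm __volatile("lfence" : : : "memory");
		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32 | lo);
	}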
[Replying to other mail in-place, the thread goes wild]
Too much quoting :-).
On Thu, Jul 26, 2012 at 04:25:01PM +1000, Bruce Evans wrote:
...
For the threaded case, there has to be something for the accesses to be
provably ordered.  It is hard to see how the something can be strong
enough unless it serializes all thread state in A and B.  The rdtsc
state is not part of the thread state as known to APIs, but it is hard
to see how threads can serialize themselves without also serializing
the TSC.
The TSC timer read is not synchronized, and I found the Linux test for
the thing I described above.  An adapted version is available at
http://people.freebsd.org/~kib/misc/time-warp-test.c.
It must be compiled in 32-bit mode only.
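Roughly, such a test has several threads hammer RDTSC and compare each
new reading against the last value seen; a simplified sketch of the core
check (hypothetical, not the actual program at the URL above, which also
takes a lock around the comparison):

	#include <stdint.h>

	/* Simplified sketch of a warp check; not the real time-warp-test code. */
	static inline uint64_t
	rdtsc(void)
	{
		uint32_t lo, hi;

		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32 | lo);
	}

	static uint64_t last_tsc;	/* shared among the test threads */
	static long warps;

	static void
	check_tsc(void)
	{
		uint64_t prev, now;

		prev = last_tsc;	/* last value seen by any thread */
		now = rdtsc();		/* optionally preceded by CPUID/LFENCE */
		if (now < prev)
			warps++;	/* the TSC appeared to step backwards */
		else
			last_tsc = now;
	}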
My point is that it will normally be synchronized by whatever the
threads do to provide synchronization for themselves.  Only the case of
a single thread doing sequential timer reads should expect the reads to
be monotonic without any explicit synchronization.  I hope this case
doesn't require stalling everything in low-level code.
On my Nehalem workstation, I get an enormous number of wraps reported
for RDTSC without CPUID.  Adding CPUID back fixes the issue.  So at
least on Nehalems (and probably Westmere, I will test later today)
RDTSC can even pass LOCKed instructions.
Oh, you mean with the test program: it needs CPUID because it only has
locks and no fences, and its CPUID is commented out.
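The CPUID variant referred to here would look something like the
following sketch (again assuming GCC/Clang inline asm, not code taken
from the test program); CPUID is architecturally serializing, which is
presumably why it closes the window that LOCKed instructions do not:

	#include <stdint.h>

	static inline uint64_t
	rdtsc_serialized(void)
	{
		uint32_t eax = 0, ebx, ecx, edx, lo, hi;

		/* CPUID serializes the instruction stream before RDTSC. */
		__asm __volatile("cpuid"
		    : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
		    : : "memory");
		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32 | lo);
	}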
Curiously enough, SandyBridge is sane and reports zero wraps; it seems
Intel fixed the bug.
The original Athlon64 manual doesn't seem to mention locks being sufficient
any more than it mentions fences.
I care about timestamps being ordered more than most people, and tried
to kill the get*time() APIs because they are weakly ordered relative
to the non-get variants (they return times in the past, and there is
no way to round down to get consistent times). I tried to fix them
by adding locking and updating them to the latest time whenever a
non-get variant gives a later time (by being used).  This was too slow,
and broke the design criterion that timecounter calls should not use
any explicit locking.  However, if you want slowness, then you can get
it similarly by fixing the monotonicity of rdtsc in software.  I think
I just figured out how to do this with the same slowness as
serialization, if a locked instruction serializes; maybe less otherwise:
	spin:
		ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
		tsc = rdtsc();		/* only 32 bits for timecounters */
		if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
			/*
			 * It went backwards, or stopped.  Could handle more
			 * completely, starting with panic() to see if this
			 * happens at all.
			 */
			return (ptsc);	/* stopped is better than backwards */
		}
		/* Usual case; update (32 bits). */
		if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
			return (tsc);
		goto spin;
I do not understand this.  The algorithm is clear, but what you propose
is very heavyweight compared with just adding an LFENCE or MFENCE before
rdtsc.  First, the cache line for prev_tsc becomes heavily contended.
Second, CAS is expensive.  LFENCE is fully local to the core it executes
on.
I expect the contention to be rare, but then optimization isn't important
either.
But if the problem is fully local, as it apparently must be if fences
fix it, then prev_tsc can be per-CPU, with a non-atomic cmpset to access
it (a rough sketch follows below).  We don't care if rdtsc gives an old
value due to some delay in
copying the result to EDX:EAX any more than we care about an old value
due to being interrupted. The case where we are interrupted, and
context-switched, and come back on a different CPU, is especially
interesting. Then we may have an old tsc value from another CPU.
Sometimes we detect that it is old for the new CPU, sometimes not.
There is a problem in theory but I think none in practice.  The switch
could set a flag to tell us to loop (set prev_tsc to a sentinel value),
and it accidentally already does, except with serious out-of-orderness:
switches happen to always read the TSC if the TSC is usable, to maintain
the pcpu switchtime variable.  The read for this will give a tsc value
for the new CPU that is in advance of the old one, provided all the TSCs
are in sync and there is sufficient monotonicity across CPUs.  Then the
loop will see that its tsc is old, and repeat.
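A rough sketch of that per-CPU variant, assuming the non-monotonicity
really is core-local and that a context switch refreshes the per-CPU
value as described above; the names are made up and races with
interrupts are ignored, so this is illustrative only:

	/* Hypothetical per-CPU variant: no LOCKed instruction, no shared line. */
	static uint32_t prev_tsc_pcpu[MAXCPU];	/* would really be a PCPU field */

	static uint32_t
	tsc_monotonic(void)
	{
		uint32_t ptsc, tsc;
		u_int cpu;

		critical_enter();		/* stay on this CPU */
		cpu = curcpu;
		ptsc = prev_tsc_pcpu[cpu];
		tsc = (uint32_t)rdtsc();	/* low 32 bits, as timecounters use */
		if (tsc > ptsc)			/* wrap ignored, as in the sketch above */
			prev_tsc_pcpu[cpu] = tsc;
		else
			tsc = ptsc;		/* stopped is better than backwards */
		critical_exit();
		return (tsc);
	}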
I am only half serious in proposing the above. If you want to be slow
then you can do useful work like ensuring monotonicity across all CPUs
in much the same time that is wasted by stalling the hardware until
everything is serialized.
Bruce