On 3/11/20 9:16 PM, Mateusz Guzik wrote:
On 3/10/20, Konstantin Belousov <kostik...@gmail.com> wrote:
On Tue, Mar 10, 2020 at 12:22:09AM +0100, Mateusz Guzik wrote:
On 3/9/20, Konstantin Belousov <kostik...@gmail.com> wrote:
On Mon, Mar 09, 2020 at 01:56:17AM +0100, Mateusz Guzik wrote:
On 3/8/20, Mateusz Guzik <m...@freebsd.org> wrote:
Author: mjg
Date: Sun Mar 8 00:22:32 2020
New Revision: 358733
URL: https://svnweb.freebsd.org/changeset/base/358733
Log:
seqc: tidy up
- avoid hand-rolled read
- match begin/end in terms of fence style
There were off-list questions about this, so let me clarify.
The first bit is a cosmetic change, but the second one is not.
Modified:
head/sys/sys/seqc.h
Modified: head/sys/sys/seqc.h
==============================================================================
--- head/sys/sys/seqc.h Sat Mar 7 15:37:23 2020 (r358732)
+++ head/sys/sys/seqc.h Sun Mar 8 00:22:32 2020 (r358733)
@@ -66,7 +66,8 @@ static __inline void
seqc_write_end(seqc_t *seqcp)
{
- atomic_store_rel_int(seqcp, *seqcp + 1);
+ atomic_thread_fence_rel();
+ *seqcp += 1;
MPASS(!seqc_in_modify(*seqcp));
critical_exit();
}
For correct operation the counter has to be modified *before* any work
is done and *after* it is completed.
Consider a trivial case:
seqc++;
foo[0] = 0;
foo[1] = 1;
seqc++;
There are 2 ways in which this can be mucked with:
- the compiler can reorder or rework the increments (e.g., collapse them
into a single += 2 instead); a compiler barrier prevents that (see the
sketch below).
- even with generated machine code which increments in the correct
places, the cpu can reorder the operations (how much is heavily
dependent on the architecture); in particular it could end up issuing
the second seqc++ before foo[1] = 1, which would defeat the point. This
is where release fences come in.
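To illustrate the first point (a sketch, not actual generated code):
with no barrier in between, nothing stops the compiler from turning the
above into

foo[0] = 0;
foo[1] = 1;
seqc += 2; /* readers can never observe the odd "in modify" value */

at which point readers have no way of telling that foo was ever in flux.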
With the patched code there is:
seqc++;
atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
starts getting modified */
foo[0] = 0;
foo[1] = 1;
atomic_thread_fence_rel(); /* make sure modifications to foo don't
leak past this spot */
seqc++;
In comparison, the previous code was:
seqc++;
atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
starts getting modified */
foo[0] = 0;
foo[1] = 1;
atomic_store_rel_int(seqcp, *seqcp + 1); /* make sure modifications to
foo don't leak past this spot and update seqc immediately */
There are 2 differences here -- one is an improvement and the second one
does not matter:
the win is: the previous code forces the compiler to compute an
incremented value and then store it. On amd64 this translates to the
following (with rdx holding the address of the counter):
mov (%rdx),%eax /* load the value */
add $0x1,%eax /* add 1 */
mov %eax,(%rdx) /* store it */
On patched kernel this is:
addl $0x1,(%rdx)
which is clearly much nicer.
I am not sure that the new code on amd64 is much nicer.
But the point was that on non-amd64, i.e. armv8 and potentially on
powerpc 3.0, the patch replaces the release store with a full barrier.
The latter is (much) slower.
If an arch performs something significantly more expensive here it's
probably a performance bug in the port.
It is a question of how much hardware support there is for fine-grained
fences. Relaxed architectures already add some optimized operations, but
have not (yet) implemented the complete optimized C11 set.
I think they will eventually converge.
Until they do, I think it is more important not to pessimize to-be
Tier 1 arches than to collapse three fast integer instructions on amd64
into one rw op.
But perhaps more importantly, there are significantly more frequent
users of atomic_thread_fence_rel, like refcount(9). If the issue is
real, that should be looked at.
To remove the release fence from refcount(9), we would need
atomic_fetchadd_rel_int().
To reiterate, if atomic_thread_fence_rel is indeed very expensive on
certain architectures compared to other _rel variants, the kernel is
already heavily pessimized because of the use of said standalone fence
in refcount(9) and a few other places, and this needs to be fixed.
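For illustration, the pattern in question is roughly the following (a
sketch of the idea, not the exact refcount(9) code;
atomic_fetchadd_rel_int is the so-far-missing primitive mentioned
above):

u_int old;

/* today: standalone release fence followed by a relaxed fetchadd */
atomic_thread_fence_rel();
old = atomic_fetchadd_int(&refcnt, -1);

/* with the missing _rel variant the release semantics could be folded
 * into the RMW op itself on archs which have a suitable instruction */
old = atomic_fetchadd_rel_int(&refcnt, -1);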
Just to be clear, this is not just a matter of whether our release
fences are poorly or inefficiently implemented. Release fences (as
defined by the C/C++ standards that inspire our atomics) have
*different* semantics than release stores. Specifically, a release fence
places more ordering restrictions on the nearby store accesses than a
release store does. (In fact, "man 9 atomic" describes this.) Thus, it
is almost inevitable that a release fence will have a higher cost than
the best possible implementation of a release store.
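To make the difference concrete (illustrative snippet only):

/* release store: earlier accesses cannot be reordered past the store
 * to seqc itself, but later stores may still move before it */
foo[0] = 0;
atomic_store_rel_int(&seqc, seqc + 1);
foo[1] = 1; /* may become visible before the seqc update */

/* release fence: earlier accesses are ordered before *all* later
 * stores, which is what seqc_write_begin() depends on */
seqc++;
atomic_thread_fence_rel();
foo[0] = 0; /* cannot become visible before the seqc update */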
If the cost can't be reduced (e.g., perhaps this can do a no-op store
+ fence somewhere akin to how the lock xadd 0,(%esp) trickery used to
work), then the atomic APIs do indeed need to be completed w.r.t.
adding all the missing acq/rel variants (that's at least fetchadd and
swap).
We can also extend the API to do the relevant addition/subtraction
without guaranteeing atomicity vs concurrent updates, but while
providing the relevant fences. This would preserve the intended effect
of this commit while not interfering with others.
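Something along these lines (hypothetical name, just to show the shape):

/* hypothetical, not in the tree: no atomicity vs concurrent writers,
 * only the release ordering this commit is after */
static __inline void
store_add_rel_int(volatile u_int *p, u_int v)
{
	atomic_thread_fence_rel();
	*p += v;
}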
I can add the missing amd64 bits if other people cover their
architectures, but otherwise I'm not interested in doing anything
here.
seqc itself, apart from being a minor consumer compared to refcount,
was already using the standalone fence even prior to this change, thus
I don't think there is any rush to do anything with regard to this
particular commit.
Indeed, it was. It was using a release fence where it was semantically
required to do so for *correctness*. Specifically, where the extra
ordering restrictions of a release fence were needed. But, it was also
using the simpler release store where the extra ordering restrictions
were not needed.
the not a problem: the code does not issue the release fence after
incrementing the counter the second time, meaning that in principle it
can happen arbitrarily late, possibly disturbing code which waits for
the counter to become even. This is not a problem because whoever
modifies it is supposed to hold a lock and release it shortly after,
which also posts a release fence and consequently makes sure the
counter update is visible. Also note these routines come with
preemption disablement around the modification -- if the thread gets
preempted as a result of critical_exit from seqc_write_end, it will
also publish the updated counter. If the updater never does anything
which flushes the store buffer, then in the absolute worst case they
will get preempted, at which point once more we are covered. Note that
code which wants more deterministic behavior will want to use
seqc_read_any(), test the counter once and, if it is caught uneven,
fall back to locking.
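Roughly like this ("obj" and its lock are made-up names):

seqc_t seqc;

seqc = seqc_read_any(&obj->seqc);
if (seqc_in_modify(seqc)) {
	/* a writer is in progress; take the lock instead of spinning */
	OBJ_LOCK(obj);
	/* ... read the protected fields under the lock ... */
	OBJ_UNLOCK(obj);
} else {
	/* ... lockless read, re-checking the counter afterwards ... */
}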
@@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
seqc_t ret;
for (;;) {
- ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
+ ret = seqc_read_any(seqcp);
if (__predict_false(seqc_in_modify(ret))) {
cpu_spinwait();
continue;
--
Mateusz Guzik <mjguzik gmail.com>