----- Original Message -----
> From: "Peter Zijlstra" <pet...@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motoh...@jp.fujitsu.com>,
>     "Steven Rostedt" <rost...@goodmis.org>, "Paul E. McKenney" <paul...@linux.vnet.ibm.com>,
>     "Nicholas Miell" <nmi...@comcast.net>, "Linus Torvalds" <torva...@linux-foundation.org>,
>     "Ingo Molnar" <mi...@redhat.com>, "Alan Cox" <gno...@lxorguk.ukuu.org.uk>,
>     "Lai Jiangshan" <la...@cn.fujitsu.com>, "Stephen Hemminger" <step...@networkplumber.org>,
>     "Andrew Morton" <a...@linux-foundation.org>, "Josh Triplett" <j...@joshtriplett.org>,
>     "Thomas Gleixner" <t...@linutronix.de>, "David Howells" <dhowe...@redhat.com>,
>     "Nick Piggin" <npig...@kernel.dk>
> Sent: Monday, March 16, 2015 4:54:35 PM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
>
> On Mon, Mar 16, 2015 at 06:53:35PM +0000, Mathieu Desnoyers wrote:
> > > I'm not entirely awake atm but I'm not seeing why it would need to be
> > > that strict; I think the current single MB on task switch is sufficient
> > > because if we're in the middle of schedule, userspace isn't actually
> > > running.
> > >
> > > So from the point of userspace the task switch is atomic. Therefore even
> > > if we do not get a barrier before setting ->curr, the expedited thing
> > > missing us doesn't matter as userspace cannot observe the difference.
> >
> > AFAIU, atomicity is not what matters here. It's more about memory ordering.
> > What is guaranteeing that upon entry in kernel-space, all prior memory
> > accesses (loads and stores) are ordered prior to following loads/stores ?
> >
> > The same applies when returning to user-space: what is guaranteeing that
> > all prior loads/stores are ordered before the user-space loads/stores
> > performed after returning to user-space ?
>
> You're still one step ahead of me; why does this matter?
> Or put it another way; what can go wrong? By virtue of being in
> schedule() both tasks (prev and next) get an effective MB from the task
> switch.
>
> So even if we see the 'wrong' rq->curr, that CPU will still observe the
> MB by the time it gets to userspace.
>
> All of this is really only about userspace load/store ordering and the
> context switch already very much needs to guarantee userspace program
> order in the face of context switches.
Let's go through a memory ordering scenario to highlight my reasoning
there.

Let's consider the following memory barrier scenario performed in
user-space on an architecture with very relaxed ordering. PowerPC comes
to mind. https://lwn.net/Articles/573436/ scenario 12:

CPU 0                       CPU 1
CAO(x) = 1;                 r3 = CAO(y);
cmm_smp_wmb();              cmm_smp_rmb();
CAO(y) = 1;                 r4 = CAO(x);

BUG_ON(r3 == 1 && r4 == 0)

We tweak it to use sys_membarrier on CPU 1, and a simple compiler
barrier() on CPU 0:

CPU 0                       CPU 1
CAO(x) = 1;                 r3 = CAO(y);
barrier();                  sys_membarrier();
CAO(y) = 1;                 r4 = CAO(x);

BUG_ON(r3 == 1 && r4 == 0)

Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
stores, we have:

CPU 0                           CPU 1
CAO(x) = 1;
  [1st store is slow to
   reach other cores]
CAO(y) = 1;
  [2nd store reaches other
   cores more quickly]
[preempted]
                                r3 = CAO(y)  (may see y = 1)
                                sys_membarrier()
Scheduler changes rq->curr.
                                skips CPU 0, because rq->curr
                                has been updated.
                                [return to userspace]
                                r4 = CAO(x)  (may see x = 0)
                                BUG_ON(r3 == 1 && r4 == 0) -> fails
load_cr3, with implied
memory barrier, comes
after CPU 1 has read "x".

The only way to make this scenario work is if a memory barrier is added
before updating rq->curr. (We could also construct a similar scenario
for the barrier needed after the store to rq->curr.)

> > > > In order to be able to dereference rq->curr->mm without holding
> > > > the rq->lock, do you envision we should protect task reclaim
> > > > with RCU-sched ?
> > >
> > > A recent discussion had Linus suggest SLAB_DESTROY_BY_RCU, although
> > > I think Oleg did mention it would still be 'interesting'. I've not
> > > yet had time to really think about that.
> >
> > This might be an "interesting" modification. :) This could perhaps
> > come as an optimization later on ?
>
> Not really, again, take this for (;;) sys_membar(EXPEDITED) that'll
> generate horrendous rq lock contention, with or without the PRIVATE
> thing it'll pound a number of rq locks real bad.
> Typical scheduler syscalls only affect a single rq lock at a time --
> the one the task is on. This one potentially pounds all of them.

Would you see it as acceptable if we start by implementing only the
non-expedited sys_membarrier()? Then we can add the expedited-private
implementation after rq->curr becomes available through RCU.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com