On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote: > [[ moved to hackers@ from private mail. ]] > > On 5/28/13 1:13 PM, John Baldwin wrote: > > On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote: > >> On 5/28/13 9:04 AM, John Baldwin wrote: > >>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote: > >>>> Hey folks, > >>>> > >>>> I had a talk with Nathan Whitehorn about the camisr issue. The issue we > >>>> are seeing we mostly know, but to summarize, we are losing the camisr > >>>> signal and the camisr is not being run. > >>>> > >>>> I gave him a summary of what we have been seeing and pointed him to the > >>>> code I am concerned about here: > >>>> http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c). > >>>> > >>>> What I think that is happening is that the setting of it_need to 0 > >>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled > >>>> correctly and it is being delayed until AFTER the call to > >>>> ithread_execute_handlers() right below the atomic_store_rel_int(). > >>> This seems highly unlikely, to the extent that if this were true all our > >>> locking primitives would be broken. The store_rel is actually a release > >>> barrier which acts like more a read/write fence. No memory accesses > >>> (read or > >>> write) from before the release can be scheduled after the associated > >>> store, > >>> either by the compiler or CPU. That is what Konstantin is referring to > >>> in his > >>> commit when he says "release" semantics". > >> Yes, that makes sense, however does it specify that the writes *must* > >> occur at that *point*? If it only enforces ordering then we may have > >> some issue, specifically because the setting of it to '1' inside of > >> intr_event_schedule_thread has no barrier other than the acq semantics > >> of the thread lock. I am wondering what is forcing out the '1' there. > > Nothing ever forces writes. You would have to invalidate the cache to do > > that > > and that is horribly expensive. It is always only about ordering and > > knowing > > that if you can complete another operation on the same "cookie" variable > > with > > acquire semantics that earlier writes will be visible. > > By cookie, you mean a specific memory address, basically a lock? This is > starting to reinforce my suspicions as the setting of it_need is done > with release semantics, however the acq on the other CPU is done on the > thread lock. Maybe that is irrelevant. We will find out shortly. > > > > >> See below as I think we have proof that this is somehow happening. > > Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix > > is probably to stop using atomics on it_need and just grab the thread lock > > in the main ithread loop and hold it while checking and clearing it_need. > > > > OK, we have some code that will either prove this, or perturb the memory > ordering enough to make the bug go away, or prove this assertion wrong. > > We will update on our findings hopefully in the next few days.
IMO the read of it_need in the 'while (ithd->it_need)' should have acquire semantic, otherwise the future reads in the ithread_execute_handlers(), in particular, of the ih_need, could pass the read of it_need and cause the situation you reported. I do not see any acquire barrier between a condition in the while() statement and the read of ih_need in the execute_handlers(). It is probably true that the issue you see was caused by r236456, in the sense that implicitely locked xchgl instruction on x86 has a full barrier semantic. As result, the store_rel() was actually an acquire too, making this reordering impossible. I argue that this is not a bug in r236456, but the issue in the kern_intr.c. On the other hand, the John' suggestion to move the manipulations of it_need under the lock is probably the best anyway.
pgpL971MsQjuU.pgp
Description: PGP signature