On Wed, May 29, 2013 at 12:27:46AM -0700, Alfred Perlstein wrote:
> On 5/29/13 12:16 AM, Konstantin Belousov wrote:
> > On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote:
> >> On 5/28/13 10:08 PM, Konstantin Belousov wrote:
> >>> On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
> >>>> [[ moved to hackers@ from private mail. ]]
> >>>>
> >>>> On 5/28/13 1:13 PM, John Baldwin wrote:
> >>>>> On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
> >>>>>> On 5/28/13 9:04 AM, John Baldwin wrote:
> >>>>>>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
> >>>>>>>> Hey folks,
> >>>>>>>>
> >>>>>>>> I had a talk with Nathan Whitehorn about the camisr issue. The
> >>>>>>>> issue we are seeing we mostly know, but to summarize, we are
> >>>>>>>> losing the camisr signal and the camisr is not being run.
> >>>>>>>>
> >>>>>>>> I gave him a summary of what we have been seeing and pointed
> >>>>>>>> him to the code I am concerned about here:
> >>>>>>>> http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c).
> >>>>>>>>
> >>>>>>>> What I think is happening is that the setting of it_need to 0
> >>>>>>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being
> >>>>>>>> scheduled correctly and is being delayed until AFTER the call
> >>>>>>>> to ithread_execute_handlers() right below the
> >>>>>>>> atomic_store_rel_int().
> >>>>>>> This seems highly unlikely, to the extent that if this were
> >>>>>>> true, all our locking primitives would be broken. The store_rel
> >>>>>>> is actually a release barrier, which acts more like a
> >>>>>>> read/write fence. No memory accesses (read or write) from
> >>>>>>> before the release can be scheduled after the associated store,
> >>>>>>> either by the compiler or the CPU. That is what Konstantin is
> >>>>>>> referring to in his commit when he says "release semantics".
> >>>>>> Yes, that makes sense, however does it specify that the writes
> >>>>>> *must* occur at that *point*? If it only enforces ordering then
> >>>>>> we may have some issue, specifically because the setting of it
> >>>>>> to '1' inside of intr_event_schedule_thread has no barrier other
> >>>>>> than the acq semantics of the thread lock. I am wondering what
> >>>>>> is forcing out the '1' there.
> >>>>> Nothing ever forces writes. You would have to invalidate the
> >>>>> cache to do that, and that is horribly expensive. It is always
> >>>>> only about ordering and knowing that if you can complete another
> >>>>> operation on the same "cookie" variable with acquire semantics,
> >>>>> then earlier writes will be visible.
> >>>> By cookie, you mean a specific memory address, basically a lock?
> >>>> This is starting to reinforce my suspicions, as the setting of
> >>>> it_need is done with release semantics, however the acq on the
> >>>> other CPU is done on the thread lock. Maybe that is irrelevant.
> >>>> We will find out shortly.
> >>>>
> >>>>>> See below, as I think we have proof that this is somehow
> >>>>>> happening.
> >>>>> Having ih_need of 1 and it_need of 0 is certainly busted. The
> >>>>> simplest fix is probably to stop using atomics on it_need and
> >>>>> just grab the thread lock in the main ithread loop and hold it
> >>>>> while checking and clearing it_need.
> >>>>>
> >>>> OK, we have some code that will either prove this, or perturb the
> >>>> memory ordering enough to make the bug go away, or prove this
> >>>> assertion wrong.
> >>>>
> >>>> We will update on our findings hopefully in the next few days.
> >>> IMO the read of it_need in the 'while (ithd->it_need)' should have
> >>> acquire semantics, otherwise the later reads in
> >>> ithread_execute_handlers(), in particular of ih_need, could pass
> >>> the read of it_need and cause the situation you reported. I do not
> >>> see any acquire barrier between the condition in the while()
> >>> statement and the read of ih_need in the execute_handlers().
> >>>
> >>> It is probably true that the issue you see was caused by r236456,
> >>> in the sense that the implicitly locked xchgl instruction on x86
> >>> has full barrier semantics. As a result, the store_rel() was
> >>> actually an acquire too, making this reordering impossible. I
> >>> argue that this is not a bug in r236456, but an issue in
> >>> kern_intr.c.
> >> If I remember the code correctly, that would probably explain why
> >> we see it only on the 9.1 system.
> >>> On the other hand, John's suggestion to move the manipulations of
> >>> it_need under the lock is probably the best anyway.
> >>>
> >> I was wondering if it would be lower latency to maintain it_need,
> >> but also keep another variable, it_needlocked, under the thread
> >> lock. This would result in potential superfluous interrupts,
> >> however under load it would allow the ithread to loop without
> >> taking the thread lock some number of times.
> >>
> >> I am not really sure the optimization is worth it (especially since
> >> it can result in superfluous interrupts), however it may reduce
> >> latency, and that might be important.
> >>
> >> Are there people I can pass the patch onto for help with
> >> performance once we confirm that this is the actual bug? We can do
> >> internal testing, but I am worried about regressing performance of
> >> any form of IO for the kernel.
> >>
> >> I'll show the patch soon.
> >>
> >> Thank you for the information. This is promising.
> > Well, if you and I are right, the minimal patch should be
> >
> > diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
> > index 8d63c9b..7c21015 100644
> > --- a/sys/kern/kern_intr.c
> > +++ b/sys/kern/kern_intr.c
> > @@ -1349,7 +1349,7 @@ ithread_loop(void *arg)
> >  		 * we are running, it will set it_need to note that we
> >  		 * should make another pass.
> >  		 */
> > -		while (ithd->it_need) {
> > +		while (atomic_load_acq_int(&ithd->it_need)) {
> >  			/*
> >  			 * This might need a full read and write barrier
> >  			 * to make sure that this write posts before any
> >
> OK, we can try this.
>
> I've been pretty good at locking when using mutexes, but when we get
> into atomic ops like this it gets a little tough for me to follow
> without extensive research. I know that the signalling thread (the
> swi_sched caller) does not use any atomic ops... is this OK?
The sequence of the actions there is

	atomic_store_rel_int(&ih->ih_need, 1);

and later, in intr_event_schedule_thread():

	it->it_need = 1;
	thread_lock(td);
	sched_add(td);
	thread_unlock(td);

There are two things to worry about, as I see it:

1. The possibility of seeing it_need == 1 but ih_need == 0. This could
   happen on some architectures which allow write reordering, so it
   might make sense to move the store_rel from ih_need to it_need.
   Effectively, we would need rel in both places, since the call to
   intr_event_schedule_thread() is conditional. But the reordering is
   impossible on x86.

2. The possibility of scheduling the interrupt thread without the CPU
   running it noticing the it_need = 1 write. Since the thread lock is
   locked and released, the release barrier is applied, so the write
   should become visible before the thread is scheduled on a processor.

I would start with only the addition of load_acq() for now and see if
it helps.
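
To illustrate point 1 in isolation, here is a minimal userland sketch
of the same message-passing pattern, written with C11 atomics rather
than the kernel's atomic(9) API. The names are hypothetical stand-ins
for it_need/ih_need, not the actual kern_intr.c code, and the rel is
placed on it_need, as point 1 suggests:

#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel's flags. */
static _Atomic int it_need;	/* "the ithread should run" */
static _Atomic int ih_need;	/* "this handler should run" */

/* Producer side (interrupt delivery path). */
static void
producer(void)
{
	atomic_store_explicit(&ih_need, 1, memory_order_relaxed);
	/* Release: the ih_need store cannot sink below this store. */
	atomic_store_explicit(&it_need, 1, memory_order_release);
	/* ...wake the ithread... */
}

/* Consumer side, shaped like the loop in ithread_loop(). */
static void
consumer(void)
{
	/*
	 * Acquire: the ih_need load below cannot be hoisted above
	 * this load.  With a plain load, the compiler or CPU may read
	 * ih_need first, see a stale 0, and skip the handler even
	 * though it_need was seen as 1.
	 */
	while (atomic_load_explicit(&it_need, memory_order_acquire)) {
		atomic_store_explicit(&it_need, 0, memory_order_relaxed);
		if (atomic_load_explicit(&ih_need, memory_order_relaxed)) {
			atomic_store_explicit(&ih_need, 0,
			    memory_order_relaxed);
			/* ...execute the handler... */
		}
	}
}

The pairing is what matters: the release store and the acquire load
act on the same variable (it_need), which is what guarantees that a
consumer observing it_need == 1 also observes the earlier write to
ih_need.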
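
And a rough analogue of John's thread-lock variant, again only a
userland sketch with pthreads standing in for the thread lock. The
real change would still have to drop the lock around
ithread_execute_handlers(), since the thread lock is a spin lock and
the handlers can sleep:

#include <pthread.h>

/* Hypothetical analogue; td_lock stands in for the thread lock. */
static pthread_mutex_t td_lock = PTHREAD_MUTEX_INITIALIZER;
static int it_need;	/* protected by td_lock; no atomics, no barriers */

/* Consumer: check and clear it_need under the lock. */
static void
ithread_pass(void)
{
	pthread_mutex_lock(&td_lock);
	while (it_need) {
		it_need = 0;
		/* Drop the lock around the handlers; they can sleep. */
		pthread_mutex_unlock(&td_lock);
		/* ...ithread_execute_handlers()... */
		pthread_mutex_lock(&td_lock);
	}
	/* The kernel would go to sleep here, still holding the lock. */
	pthread_mutex_unlock(&td_lock);
}

/* Producer: set it_need under the same lock before waking the thread. */
static void
schedule_ithread(void)
{
	pthread_mutex_lock(&td_lock);
	it_need = 1;
	/* ...the sched_add(td) equivalent would go here... */
	pthread_mutex_unlock(&td_lock);
}

This trades the barrier reasoning for a lock round-trip on every loop
iteration, which is exactly the cost the it_needlocked idea above was
trying to amortize.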