[[ moved to hackers@ from private mail. ]]
On 5/28/13 1:13 PM, John Baldwin wrote:
On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
On 5/28/13 9:04 AM, John Baldwin wrote:
On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
Hey folks,
I had a talk with Nathan Whitehorn about the camisr issue. The issue we
are seeing we mostly know, but to summarize, we are losing the camisr
signal and the camisr is not being run.
I gave him a summary of what we have been seeing and pointed him to the
code I am concerned about here:
http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c).
What I think that is happening is that the setting of it_need to 0
inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
correctly and it is being delayed until AFTER the call to
ithread_execute_handlers() right below the atomic_store_rel_int().
This seems highly unlikely, to the extent that if this were true all our
locking primitives would be broken. The store_rel is actually a release
barrier which acts like more a read/write fence. No memory accesses (read or
write) from before the release can be scheduled after the associated store,
either by the compiler or CPU. That is what Konstantin is referring to in his
commit when he says "release" semantics".
Yes, that makes sense, however does it specify that the writes *must*
occur at that *point*? If it only enforces ordering then we may have
some issue, specifically because the setting of it to '1' inside of
intr_event_schedule_thread has no barrier other than the acq semantics
of the thread lock. I am wondering what is forcing out the '1' there.
Nothing ever forces writes. You would have to invalidate the cache to do that
and that is horribly expensive. It is always only about ordering and knowing
that if you can complete another operation on the same "cookie" variable with
acquire semantics that earlier writes will be visible.
By cookie, you mean a specific memory address, basically a lock? This is
starting to reinforce my suspicions as the setting of it_need is done
with release semantics, however the acq on the other CPU is done on the
thread lock. Maybe that is irrelevant. We will find out shortly.
See below as I think we have proof that this is somehow happening.
Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix
is probably to stop using atomics on it_need and just grab the thread lock
in the main ithread loop and hold it while checking and clearing it_need.
OK, we have some code that will either prove this, or perturb the memory
ordering enough to make the bug go away, or prove this assertion wrong.
We will update on our findings hopefully in the next few days.
Thank you for your advice.
-Alfred
-Alfred
_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"