Hi Kristian,

On Thu, Nov 06, 2014 at 03:12:55PM +0100, Kristian Nielsen wrote:
> I have been debugging some test failures seen in Buildbot:
>
>   https://mariadb.atlassian.net/browse/MDEV-7026
>
> After some debugging, this is what I think is going on. I will refer to
> MariaDB 5.5 sources, though the problem is also in 10.0.
>
> InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in
> mutex_exit_func().
>
> Simplifying for clarity, here are the relevant parts of the logic:
>
>   ENTER:
>
>     # First spin wait, hoping to get the mutex without yielding.
>     os_compare_and_swap_ulint(&mutex->waiters, 0, 1);

Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"? FWICS
the waiters field of an InnoDB mutex is accessed with plain non-atomic
loads/stores.
>     if (os_atomic_test_and_set_byte(&mutex->lock_word, 1))
>       sync_array_wait_event()
>
>   RELEASE:
>
>     os_atomic_lock_release_byte(&mutex->lock_word);
>     os_rmb;
>     waiters = *(volatile *)&mutex->waiters;
>     if (waiters)
>       mutex_signal_object(mutex);
>
> The interesting point here is how to ensure that we do not lose the wakeup
> of a waiting thread when the holding thread releases the mutex.
>
> The idea is that the waiting thread first sets mutex->waiters=1. Then it
> checks if the mutex is free - if not, it goes to sleep.
>
> The thread that releases the lock first marks the mutex free. Then it
> checks if there are any waiters, and if so wakes them up.
>
> For this to work, we need to ensure that the setting of waiters=1 has
> completed before the load of lock_word can start - a full memory barrier.
> Likewise, we need the setting of lock_word to 0 to complete before the load
> of mutex->waiters - again a full memory barrier. Otherwise we can get
> something like this:
>
>   ENTER                                  RELEASE
>
>   Start store to mutex->waiters
>                                          Start store to lock_word
>   Load mutex->lock_word
>                                          Store to lock_word becomes visible
>                                          Load mutex->waiters
>   Store to waiters becomes visible
>   Check lock_word==1
>                                          Check waiters==0, skip wakeup
>   Sleep (forever)
>
> Or this:
>
>   ENTER                                  RELEASE
>
>                                          Start store to lock_word
>   Start store to mutex->waiters
>                                          Load mutex->waiters
>   Store to waiters becomes visible
>   Load mutex->lock_word
>                                          Store to lock_word becomes visible
>   Check lock_word==1
>   Sleep (forever)
>                                          Check waiters==0, skip wakeup
>
> (So both full barriers are needed: the one in ENTER to prevent the first
> example, the one in RELEASE to prevent the second.)

Agree so far.

> It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and
> 10.0.13.
>
> os_compare_and_swap_ulint() is defined as either
> __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which imply a
> full barrier. So ENTER looks ok.

See above: the waiters loads/stores are not atomic, so there are no memory
barriers at all on that side.

> os_atomic_lock_release_byte() however is now defined as
> __sync_lock_release(). This is _not_ a full barrier, just a release
> barrier. So the RELEASE case seems to be incorrect, allowing the second
> deadlock example to occur.
>
> Previously, RELEASE used os_atomic_test_and_set_byte(), which is
> __sync_lock_test_and_set(). This is defined in the GCC docs as an "acquire"
> barrier only, which is also wrong, obviously. But it happens to generate an
> XCHG instruction on x86, which is a full barrier. This is probably why it
> works on x86 (and probably why it did not work on PowerPC).
>
> Here is a small test program, and the assembly from GCC 5.7 on amd64:
>
>   void test(unsigned *p, unsigned *q) {
>     while (__sync_lock_test_and_set(q, 1)) { }
>     *p++;
>     __sync_lock_release(q);
>   }
>
>   test:
>           movl    $1, %edx
>   .L2:
>           movl    %edx, %eax
>           xchgl   (%rsi), %eax
>           testl   %eax, %eax
>           jne     .L2
>           movl    $0, (%rsi)
>           ret
>
> As can be seen, the __sync_lock_test_and_set() generates a full barrier,
> but the __sync_lock_release() generates nothing (as this is sufficient for
> release semantics under the rather strong memory ordering of x86).
>
> So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and
> 10.0.13 :-(
>
> This is consistent with the observations in MDEV-7026. A thread is waiting
> for wakeup on a mutex that is already marked free.

Ok, so you say this hang is now repeatable on x86 because we changed ACQUIRE
-> RELEASE. I assume you were actually able to reproduce it on x86.
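Just to make the x86 side of this concrete, here is a variation of your test
program with an explicit full barrier added after the release store. This is
only a compiler-exploration sketch, not a proposed patch, and
test_with_fence is my own name for it:

    void test_with_fence(unsigned *p, unsigned *q)
    {
        while (__sync_lock_test_and_set(q, 1)) { } /* acquire loop: XCHG on x86    */
        *p++;                                      /* same dead statement as in the
                                                      original: it bumps the local
                                                      copy of p, not *p            */
        __sync_lock_release(q);                    /* release: plain store of 0    */
        __sync_synchronize();                      /* explicit full memory barrier */
    }

With GCC on amd64 I would expect the only change in the generated code to be
an mfence after the movl $0, (%rsi), which is exactly the full barrier the
RELEASE path is currently missing, so expressing such a fix at the C level
should be straightforward.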
Note that we also added a mysterious os_isync to mutex_exit_func(), which is
supposed to provide, among other things, ACQUIRE semantics. But on x86 it is
a no-op, with this comment:

  /* Performance regression was observed at some conditions for Intel
  architecture. Disable memory barrier for Intel architecture for now. */

> Apparently this is not a new problem. This comment in mutex_exit_func()
> makes me want to cry:
>
>   /* A problem: we assume that mutex_reset_lock word
>   is a memory barrier, that is when we read the waiters
>   field next, the read must be serialized in memory
>   after the reset. A speculative processor might
>   perform the read first, which could leave a waiting
>   thread hanging indefinitely.
>
>   Our current solution call every second
>   sync_arr_wake_threads_if_sema_free()
>   to wake up possible hanging threads if
>   they are missed in mutex_signal_object. */

Hehe, you are not the only one who has cried about this comment. :)

> So someone found the mutex implementation broken, but instead of fixing it,
> they implemented a horrible workaround to try and break any deadlocks that
> might occur.
>
> (The reason that the deadlocks are not broken in MDEV-7026 seems to be that
> it occurs during startup, before starting the srv_error_monitor_thread()
> that does the sync_arr_wake_threads_if_sema_free()).
>
> Now, what is also interesting is that this seems to explain very nicely the
> many server lockups we have seen for users that upgrade to recent MariaDB
> versions. I was doing a very detailed investigation of one such case where
> the user was able to supply unusually detailed information. I found a
> thread that was stuck waiting on a mutex, but no other threads seemed to be
> holding that mutex. This is exactly what would be seen in case of a lost
> wakeup due to this issue. Furthermore, the
> sync_arr_wake_threads_if_sema_free() mechanism, as well as the normal
> InnoDB "long semaphore wait", was itself blocked on that same mutex wait,
> which explains why they were not effective for the user in that particular
> case.

Does it explain server lockups at startup or during normal operation? Could
you elaborate on the problem you're talking about?

> So it looks like we need to revert the incorrect patch and release new
> 5.5.40a and 10.0.15 releases ASAP, to solve these hangs for users.
>
> And do a proper patch for PowerPC with proper full memory barriers.
> (Incidentally, the os_rmb in RELEASE seems to also be part of some PowerPC
> patch, it also looks completely wrong).

I believe it is too late to revert this patch: we have already claimed
compatibility with Power8. I'd vote for putting a full memory barrier into
mutex_exit_func() and probably removing that ugly hack with
sync_arr_wake_threads_if_sema_free(). :( Though a proper solution should
never issue full memory barriers for mutex lock/unlock. I suggested a
different idea, but that's for 10.1:
https://mariadb.atlassian.net/browse/MDEV-6654

That's unfortunate, but we attempted to fix the wrong thing with the wrong
patch. I was kind of against this:
https://mariadb.atlassian.net/browse/MDEV-6515

<quot>
- mutex_get_waiters() misses an acquire memory barrier. This may cause
  mutex_exit_func() to read a stale 'waiters' value and be the reason for
  the deadlock. There seems to be a workaround for that:
  srv_error_monitor_thread() is supposed to wake these stale threads every
  second. But if that's the case, we don't really need the release memory
  barrier in mutex_set_waiters().
</quot>

s/acquire memory barrier/full memory barrier/
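To make the placement of the barriers concrete, below is a minimal sketch of
the ENTER/RELEASE handshake from the top of this mail, with both required
full barriers written out using the same __sync builtins. The
toy_mutex/toy_enter_slow/toy_exit names are mine; this is not the actual
sync0sync code, only an illustration of where the fences have to sit:

    struct toy_mutex {
        volatile unsigned char lock_word;   /* 1 = locked, 0 = free           */
        volatile unsigned      waiters;     /* 1 = somebody may go to sleep   */
    };

    static void toy_enter_slow(struct toy_mutex *m)
    {
        m->waiters = 1;                     /* announce that we may sleep     */
        __sync_synchronize();               /* full barrier #1 (ENTER): the
                                               waiters store must be visible
                                               before lock_word is examined   */
        while (__sync_lock_test_and_set(&m->lock_word, 1)) {
            /* the real code sleeps in sync_array_wait_event(); the sketch
               just spins */
        }
        m->waiters = 0;                     /* reset once the lock is held
                                               (detail elided in the
                                               pseudocode above)              */
    }

    static void toy_exit(struct toy_mutex *m)
    {
        __sync_lock_release(&m->lock_word); /* lock_word = 0, release only    */
        __sync_synchronize();               /* full barrier #2 (RELEASE): the
                                               barrier this mail argues is
                                               missing in mutex_exit_func()   */
        if (m->waiters) {
            /* the real code calls mutex_signal_object(mutex) here */
        }
    }

Whether we pay for the second fence with __sync_synchronize(), a dummy XCHG
as before, or something smarter is a separate question; the sketch is only
about the required ordering.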
> However, more importantly, this issue clearly shows some fundamental
> problems in our development procedures that we need to fix ASAP.
>
> Just look at it. We have users (and customers!) that suffer from strange
> hangs. These are their production systems that are down. Support and
> developers are unable to help them due to the difficulty of debugging rare
> race conditions in production systems.
>
> And all the time the bug was staring us in the face, easily reproducible in
> our own Buildbot! But we deliberately chose to ignore those failures. The
> only thing that matters is "when can you push", and getting more releases
> out. Never mind what kind of crap is then pushed out to users.
>
> I am deeply ashamed of treating our users like this. And extremely unhappy.
> I even wrote about this (the importance of not ignoring Buildbot failures)
> just a few weeks ago. This was completely ignored, I got zero replies, and
> the weekly calls continued with mindless juggling of Jira numbers and buggy
> releases.
>
> As a result, our users suffer.
>
> We need to stop this.
>
> Please?
>
> - Kristian.

Regards,
Sergey