Ryan, unfortunately the last reproducer script is giving me a lot of errors, and I'm still trying to figure out how to make it run to the end (or at least to the point where it starts running some bcache commands).
In the meantime (as mentioned on IRC) I've uploaded a test kernel reverting the patch "UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent":

  https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+1/

As we know, this would re-introduce the problem discussed in bug 1729145, but it would be interesting to test anyway, just to see whether that patch is somehow related to the bch_bucket_alloc() deadlock.

In addition to that, I've spent some time looking at the last kernel trace and the code. It looks like bch_bucket_alloc() always releases the mutex &ca->set->bucket_lock when it goes to sleep (the call to schedule()), but it does not release bch_register_lock, which may also be held at that point. I was wondering whether this could be the reason for the deadlock, so I've prepared an additional test kernel that does *not* revert our "UBUNTU SAUCE" patch, but instead releases bch_register_lock when bch_bucket_alloc() goes to sleep (see the sketch at the bottom of this message):

  https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+3/

Sorry for asking for all these tests... as long as I can't find a way to reproduce the bug on my side, asking you to test is the only way I have to debug this issue. :)

--
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed
Status in linux source package in Disco:
  Confirmed
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  "Timeout exceeded for removal of /sys/fs/bcache/xxx" when doing a
  mass-deployment of 30+ nodes. Upon retrying, the node would usually
  deploy fine.

  Experimentally I've set the timeout ridiculously high, and it seems
  I'm getting no faults with this. I'm wondering if the timeout for
  removal is set too tight, or might need to be made configurable.

  --- curtin/util.py~  2018-05-18 18:40:48.000000000 +0000
  +++ curtin/util.py   2018-10-05 09:40:06.807390367 +0000
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')
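P.S. To make the locking idea above a bit more concrete, here is a very rough sketch of what the "+3" test kernel is experimenting with. This is NOT the actual diff: the wait loop is a simplified copy of the 4.15 bch_bucket_alloc() code, and the caller_holds_register_lock flag is only a hypothetical placeholder for "drop bch_register_lock only if the current path actually took it".

  /* drivers/md/bcache/alloc.c: bch_bucket_alloc() wait loop (simplified) */
  do {
          prepare_to_wait(&ca->set->bucket_wait, &w, TASK_UNINTERRUPTIBLE);

          /* bucket_lock is already dropped across the sleep today */
          mutex_unlock(&ca->set->bucket_lock);

          /*
           * Experiment: also drop bch_register_lock while we sleep, so a
           * concurrent register/unregister path can make progress instead
           * of blocking behind us while we wait for free buckets.
           * (caller_holds_register_lock is hypothetical, see above.)
           */
          if (caller_holds_register_lock)
                  mutex_unlock(&bch_register_lock);

          schedule();

          if (caller_holds_register_lock)
                  mutex_lock(&bch_register_lock);
          mutex_lock(&ca->set->bucket_lock);
  } while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
           !fifo_pop(&ca->free[reserve], r));

  finish_wait(&ca->set->bucket_wait, &w);

The real patch obviously has to know whether the current path actually holds bch_register_lock before dropping it; the sketch only shows where in the sleep path the unlock/relock would happen.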