On Thu, Aug 1, 2019 at 10:15 AM Andrea Righi <andrea.ri...@canonical.com> wrote:
> Thanks Ryan, this is very interesting:
>
> [ 259.411486] bcache: register_bcache() error /dev/vdg: device already
> registered (emitting change event)
> [ 259.537070] bcache: register_bcache() error /dev/vdg: device already
> registered (emitting change event)
> [ 259.797830] bcache: register_bcache() error /dev/vdg: device already
> registered (emitting change event)
> [ 259.900392] bcache: register_bcache() error /dev/vdg: device already
> registered (emitting change event)
>
> It looks like we're trying to register /dev/vdg multiple times as a
> backing device (make-bcache -B). I'm not getting this message during my
> tests, so that might be required to reproduce that particular deadlock.
>

We carry a specific sauce patch to ensure that if the cacheset is already
online and a backing device shows up later, the kernel emits the change
event to trigger the udev rules that generate the symlink for
/dev/bcache/by-uuid.

I don't think the patch we carry is at issue, since we are just detecting
the re-register scenario and emitting a change uevent:

https://www.spinics.net/lists/linux-bcache/msg05833.html

We may want to resubmit that now to see if they'll take it, or whether
they want to deal with the scenario in a cleaner way.

>
> I'll modify my test case to trigger these errors and see if I can
> reproduce the hung task timeout issue.
>

I can provide you a setup to reproduce this. I'll put together a doc.

> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1796292
>
> Title:
> Tight timeout for bcache removal causes spurious failures
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions
>

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed
Status in linux source package in Disco:
  Confirmed
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a
  mass-deployment of 30+ nodes. Upon retrying the node would usually
  deploy fine. Experimentally I've set the timeout ridiculously high,
  and it seems I'm getting no faults with this. I'm wondering if the
  timeout for removal is set too tight, or might need to be made
  configurable.

  --- curtin/util.py~ 2018-05-18 18:40:48.000000000 +0000
  +++ curtin/util.py 2018-10-05 09:40:06.807390367 +0000
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)


  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
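
The bug description asks whether the removal timeout might need to be made
configurable. As a rough illustration only, here is a minimal sketch of a
retry-based wait of that shape. It is not curtin's actual implementation:
only the function name, the `path`/`retries` parameters, the ValueError
message and the "Timeout exceeded for removal of" wording come from the
report. The `sleep` and `exists` parameters are hypothetical additions so
the function can be exercised without a real sysfs path, and raising
OSError on timeout is an assumption.

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7),
                     sleep=time.sleep, exists=os.path.exists):
    """Poll until `path` disappears, sleeping retries[i] seconds between checks.

    `sleep` and `exists` are hypothetical hooks (not from the bug report)
    that let callers and tests substitute their own implementations.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')

    # One existence check before each sleep, plus a final check after the
    # last sleep (the trailing None means "do not sleep again").
    for delay in list(retries) + [None]:
        if not exists(path):
            return
        if delay is not None:
            sleep(delay)

    # Assumption: a timeout is reported as OSError.
    raise OSError('Timeout exceeded for removal of %s' % path)
```

With the original retries of 1, 3, 5 and 7 seconds the caller waits at most
16 seconds in total; the experimental list in the diff above raises that to
2416 seconds. A caller could pass its own schedule, for example
wait_for_removal('/sys/fs/bcache/xxx', retries=[1, 3, 5, 7, 30, 60]).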