On Wed, May 8, 2019 at 11:55 PM Trent Lloyd <trent.ll...@canonical.com>
wrote:

> I have been running into this
> (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1)
>
> I think this commit basically agrees with my thoughts but I just wanted
> to share them explicitly in case they are interesting
>
>  (1) If you *unregister* the cache device from the backing device, it
> first has to purge all the dirty data back to the backing device. This
> may obviously take a while.
>
>  (2) When doing that, I managed to deadlock bcache at least once on
> xenial-hwe 4.15 where it was trying to reclaim memory from XFS, which I
> assume was trying to write to the bcache.. traceback:
> https://pastebin.canonical.com/117528/ - you can't get out of that
> without a reboot
>

Thanks for capturing those; I've hit quite a few of my own on an unregister
path which _should_ work but doesn't, because of various bugs in bcache.  I
need to attach those OOPSes to this bug as well.


>
>  (3) However generally I had good luck simply "stop"ing the cache
> devices (it seems perhaps that is what this bug is designed to do,
> switch to stop, instead of unregister?). Specifically though I was
> stopping the backing devices, and then later the cache device. It seems
> like the current commit is the other way around?
>

Unregister is just not stable, so stopping is what is being done now.

I did attempt stopping the bcache devices first, and only once all bcache
devices were stopped, stopping and removing the cacheset; this proved
unreliable under our integration testing of various bcache scenarios.
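
For readers following the ordering question, here is a minimal sketch of
what "stopping" amounts to at the sysfs level. It is not curtin's code. The
helper name stop_bcache and its sysfs_root parameter are hypothetical, and
the cacheset-first order shown is only the alternative to the order
described above, not a confirmed description of the commit. The only thing
taken from the kernel's bcache documentation is the two "stop" attributes.

```python
import os


def stop_bcache(sysfs_root, cset_uuid, bcache_names):
    """Write '1' to the stop attribute of a cacheset, then of each device.

    Hypothetical helper: sysfs_root is normally '/sys', but is a parameter
    so the function can be exercised against a scratch directory.
    """
    # Cacheset first (assumed order), then the bcacheN block devices.
    targets = [os.path.join(sysfs_root, 'fs', 'bcache', cset_uuid, 'stop')]
    targets += [os.path.join(sysfs_root, 'block', name, 'bcache', 'stop')
                for name in bcache_names]

    written = []
    for target in targets:
        # A device may already be gone once its cacheset has stopped.
        if not os.path.exists(target):
            continue
        with open(target, 'w') as fh:
            fh.write('1')
        written.append(target)
    return written
```

Reversing the two groups in "targets" gives the devices-first order that
was tried and found unreliable in testing.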


>
> ** Tags added: sts
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1796292
>
> Title:
>   Tight timeout for bcache removal causes spurious failures
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions
>

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  New
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a mass-
  deployment of 30+ nodes. Upon retrying the node would usually deploy
  fine. Experimentally I've set the timeout ridiculously high, and it
  seems I'm getting no faults with this. I'm wondering if the timeout
  for removal is set too tight, or might need to be made configurable.

  --- curtin/util.py~     2018-05-18 18:40:48.000000000 +0000
  +++ curtin/util.py      2018-10-05 09:40:06.807390367 +0000
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)
   
   
  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')
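
The patch above only lengthens the default list. As a rough sketch of the
"configurable" alternative raised in the description, the function below
takes the retry schedule as an argument. Only the signature and the
ValueError check come from the diff; the loop body, the OSError on timeout,
and the injectable exists/sleep parameters are assumptions made for
illustration and are not curtin's actual implementation.

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7),
                     exists=os.path.exists, sleep=time.sleep):
    """Wait for path to disappear, sleeping retries[i] seconds per attempt.

    exists and sleep are hypothetical injection points so the schedule can
    be tested without touching /sys or actually sleeping.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')

    for naplen in retries:
        if not exists(path):
            return
        sleep(naplen)

    # One last check after the final sleep before giving up.
    if exists(path):
        raise OSError('Timeout exceeded for removal of %s' % path)
```

A caller doing a mass deployment could then pass a longer schedule, for
example retries=(1, 3, 5, 7, 30, 60), without patching the default.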

