Hi,

I hit this issue on Bionic, Disco and Eoan. Our (server-team) Jenkins
nodes often fill up with stale LXD containers which are left there
because of "fails to destroy ZFS filesystem" errors.

Some thoughts and qualitative observations:

0. This is not a corner case, I see the problem all the time.

1. There is probably more than one issue involved here, even if we get
similar error messages when trying to delete a container.

2. One issue is about mount namespaces: stray mounts that prevent the
container from being deleted. This issue can be worked around by entering
the namespace and unmounting. The container can then be deleted. When
this happens retrying `lxc delete` doesn't help. This is described in
[0]. I think the newer versions of LXD are way less prone to end up in
this case.

3. In other cases `lxc delete --force` fails with the "ZFS dataset is
busy" error, but the deletion succeeds if the delete is retried
immediately after. In my case I don't even need to wait for a single
second: the second delete in `lxc delete --force <x> ; lxc delete <x>`
already works. Stopping and deleting the container as separate
operations also works.

4. It has been suggested in [0] that LXD could retry the "delete"
operation if it fails. stgraber wrote that LXD *already* retries the
operation 20 times over 10 seconds, but the outcome is still a failure.
It is not clear to me why retrying manually works while LXD's own
auto-retrying does not.

5. Some time ago (weeks) the error message changed from "Failed to
destroy ZFS filesystem: dataset is busy" to "Failed to destroy ZFS
filesystem:" with no other detail. I can't tell which specific upgrade
triggered this change.

6. I see this problem in both file-backed and device-backed zpools.

7. I'm not sure system load plays a role: I often hit the problem on my
lightly loaded laptop.

8. I don't have clear steps to reproduce the problem with 100%
probability, but I see it happening more often than not. But see the
next point.

9. In my experience a system can be in a "bad state" (the problem always
happens), or in a "good state" (the problem never happens). When the
system is in a "good state" we can `lxc delete` hundreds of containers
with no errors. I can't tell what makes a system switch from a good to a
bad state. I am almost certain I also saw systems switching from a bad
to a good state.

10. The lxcfs package is not installed on the systems where I hit this
issue.
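
For reference, the workarounds from points 2 and 3 can be sketched as
the shell helpers below. This is only a rough sketch: the function
names, the default of 3 attempts with a 1 second delay, and the
`/containers/<name> ` match against `/proc/<pid>/mountinfo` are
assumptions made for illustration, not something LXD provides and not a
description of what LXD does internally.

```shell
#!/usr/bin/env bash
# Sketch of the workarounds from points 2 and 3. Helper names and
# defaults are hypothetical, chosen for illustration only.

# Point 3: force-delete a container, retrying if the delete fails.
# Usage: lxc_delete_retry <container> [attempts] [delay_seconds]
lxc_delete_retry() {
    local name="$1"
    local attempts="${2:-3}"
    local delay="${3:-1}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # --force also works on an already stopped container.
        if lxc delete --force "$name"; then
            return 0
        fi
        echo "delete of $name failed (attempt $i/$attempts)" >&2
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Point 2: print the PIDs whose mount table still mentions the
# container, i.e. candidates for a stray mount in another namespace.
# Assumes the container name is the last component of the mount path.
# Usage: find_mount_holders <container> [proc_dir]
find_mount_holders() {
    local name="$1"
    local proc="${2:-/proc}"
    local f pid
    for f in "$proc"/[0-9]*/mountinfo; do
        [ -r "$f" ] || continue
        # Fixed-string match; the trailing space ends the path field.
        if grep -qF -- "/containers/$name " "$f" 2>/dev/null; then
            pid="${f%/mountinfo}"
            echo "${pid##*/}"
        fi
    done
}
```

Once a PID is found, something like `nsenter -t <pid> -m -- umount
<mountpoint>` does the unmounting from inside that namespace. The mount
point depends on how LXD is installed, so it is left as a placeholder
here.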

That's it for the moment. Thanks for looking into this!

Paride

[0] https://github.com/lxc/lxd/issues/4656

-- 
https://bugs.launchpad.net/bugs/1779156

Title:
  lxc 'delete' fails to destroy ZFS filesystem 'dataset is busy'
