Ryan,

We believe this is a bug: we expect curtin to wipe the disks. In this case it fails to wipe them, and that occasionally causes issues with our automation deploying Ceph on those disks. This may be more of an issue with LVM, a race condition hit while wiping all of the disks sequentially, given the large number of disks/VGs/LVs involved.

To clarify my previous testing: I was mistaken. I thought MAAS used the commissioning OS as the ephemeral OS to deploy from, but that is not the case; MAAS uses the specified deployment OS as the ephemeral image to deploy from. All of my previous testing was therefore done with Bionic using the 4.15 kernel. This confirms it is a race condition somewhere, since the error does not always reproduce, and it was just a coincidence that I was changing the commissioning OS at the time.
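For reference, below is a rough reproducer sketch, not our actual automation: it assumes loop devices standing in for the real disks, uses hypothetical names (testvg, /tmp/pv0.img), requires root, and simply creates many LVs and then removes them back to back the way the wipe path does. It may or may not trigger the race on a given kernel.

```python
#!/usr/bin/env python3
"""Rough reproducer sketch: create many LVs on a loop-backed VG, then remove
them sequentially with no settling in between, watching for the
'Device or resource busy' warning from device-mapper."""
import subprocess


def run(cmd, **kwargs):
    # Thin wrapper: run a command, fail loudly, capture output as text.
    return subprocess.run(cmd, check=True, capture_output=True, text=True, **kwargs)


# Back a throwaway VG with a sparse file on a loop device (hypothetical names).
run(['truncate', '-s', '10G', '/tmp/pv0.img'])
loop_dev = run(['losetup', '--find', '--show', '/tmp/pv0.img']).stdout.strip()
run(['vgcreate', 'testvg', loop_dev])

lvs = [f'lv{i}' for i in range(20)]
for lv in lvs:
    run(['lvcreate', '--size', '256M', '--name', lv, 'testvg'])

# Remove the LVs back to back, mirroring the sequential wipe, and check stderr
# for the device-mapper busy warning even when lvremove exits 0.
for lv in lvs:
    result = subprocess.run(
        ['lvremove', '--force', '--force', f'testvg/{lv}'],
        capture_output=True, text=True)
    if 'Device or resource busy' in result.stderr:
        print(f'hit the device-mapper busy race on testvg/{lv}')
```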
I tested this morning and was able to reproduce the issue with Bionic 4.15 and Xenial 4.4; however, I have yet to reproduce it using either the Bionic or Xenial HWE kernels. I will upload the curtin logs and config from my reproducer now.

https://bugs.launchpad.net/bugs/1871874

Title:
  lvremove occasionally fails on nodes with multiple volumes and curtin
  does not catch the failure

Status in curtin package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  For example:

  Wiping lvm logical volume: /dev/ceph-db-wal-dev-sdc/ceph-db-dev-sdi
  wiping 1M on /dev/ceph-db-wal-dev-sdc/ceph-db-dev-sdi at offsets [0, -1048576]
  using "lvremove" on ceph-db-wal-dev-sdc/ceph-db-dev-sdi
  Running command ['lvremove', '--force', '--force', 'ceph-db-wal-dev-sdc/ceph-db-dev-sdi'] with allowed return codes [0] (capture=False)
  device-mapper: remove ioctl on (253:14) failed: Device or resource busy
  Logical volume "ceph-db-dev-sdi" successfully removed

  On a node with 10 disks configured as follows:

  /dev/sda2  /
  /dev/sda1  /boot
  /dev/sda3  /var/log
  /dev/sda5  /var/crash
  /dev/sda6  /var/lib/openstack-helm
  /dev/sda7  /var
  /dev/sdj1  /srv

  sdb and sdc are used for BlueStore WAL and DB
  sdd, sde, sdf: ceph OSDs, using sdb
  sdg, sdh, sdi: ceph OSDs, using sdc

  Across multiple servers this happens occasionally with various disks. It looks like this may be a race condition, possibly in LVM, as curtin is wiping multiple volumes when lvremove fails.
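As a minimal sketch of the kind of guard that would catch this in the wipe path (illustrative only, not curtin's implementation): after each lvremove, settle udev and verify that the device-mapper node for the LV has actually disappeared before moving on, rather than trusting the exit code alone. The LV name below is taken from the log excerpt above; the retry counts and delays are arbitrary assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch: remove an LV and confirm its device node is gone."""
import os
import subprocess
import time


def remove_lv_and_verify(vg_lv, attempts=5, delay=2.0):
    """Remove a logical volume, then wait until its /dev node is really gone.

    vg_lv is '<vg>/<lv>'. In the log above, lvremove exits 0 while still
    printing 'remove ioctl ... Device or resource busy', so success is judged
    by the device node disappearing, not by the exit code alone.
    """
    dev_path = os.path.join('/dev', vg_lv)
    subprocess.run(['lvremove', '--force', '--force', vg_lv], check=True)
    for _ in range(attempts):
        # Let udev finish processing the remove events before re-checking.
        subprocess.run(['udevadm', 'settle'], check=False)
        if not os.path.exists(dev_path):
            return
        time.sleep(delay)
    raise RuntimeError(f'{dev_path} still present after lvremove')


if __name__ == '__main__':
    # LV name from the log excerpt above.
    remove_lv_and_verify('ceph-db-wal-dev-sdc/ceph-db-dev-sdi')
```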