> This is in an integration lab, so these hosts (including MAAS) are
> stopped, MAAS is reinstalled, and the systems are redeployed without any
> release or option to wipe during a MAAS release. Then MAAS deploys Bionic
> on these hosts thinking they are completely new systems, but in reality
> they still have the old volumes configured. MAAS configures the root disk
> but writes nothing to the other disks, which are provisioned through
> other automation later.
Even with a system as you describe, curtin will erase all metadata as
configured. I do not believe that after deployment any LVM devices will be
present on the booted system or found with LVM scan tools. I would very much
like to see the curtin install log from the scenario you describe, and any
"old volumes" that appear configured after the install.

If some post-deployment script starts creating VGs and LVs, it is possible
they could find metadata that curtin did not detect (offsets further into
the disk). MAAS and curtin are not responsible for wiping the entire
contents of the disk *unless* told to do so. Curtin accepts config like:

  wipe: zero

which will zero out the entire device (disk, partition, etc.). However, such
wipes can take a very long time, so I do not think this is a useful setting
here. Instead, the post-deployment scripts should follow best practices,
just as curtin does when dealing with reused storage. Note that:

1) LVM tools *warn* when they find existing metadata.
2) LVM tools include a --zero flag which will remove existing metadata
   before creating; this is best practice when reusing existing storage.

Curtin also pre-wipes disks and partitions at their location in the disk
before creating things on top, specifically to prevent buried metadata from
causing issues when creating new composed devices.

So please do find out more details about the post-install deployment.

https://bugs.launchpad.net/bugs/1871874

Title:
  lvremove occasionally fails on nodes with multiple volumes and curtin
  does not catch the failure

Status in curtin package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  For example:

    Wiping lvm logical volume: /dev/ceph-db-wal-dev-sdc/ceph-db-dev-sdi
    wiping 1M on /dev/ceph-db-wal-dev-sdc/ceph-db-dev-sdi at offsets [0, -1048576]
    using "lvremove" on ceph-db-wal-dev-sdc/ceph-db-dev-sdi
    Running command ['lvremove', '--force', '--force', 'ceph-db-wal-dev-sdc/ceph-db-dev-sdi'] with allowed return codes [0] (capture=False)
    device-mapper: remove ioctl on (253:14) failed: Device or resource busy
    Logical volume "ceph-db-dev-sdi" successfully removed

  On a node with 10 disks configured as follows:

    /dev/sda2  /
    /dev/sda1  /boot
    /dev/sda3  /var/log
    /dev/sda5  /var/crash
    /dev/sda6  /var/lib/openstack-helm
    /dev/sda7  /var
    /dev/sdj1  /srv

    sdb and sdc are used for BlueStore WAL and DB
    sdd, sde, sdf: ceph OSDs, using sdb
    sdg, sdh, sdi: ceph OSDs, using sdc

  Across multiple servers this happens occasionally with various disks. It
  looks like this may be a race condition, possibly in LVM, as curtin is
  wiping multiple volumes when the failure occurs.
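For illustration, here is a sketch of where such a wipe setting sits in a
curtin storage config; the disk ids and paths are made up for the example,
with "superblock" (the quick metadata-only wipe) shown alongside the full
"zero" wipe mentioned above:

  storage:
    version: 1
    config:
      - id: disk-sdb          # hypothetical disk id
        type: disk
        path: /dev/sdb
        wipe: superblock      # wipe known metadata locations only (fast)
      - id: disk-sdc
        type: disk
        path: /dev/sdc
        wipe: zero            # zero the entire device (can take hours)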
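And a minimal sketch of the best-practice reuse a post-deployment script
could follow, per points 1 and 2 above; the device, VG, and LV names are
hypothetical:

  # clear stale filesystem/LVM signatures from a disk being reused
  wipefs --all /dev/sdd
  # --zero y zeroes the start of the PV / new LV rather than trusting
  # whatever is already on the device (the flag referred to in point 2)
  pvcreate --zero y /dev/sdd
  vgcreate ceph-db-wal /dev/sdd
  lvcreate --zero y --name ceph-db --size 10G ceph-db-wal
  # confirm the LVM scan tools see only the expected devices
  pvs; vgs; lvs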