After more troubleshooting with cgregan, we've narrowed this down further.
We ran the following script on the node that was having trouble:

https://gist.github.com/pontillo/0b92a7da2fba43fb5dce705be2dcf38b

Unlike all the other devices MAAS works with, the Intel NVMe device reports a serial number that cannot be found anywhere in /dev/disk/by-id/*.

When curtin is supplied a serial number, it uses a heuristic to find the device as follows:

http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/435/curtin/commands/block_meta.py#L270
http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/435/curtin/block/__init__.py#L601

So arguably, the problem lies in how the Intel NVMe device's serial number is exposed; the way /dev/disk/* gets populated for it leaves much to be desired.

This is *arguably* a bug in curtin (and maybe MAAS, since we knowingly use the serial number even though `udevadm` can tell us that the serial cannot be found anywhere in /dev/disk/by-id/*), in that we could do a better job dealing with devices backed by not-so-robust kernel drivers. But I think we shouldn't encourage bad behavior on the part of driver writers, so I'm on the fence about whether or not we should fix it.

But mostly, I would argue that this is a bug in the Intel NVMe driver. The way it exposes the device to userland is non-standard and arguably broken.

When we ran `udevadm info -q all -n nvme0n1` on the device, we got the following pseudo-output:

    nvme0n1:
    P: /devices/pci0000:00/0000:00:xx.0/0000:xx:00.0/nvme/nvme0/nvme0n1
    N: nvme0n1
    S: SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
    S: disk/by-id/nvme-INTEL
    E: DEVLINKS=/dev/disk/by-id/nvme-INTEL /dev/SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
    E: DEVNAME=/dev/nvme0n1
    E: DEVPATH=/devices/pci0000:00/0000:00:xx.0/0000:xx:00.0/nvme/nvme0/nvme0n1
    E: DEVTYPE=disk
    E: ID_SERIAL=INTEL SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
    E: ID_SERIAL_SHORT=CVMDxxxxxxxxxxxxxx
    E: MAJOR=259
    E: MINOR=0
    E: SUBSYSTEM=block
    E: TAGS=:systemd:
    E: USEC_INITIALIZED=xxxxxxx

You can see from the lines that start with "S:" and from the "DEVLINKS=" line that the way this device is exposed is very non-standard: the serial number shows up in a link directly under /dev, while the only link under /dev/disk/by-id/ carries no serial at all.
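To make the mismatch concrete, here is a minimal Python sketch of the kind of check `udevadm` output allows: parse the "S:" and "E:" lines and ask whether any /dev/disk/by-id link contains the short serial. This is only an illustration, not curtin's or MAAS's actual code; the function names `parse_udevadm_info` and `by_id_links_matching` are hypothetical, and the sample text is the redacted pseudo-output from this comment.

```python
def parse_udevadm_info(text):
    """Split `udevadm info -q all` style output into symlinks and properties."""
    links, props = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("S: "):
            # "S:" lines are symlink names relative to /dev
            links.append(line[3:])
        elif line.startswith("E: ") and "=" in line:
            # "E:" lines are KEY=VALUE udev properties
            key, value = line[3:].split("=", 1)
            props[key] = value
    return links, props


def by_id_links_matching(links, serial):
    """Return the disk/by-id symlinks whose name contains the serial."""
    return [l for l in links if l.startswith("disk/by-id/") and serial in l]


# Redacted pseudo-output from the affected node (as quoted in this comment).
SAMPLE = """\
nvme0n1:
P: /devices/pci0000:00/0000:00:xx.0/0000:xx:00.0/nvme/nvme0/nvme0n1
N: nvme0n1
S: SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
S: disk/by-id/nvme-INTEL
E: DEVLINKS=/dev/disk/by-id/nvme-INTEL /dev/SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
E: DEVNAME=/dev/nvme0n1
E: ID_SERIAL=INTEL SSDxxxxxxxxxx_CVMDxxxxxxxxxxxxxx
E: ID_SERIAL_SHORT=CVMDxxxxxxxxxxxxxx
"""

links, props = parse_udevadm_info(SAMPLE)
serial = props["ID_SERIAL_SHORT"]
matches = by_id_links_matching(links, serial)

if matches:
    print("serial %s found in: %s" % (serial, ", ".join(matches)))
else:
    print("no /dev/disk/by-id link contains serial %s" % serial)
```

On the sample above the list of matches is empty, which is the situation any lookup keyed on names under /dev/disk/by-id/ would run into for this device.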
One would expect /dev/disk/by-id/* to contain a DEVLINK containing the serial number. Instead they expose a 'nvme-INTEL' link, which is (IMHO) a critical bug, because anyone expecting the things in /dev/disk/by-id/* to be unique will be in for a big surprise when they add a second NVMe device to a machine.

** Also affects: curtin
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: Invalid => New

** Changed in: linux (Ubuntu Xenial)
       Status: Fix Committed => New

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1651602

Title:
  Intel NVMe driver does not expose consistent links in /dev/disk/by-id

Status in curtin:
  New
Status in MAAS:
  Won't Fix
Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Xenial:
  Incomplete

Bug description:
  MAAS Version 2.1.1+bzr5544-0ubuntu1 (16.10.1)
  Deploying Xenial Nodes

  1) Deploy MAAS 2.1.1 on Yakkety
  2) Associate Juju 2.1 beta3
  3) Juju deploy Kubernetes Core

  Nodes begin to deploy but fail

  Installation failed with exception: Unexpected error while running command.
  Command: ['curtin', 'block-meta', 'custom']
  Exit code: 3
  Reason: -
  Stdout: b"no disk with serial 'CVMD434500BN400AGN' found\n"

