Public bug reported:

Hi,
I've just seen this on reboot, and I have to admit I'm not even entirely sure
we have to fix this, as it is a conflict between config and system. But let me
start with first things first.

I have a system with a whole bunch of network cards:
1. a four port NetXtreme BCM5719
2. a single MLX ConnectX-4 Lx
3. a two port Intel X540-AT2

The system is MAAS deployed, so MAAS has config data for the
system describing these cards (I guess).

I'm using DPDK on that system, which sometimes means I'm unassigning
the "normal" kernel driver and replacing it with e.g. vfio-pci for
use in userspace. That makes the card disappear from a classic
system's point of view, e.g. `ip`, but it is still there in e.g. `lspci`.
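
To make the "gone from `ip`, still in `lspci`" state concrete, here is a minimal sketch (not part of the original report) that walks the standard Linux sysfs layout and reports, for each network-class PCI device, the bound driver and any kernel interfaces. The function name `pci_net_devices` and the `sysfs_root` parameter are hypothetical and only used for illustration.

```python
import os


def pci_net_devices(sysfs_root="/sys"):
    """Map each network-class PCI address to (driver, [kernel interfaces]).

    Hypothetical helper for illustration; it only reads sysfs.
    """
    base = os.path.join(sysfs_root, "bus", "pci", "devices")
    result = {}
    for addr in sorted(os.listdir(base)):
        dev = os.path.join(base, addr)
        try:
            with open(os.path.join(dev, "class")) as f:
                pci_class = f.read().strip()
        except OSError:
            continue
        # PCI base class 0x02 is "network controller".
        if not pci_class.startswith("0x02"):
            continue
        # "driver" is a symlink to the bound driver, absent if unbound.
        link = os.path.join(dev, "driver")
        driver = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
        # "net/" exists only while a kernel network driver owns the device.
        net_dir = os.path.join(dev, "net")
        ifaces = sorted(os.listdir(net_dir)) if os.path.isdir(net_dir) else []
        result[addr] = (driver, ifaces)
    return result


if __name__ == "__main__":
    for addr, (driver, ifaces) in pci_net_devices().items():
        print(addr, driver, ifaces)
```

A device bound to vfio-pci or uio_pci_generic would be expected to appear here with that driver name and an empty interface list, which is why tools that enumerate interfaces no longer see it.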

Now the problem I'm seeing is after my workload reassigned two of those
devices as seen here:

$ dpdk-devbind.py --status

Network devices using DPDK-compatible driver
============================================
0000:04:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=uio_pci_generic unused=ixgbe,vfio-pci
0000:04:00.1 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=uio_pci_generic unused=ixgbe,vfio-pci

Network devices using kernel driver
===================================
0000:02:00.0 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno1 drv=tg3 unused=vfio-pci,uio_pci_generic *Active*
0000:02:00.1 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno2 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:02:00.2 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno3 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:02:00.3 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno4 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:08:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1 drv=mlx5_core unused=vfio-pci,uio_pci_generic


If I reboot the system while in that mode (and the devices will be assigned
that way again on reboot), the formerly configured MACs are not present and
cloud-init complains.

...
[  244.164245] cloud-init[1256]: failed run of stage init
[  244.176231] cloud-init[1256]: ------------------------------------------------------------
[  244.192244] cloud-init[1256]: Traceback (most recent call last):
[  244.204241] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 384, in main_init
[  244.216146] cloud-init[1256]:     init.fetch(existing=existing)
[  244.228220] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 432, in fetch
[  244.240199] cloud-init[1256]:     return self._get_data_source(existing=existing)
[  244.252143] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 323, in _get_data_source
[  244.264134] cloud-init[1256]:     (ds, dsname) = sources.find_source(
[  244.276209] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 923, in find_source
[  244.288117] cloud-init[1256]:     raise DataSourceNotFoundException(msg)
[  244.300224] cloud-init[1256]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: (DataSourceMAAS)
[  244.312144] cloud-init[1256]: During handling of the above exception, another exception occurred:
[  244.324119] cloud-init[1256]: Traceback (most recent call last):
[  244.336203] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 761, in status_wrapper
[  244.348129] cloud-init[1256]:     ret = functor(name, args)
[  244.360201] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 406, in main_init
[  244.372143] cloud-init[1256]:     init.apply_network_config(bring_up=bring_up_interfaces)
[  244.384132] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 908, in apply_network_config
[  244.396115] cloud-init[1256]:     self.distro.networking.wait_for_physdevs(netcfg)
[  244.408110] cloud-init[1256]:   File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[  244.420109] cloud-init[1256]:     raise RuntimeError(msg)
[  244.432230] cloud-init[1256]: RuntimeError: Not all expected physical devices present: {'8c:dc:d4:b3:6d:e8', '8c:dc:d4:b3:6d:e9'}
[  244.444140] cloud-init[1256]: ------------------------------------------------------------
[  304.197386] cloud-init[1256]: 2022-03-31 05:47:39,843 - handlers.py[WARNING]: failed posting event: finish: init-network: SUCCESS: searching for network datasources
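
The `RuntimeError` in the log is raised from `wait_for_physdevs`. The following is not cloud-init's code. It is a simplified sketch, assuming the check amounts to comparing the MAC addresses named in the network config against those the kernel currently exposes under `/sys/class/net`. The names `present_macs` and `missing_macs` are hypothetical.

```python
import os


def present_macs(sysfs_root="/sys"):
    """Return the set of MAC addresses of interfaces the kernel exposes."""
    net = os.path.join(sysfs_root, "class", "net")
    macs = set()
    for name in os.listdir(net):
        try:
            with open(os.path.join(net, name, "address")) as f:
                macs.add(f.read().strip().lower())
        except OSError:
            # Entries that are not interfaces (or vanish mid-scan) are skipped.
            continue
    return macs


def missing_macs(expected, sysfs_root="/sys"):
    """Return the expected MACs that have no matching kernel interface."""
    return {mac.lower() for mac in expected} - present_macs(sysfs_root)


if __name__ == "__main__":
    # The two MACs reported as missing in the log above.
    print(missing_macs({"8c:dc:d4:b3:6d:e8", "8c:dc:d4:b3:6d:e9"}))
```

Under that assumption, a port handed to a userspace driver has no entry in `/sys/class/net`, so its MAC can never show up, no matter how long the check waits.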


There might be a related or unrelated (not sure) later crash on not finding any
datasource, but you'll see that in the logs that I'll upload.
And as I said, you "might" say: you configured these devices and they are not
there, so what are we supposed to do? But seeing the crash, I wondered if there
might be a better way and wanted to bring it up for your consideration.
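
As one possible direction, here is a purely illustrative sketch of a less fatal check. It is not a proposed patch and not cloud-init's API. It assumes a heuristic: if at least as many network-class PCI devices are bound to userspace I/O drivers as there are missing MACs, log a warning instead of raising. The names `USERSPACE_DRIVERS`, `userspace_bound_nics`, and `check_physdevs` are hypothetical.

```python
import logging
import os

LOG = logging.getLogger(__name__)

# Assumption: drivers that hand a NIC to userspace and hide it from the kernel.
USERSPACE_DRIVERS = {"vfio-pci", "uio_pci_generic", "igb_uio"}


def userspace_bound_nics(sysfs_root="/sys"):
    """List network-class PCI addresses bound to a userspace I/O driver."""
    base = os.path.join(sysfs_root, "bus", "pci", "devices")
    found = []
    for addr in sorted(os.listdir(base)):
        dev = os.path.join(base, addr)
        try:
            with open(os.path.join(dev, "class")) as f:
                is_net = f.read().strip().startswith("0x02")
        except OSError:
            continue
        link = os.path.join(dev, "driver")
        if is_net and os.path.islink(link):
            if os.path.basename(os.readlink(link)) in USERSPACE_DRIVERS:
                found.append(addr)
    return found


def check_physdevs(missing, sysfs_root="/sys", strict=False):
    """Raise only if the missing MACs cannot be explained by detached NICs."""
    if not missing:
        return
    detached = userspace_bound_nics(sysfs_root)
    msg = "Not all expected physical devices present: %s" % sorted(missing)
    # Heuristic: more missing MACs than detached NICs means something else is wrong.
    if strict or len(missing) > len(detached):
        raise RuntimeError(msg)
    LOG.warning("%s (possibly bound to userspace drivers: %s)", msg, detached)
```

The heuristic cannot prove that a given missing MAC belongs to a given detached port, because the MAC is not readable from sysfs once the kernel driver is unbound. That is why this sketch only downgrades the error and keeps a `strict` switch for the current behaviour.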

** Affects: cloud-init
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1967222

Title:
  Some hardening for vfio devices being less fatal at reboot

Status in cloud-init:
  New
