On Mon, 18 Nov 2024 09:58:03 +0100 Uwe Kleine-König wrote:

[...]
> On Wed, Nov 13, 2024 at 11:15:03PM +0100, Francesco Poli wrote:
> > On Mon, 11 Nov 2024 11:22:26 +0100 Uwe Kleine-König wrote:
[...]
> > > I guess the kernel provides a directory "/sys/class/infiniband_mad". Do
> > > its contents look different on 6.10.x and 6.11.x?
> >
> > I will look into this as soon as I can reboot the cluster head node.

I looked into this, while testing the new Debian Linux kernel that has
just migrated to testing (which, once again, makes opensm fail to
start, just like other 6.11.x versions).

With a working kernel:

  $ uname -v
  #1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1 (2024-09-22)
  $ ls -altrF /sys/class/infiniband_mad/
  total 0
  lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
  lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 11 15:54 ./
  drwxr-xr-x 72 root root    0 Nov 11 15:54 ../
  -r--r--r--  1 root root 4096 Nov 11 15:54 abi_version
  $ cat /sys/class/infiniband_mad/abi_version
  5

With a kernel that makes opensm fail to start:

  $ uname -v
  #1 SMP PREEMPT_DYNAMIC Debian 6.11.7-1 (2024-11-09)
  $ ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 73 root root    0 Nov 18 09:41 ../
  -r--r--r--  1 root root 4096 Nov 18 09:41 abi_version
  lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
  lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 18 09:43 ./
  $ cat /sys/class/infiniband_mad/abi_version
  5

As you can see, a couple of files (symlinks) are missing here...

Does this ring a bell?
Can you tell what's wrong, by just looking at this?
Or, at least, do you get some less vague idea of what's going on?

[...]
> > Before I go on and try to install the resulting Debian package, could
> > you please review the transcript of what I did (see the attached file)?
>
> Looks good.
> Probably the individual answers don't matter much and the
> default should be fine. Just continue with my instructions and if the
> resulting kernels boots and behave as the respective versions packaged
> by Debian, everything is fine. Iff that fails, a more detailed review is
> needed.

Thanks for confirming, I really hope I can find a time window, where I
can bisect...


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
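
The comparison of the two /sys/class/infiniband_mad/ listings in this message could be repeated quickly after each bisection step with a small script. What follows is only a rough sketch: it assumes that every umadN entry should have a matching issmN entry, which is what the listing from the working 6.10.11 kernel suggests, and the function name missing_issm is made up for this illustration.

```python
import os
import re


def missing_issm(mad_dir="/sys/class/infiniband_mad"):
    """Return the issmN names expected next to each umadN but not present.

    Assumption: one issmN per umadN, as seen on the working kernel.
    """
    entries = set(os.listdir(mad_dir))
    missing = []
    for name in sorted(entries):
        match = re.fullmatch(r"umad(\d+)", name)
        if match and f"issm{match.group(1)}" not in entries:
            missing.append(f"issm{match.group(1)}")
    return missing


if __name__ == "__main__":
    path = "/sys/class/infiniband_mad"
    if os.path.isdir(path):
        gone = missing_issm(path)
        print("missing:", ", ".join(gone) if gone else "none")
    else:
        print(path, "not found")
```

On a kernel showing the problem described in this message, the script would be expected to report both issm entries as missing. It says nothing about why they are missing.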