Hi Ole,

Yes, it's very similar. I've also put our systemd unit file online at https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
And we add it as a dependency for slurmd:

    $ cat /etc/systemd/system/slurmd.service.d/wait.conf
    [Service]
    Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
    LimitMEMLOCK=infinity

    [Unit]
    After=waitforib.service
    Requires=munge.service
    Wants=waitforib.service

So far this has worked flawlessly.

Ward

On 2/11/2023 09:28, Ole Holm Nielsen wrote:
Hi Ward,

Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.

Can I ask how you run your script as a service in the systemd boot process? Is it perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?

Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:

We have a slightly different script to do the same. It only relies on /sys:

    # Search for infiniband devices and wait until
    # at least one reports that it is ACTIVE
    if [[ ! -d /sys/class/infiniband ]]
    then
        logger "No infiniband found"
        exit 0
    fi

    ports=$(ls /sys/class/infiniband/*/ports/*/state)

    for (( count = 0; count < 300; count++ ))
    do
        for port in ${ports}; do
            if grep -q ACTIVE "$port"; then
                logger "Infiniband online at $port"
                exit 0
            fi
        done
        sleep 1
    done

    logger "Failed to find an active infiniband interface"
    exit 1
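
For anyone who wants to try the /sys probe outside of a boot sequence, here is a minimal sketch of the same loop wrapped in a function. The function name wait_for_ib and its two arguments (the sysfs directory and a timeout in seconds) are illustrative assumptions and are not part of Ward's script or gist. Unlike the script above, this variant re-expands the glob on every pass, on the assumption that ports may only show up after the driver has loaded, and it prints the active port instead of calling logger.

```shell
#!/bin/bash
# Sketch only: wait_for_ib, sysfs_root and timeout are hypothetical names.
wait_for_ib() {
    local sysfs_root=${1:-/sys/class/infiniband}
    local timeout=${2:-300}
    local count port

    # No InfiniBand directory at all: nothing to wait for.
    if [[ ! -d $sysfs_root ]]; then
        return 0
    fi

    for (( count = 0; count < timeout; count++ )); do
        # Re-glob on each pass so ports that appear late are still seen.
        for port in "$sysfs_root"/*/ports/*/state; do
            if [[ -r $port ]] && grep -q ACTIVE "$port"; then
                echo "$port"
                return 0
            fi
        done
        sleep 1
    done

    # No port became ACTIVE within the timeout.
    return 1
}
```

Calling wait_for_ib with no arguments probes /sys/class/infiniband for up to 300 seconds, as in the original script; passing a temporary directory and a short timeout makes it possible to test the logic on a machine without InfiniBand hardware.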