Hi Jim,

I don't know if it makes a difference, but I only ever use the complete numeric suffix within brackets, as in sjc01enadsapp[01-08] rather than sjc01enadsapp0[1-8].
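If you want to try that, the two NodeName lines from your configs would become (everything else kept exactly as in your files):

  NodeName=sjc01enadsapp[01-08] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
  NodeName=sjc01enadsapp[01-08] Name=gpu File=/dev/nvidia[0-3]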
Otherwise, I'd raise the debug level of slurmd to maximum by setting SlurmdDebug=debug5 in slurm.conf, tail the SlurmdLogFile on a GPU node, and then restart slurmd there.
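Roughly like this, on one of the GPU nodes (the log path below is only a guess on my part; whatever SlurmdLogFile points to in your slurm.conf is authoritative):

  # 1) set SlurmdDebug=debug5 in /etc/slurm-llnl/slurm.conf on the node
  # 2) check where slurmd actually logs to:
  grep -i SlurmdLogFile /etc/slurm-llnl/slurm.conf
  # 3) follow the log in one terminal ...
  tail -f /var/log/slurm/slurmd.log
  # 4) ... and restart slurmd in another:
  systemctl restart slurmd
  # then watch for gres/gpu lines, e.g. whether /dev/nvidia[0-3] are found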
This might shed some light on what goes wrong.

Cheers,
Stephan

On 03.05.22 20:51, Jim Kavitsky wrote:
Whoops. Sent the first to an incorrect address… apologies if this shows up as a duplicate.

-jimk

From: Jim Kavitsky <jimkavit...@lucidmotors.com>
Date: Tuesday, May 3, 2022 at 11:46 AM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: gres/gpu count lower than reported

Hello Fellow Slurm Admins,

I have a new Slurm installation that was working and running basic test jobs until I added GPU support. My worker nodes are now all in drain state, with "gres/gpu count reported lower than configured (0 < 4)". This is in spite of the fact that nvidia-smi reports all four A100s as active on each node.

I have spent a good chunk of a week googling for the solution to this, and trying variants of the gpu config lines and restarting daemons, without any luck. The relevant lines from my current config files are below. The head node and all workers have the same gres.conf and slurm.conf files.

Can anyone suggest anything else I should be looking at or adding? I'm guessing this is a problem many have faced, and any guidance would be greatly appreciated.

root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
GresTypes=gpu
NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN

root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]

root@sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"
NODELIST         CPUS(A/I/O/T)  STATE  MEMORY   PARTITION  GRES   REASON
sjc01enadsapp01  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp02  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp03  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp04  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp05  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp06  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp07  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp08  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)

root@sjc01enadsapp07:~# nvidia-smi
Tue May  3 18:41:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG) D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich

Phone +41 44 632 30 59
stephan.r...@ee.ethz.ch
www.isg.ee.ethz.ch

Working days: Mon, Tue, Thu, Fri