[slurm-users] Shard conf weirdness

2025-02-24 Thread Reed Dier via slurm-users
Hoping someone can help me pin down the weirdness I’m experiencing. There are actually two issues I’ve run into: the root issue, and then something odd when trying to work around the root issue. v23.11.10 - Ubuntu 22.04 - slurm-smd debs built

[slurm-users] Re: Using sharding

2024-07-05 Thread Reed Dier via slurm-users
I would try specifying cpus and mem just to be sure it’s not requesting 0/all. Also, I was running into a weird issue in my lab cluster when I had oversubscribe=yes:2 set while playing with shards, where jobs would go pending on resources despite no allocation of gpu/shards. Once I reverted
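To make the first suggestion concrete, a minimal job request that pins CPUs, memory, and a single shard explicitly might look like the sketch below (the counts and the nvidia-smi payload are illustrative, not values from the thread):
> $ srun --gres=shard:1 --cpus-per-task=2 --mem=4G --pty nvidia-smi
The oversubscribe setting mentioned is a partition-level option in slurm.conf, i.e. OverSubscribe=YES:2 on the relevant PartitionName= line.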

[slurm-users] Re: File-less NVIDIA GeForce 4070 Ti being removed from GRES list

2024-04-02 Thread Reed Dier via slurm-users
Assuming that you have the CUDA drivers installed correctly (nvidia-smi works, for instance), you should create a gres.conf with just this line: > AutoDetect=nvml If that doesn’t automagically begin working, you can increase the verbosity of slurmd with > SlurmdDebug=debug2 It should then print a bu
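Spelled out as file contents, the two settings quoted in the preview would look as follows (everything else in those files is assumed, not shown):
> # gres.conf
> AutoDetect=nvml
>
> # slurm.conf (bump slurmd verbosity while debugging)
> SlurmdDebug=debug2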

[slurm-users] Re: GPU shards not exclusive

2024-02-29 Thread Reed Dier via slurm-users
Hi Will, I appreciate your corroboration. After we upgraded to 23.02.$latest, the issue seemed easier to reproduce than before. However, it appears to have subsided, and the only change I can potentially attribute that to was turning on > SlurmctldParameters=rl_enable in slurm.c
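For context, the referenced parameter is a slurm.conf setting; as I understand the 23.02 release notes, it enables per-user RPC rate limiting in slurmctld:
> SlurmctldParameters=rl_enable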

[slurm-users] GPU shards not exclusive

2024-02-14 Thread Reed Dier via slurm-users
I seem to have run into an edge case where I’m able to oversubscribe a specific subset of GPUs on one host in particular. Slurm 22.05.8 - Ubuntu 20.04 - cgroups v1 (ProctrackType=proctrack/cgroup). It seems to be partly a corner case with a couple of caveats. This host has 2 different GPU types in th
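For readers unfamiliar with the setup being described, a minimal sketch of a gres.conf for a host carrying two GPU types plus shards is shown below; the node name, GPU types, device paths, and counts are made up for illustration and are not the original host’s values:
> # gres.conf (illustrative)
> NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia0
> NodeName=gpu01 Name=gpu Type=rtx3090 File=/dev/nvidia1
> NodeName=gpu01 Name=shard Count=8
>
> # matching node definition in slurm.conf
> NodeName=gpu01 Gres=gpu:a100:1,gpu:rtx3090:1,shard:8 ...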

[slurm-users] sinfo gresused shard count wrong/incomplete

2024-02-07 Thread Reed Dier via slurm-users
I have a bash script that grabs current statistics from sinfo to ship into a time series database backing our Grafana dashboards. We recently began using shards with our GPUs, and I’m seeing some unexpected behavior in the data reported from sinfo. > $ sinfo -h -O "NodeHost:5 ,GresUsed:100 ,Gr
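The format string in the quoted command is cut off by the archive preview; a sketch of the kind of query being described, reconstructed rather than quoted, would be:
> $ sinfo -h -N -O "NodeHost:5 ,GresUsed:100 ,Gres:100"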