Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-27 Thread Richard Chang
Yes, the system is an HPE Cray EX, and I am trying to use switch/hpe_slingshot. RC On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote: On 10/28/22 07:35, Richard Chang wrote: I have observed that when I specify a switch type in the slurm.conf file and that particular switch type is not present in

Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-27 Thread Ole Holm Nielsen
On 10/28/22 07:35, Richard Chang wrote: I have observed that when I specify a switch type in the slurm.conf file and that particular switch type is not present in the slurmctld node, slurmctld panics and shuts down. Is this expected? My slurmctld doesn't have the switch type, but the computes

[slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-27 Thread Richard Chang
Hi, I have observed that when I specify a switch type in the slurm.conf file and that particular switch type is not present in the slurmctld node, slurmctld panics and shuts down. Is this expected? My slurmctld doesn't have the switch type, but the computes have that switch type. How can I s
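
For reference, the setting in question is a single line in slurm.conf. A minimal sketch (assuming the HPE Slingshot plugin mentioned in the follow-up; in general the switch plugin's library must be installable on every host that runs a Slurm daemon, including the slurmctld host):

    # slurm.conf -- select the switch plugin; slurmctld also loads this
    # plugin, so the library must be present on the controller as well
    SwitchType=switch/hpe_slingshot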

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
No problem! Glad it is working for you now. Best, -Sean On Thu, Oct 27, 2022 at 1:46 PM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote: > Thank you very much! > > Those were the missing settings! > > I am not sure how I overlooked it for nearly two days, but I am happy that > it's worki

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Dominik Baack
Thank you very much! Those were the missing settings! I am not sure how I overlooked it for nearly two days, but I am happy that it's working now. Cheers Dominik Baack On 27.10.2022 at 19:23, Sean Maxwell wrote: It looks like you are missing some of the slurm.conf entries related to enforc

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
It looks like you are missing some of the slurm.conf entries related to enforcing the cgroup restrictions. I would go through the list here and verify/adjust your configuration: https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf Best, -Sean On Thu, Oct 27, 2022 at 1:04 PM Do
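
A minimal sketch of the slurm.conf entries that page covers (values are typical examples, not the poster's actual configuration):

    # slurm.conf -- hand process tracking and task confinement to cgroups
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    # declare GPUs as a generic resource so device constraints can apply
    GresTypes=gpu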

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Dominik Baack
Hi, yes, ConstrainDevices is set: ### # Slurm cgroup support configuration file ### CgroupAutomount=yes # #CgroupMountpoint="/sys/fs/cgroup" ConstrainCores=yes ConstrainDevices=yes ConstrainRAMSpace=yes # # I attached the slurm configuration file as well. Cheers Dominik On 27.10.2022 at 17:57,
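
Reassembled, the cgroup.conf quoted in the preview above reads as follows (reproduced as posted, not a recommendation):

    ###
    # Slurm cgroup support configuration file
    ###
    CgroupAutomount=yes
    #
    #CgroupMountpoint="/sys/fs/cgroup"
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainRAMSpace=yes
    #
    #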

Re: [slurm-users] Dell <> GPU compatibility matrix?

2022-10-27 Thread Jens Dreger
Hi Chip! Without owning any R640s or R650s, my impression is that you will only be able to install a single-slot GPU due to space limitations. There is a single-slot Tesla with 24GB: Tesla A10 [1]. This will only fit if you have the R640/R650 version with full-profile slots. But the biggest proble

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
Hi Dominik, Do you have ConstrainDevices=yes set in your cgroup.conf? Best, -Sean On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote: > Hi, > > We are in the process of setting up SLURM on some DGX A100 nodes. We > are experiencing the problem that all GP

Re: [slurm-users] Dell <> GPU compatibility matrix?

2022-10-27 Thread Fulcomer, Samuel
The NVIDIA A10 would probably work. Check the Dell specs for card lengths that it can accommodate. It's also passively cooled, so you'd need to ensure that there's good airflow through the card. The proof would be installing a card and watching the temp when you run apps on it. It's 150W, so not t

[slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Dominik Baack
Hi, We are in the process of setting up SLURM on some DGX A100 nodes. We are experiencing the problem that all GPUs are available for users, even for jobs where only one should be assigned. It seems the requirement is forwarded correctly to the node, at least CUDA_VISIBLE_DEVICES is set to
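
A quick way to check the symptom (a hypothetical one-liner; assumes an 8-GPU DGX A100 with the GPUs declared in gres.conf) is to request one GPU and compare what the job can actually see:

    # ask for one GPU, then list the devices visible inside the job
    srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
    # with working cgroup device confinement, nvidia-smi lists one GPU;
    # in the problem described above, all eight A100s remain visible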

Re: [slurm-users] Dell <> GPU compatibility matrix?

2022-10-27 Thread Sean Caron
Hi Chip, Here's the page I've been using for reference: https://www.dell.com/en-us/dt/servers/server-accelerators.htm Best, Sean On Thu, Oct 27, 2022 at 11:03 AM Chip Seraphine wrote: > > We have a cluster of 1U Dells (R640s and R650s) and we’ve been asked to > install GPUs in them, specifi

[slurm-users] Dell <> GPU compatibility matrix?

2022-10-27 Thread Chip Seraphine
We have a cluster of 1U Dells (R640s and R650s) and we’ve been asked to install GPUs in them, specifically NVIDIA Teslas with at least 24GB RAM, so I’m trying to select the right card. In the past I’ve used Tesla T4s on similar hardware, but those are limited to 16GB. I know most of the reall

[slurm-users] salloc problem

2022-10-27 Thread Gizo Nanava
Hello, we ran into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node that boots from a powered-down state. If I resubmit a job immediately after the
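
For context, a minimal sketch of the two pieces that interact here (all paths, scripts, and timings are illustrative assumptions, not the poster's configuration):

    # slurm.conf -- power saving plus per-job container namespaces
    JobContainerType=job_container/tmpfs
    SuspendTime=600                              # power down after 10 min idle
    SuspendProgram=/usr/local/sbin/node_off.sh   # site-specific script (assumed)
    ResumeProgram=/usr/local/sbin/node_on.sh     # site-specific script (assumed)
    ResumeTimeout=600

    # job_container.conf -- where per-job /tmp namespaces are created
    BasePath=/var/lib/slurm/jobcontainer         # path is an assumption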