[slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-15 Thread Ran Du
Dear all, Does anyone know how to set #SBATCH options to get multiple GPU cards from different worker nodes? One of our users would like to request 16 NVIDIA V100 cards for his job, but there are only 8 GPU cards on each worker node. I have tried the following #SBATCH options: #SBA
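(The options themselves are truncated above; judging from Antony's reply below, the request appears to have been of this shape. This is a reconstruction, not the verbatim script:

    #SBATCH --nodes=2
    #SBATCH --gres=gpu:16   # 16 GPUs *per node*, which no 8-GPU node can satisfy

Since GRES counts apply per node, such a request cannot be scheduled on nodes with only 8 GPUs each.)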

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-15 Thread Antony Cleave
Ask for 8 GPUs on 2 nodes instead. In your script, just change the 16 to 8 and it should do what you want. You are currently asking for 2 nodes with 16 GPUs each, since GRES resources are counted per node. Antony On Mon, 15 Apr 2019, 09:08 Ran Du, wrote: > Dear all, > > Does anyone know how to set #SB
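A minimal sketch of the corrected batch header (node and GPU counts are from the thread; the task layout and program name are placeholders):

    #!/bin/bash
    #SBATCH --nodes=2            # two worker nodes
    #SBATCH --gres=gpu:8         # 8 GPUs on *each* node, 16 in total
    #SBATCH --ntasks-per-node=8  # hypothetical: one task per GPU

    srun ./gpu_application       # placeholder for the user's real program

Because --gres is applied per node, 2 nodes x gpu:8 yields the 16 V100s the user is after.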

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-15 Thread Ran Du
Dear Antony, Thanks a lot for your reply. I submitted a job following your advice, and there are no more sbatch errors. But because our cluster is under maintenance, I have to wait until tomorrow to see whether the GPU cards are allocated correctly. I will let you know as soon as the job is submitted

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-15 Thread Peter Steinbach
Hi Chris, thanks for following up on this thread. > First of all, you will want to use cgroups to ensure that processes that do not request GPUs cannot access them. We had a feeling that cgroups might be more optimal. Could you point us to documentation that suggests cgroups to be a requirement?

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-15 Thread Christopher Samuel
On 4/15/19 8:15 AM, Peter Steinbach wrote: > We had a feeling that cgroups might be more optimal. Could you point us to documentation that suggests cgroups to be a requirement? Oh, it's not a requirement, just that without it there's nothing to stop a process using GPUs outside of its allocation.
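For reference, device confinement is only a few lines of Slurm configuration; a sketch, to be checked against the local install and Slurm version:

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainDevices=yes   # jobs may only open the device files in their allocation
    ConstrainCores=yes     # optional: confine CPU cores as well

With ConstrainDevices=yes, a job that requested no GPU GRES cannot open /dev/nvidia* at all, which closes exactly the gap Chris describes.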

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-15 Thread Peter Steinbach
Hi Chris, thanks for the detailed feedback. This is Slurm 18.08.5, see also https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/Dockerfile#L9 Best, Peter

[slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Shihanjian Wang
Hi, We are doing a senior project involving the creation of a Pi Cluster. We are using 7 Raspberry Pi B+'s in this cluster. When we use sinfo to look at the status of the nodes, they appear as drained. We also encountered a problem while trying to update the state of the nodes: when we try to update them, scontrol returns an "invalid user id" error.
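(The command itself is cut off above; from the replies, it was evidently an attempt to clear the drained state, along these lines, with a hypothetical node name:

    scontrol update NodeName=pi01 State=RESUME   # rejected with "invalid user id" when run as an ordinary user

)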

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Andy Riebs
The "invalid user id" message suggests that you need to be running as root (or possibly as the slurm user?) to update the node state. Run "slurmd -Dvv" as root on one of the compute nodes and it will show you what it thinks is the socket/core/thread configuration.

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Colas Rivière
In addition, you can check why the nodes were set to drain with `scontrol show node | grep Reason`. The same information should also appear in the Slurm controller logs (e.g. /var/log/slurm/slurmctld.log). Colas On 2019-04-15 18:03, Andy Riebs wrote: > The "invalid user id" message suggests that
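For example:

    scontrol show node | grep Reason              # drain reason for each node
    grep -i drain /var/log/slurm/slurmctld.log    # matching controller log entries

A Reason such as "Low socket*core*thread count" (a common one when node definitions are written by hand) would point back at the CPU fields in slurm.conf.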

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Christopher Samuel
On 4/15/19 3:03 PM, Andy Riebs wrote: > Run "slurmd -Dvv" as root on one of the compute nodes and it will show you what it thinks is the socket/core/thread configuration. In fact: slurmd -C will tell you what it discovers in a way that you can use in the configuration file. All the best, Chris
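On a Raspberry Pi node the output might look like the following (values here are illustrative, not measured); the NodeName line can be pasted into slurm.conf nearly verbatim:

    $ slurmd -C
    NodeName=pi01 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=926
    UpTime=0-02:14:35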