Dear all,
Does anyone know how to set #SBATCH options to get multiple GPU cards
from different worker nodes?
One of our users would like to request 16 NVIDIA V100 cards for his
job, and there are 8 GPU cards on each worker node. I have tried the
following #SBATCH options:
#SBA
Ask for 8 GPUs on 2 nodes instead.
In your script, just change the 16 to 8 and it should do what you want.
You are currently asking for 2 nodes with 16 GPUs each, as GRES resources
are requested per node.
Antony
On Mon, 15 Apr 2019, 09:08 Ran Du wrote:
> Dear all,
>
> Does anyone know how to set #SBATCH options to get multiple GPU cards
> from different worker nodes?
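Since GRES counts are per node, the per-node GPU count drops from 16 to 8 while
--nodes stays at 2. A minimal sketch of the kind of batch header Antony
describes, assuming the site exposes the V100s as a `gpu` GRES (job name and
application are illustrative):

    #!/bin/bash
    #SBATCH --job-name=multi-node-gpu   # illustrative name
    #SBATCH --nodes=2                   # two worker nodes
    #SBATCH --ntasks-per-node=8         # e.g. one task per GPU
    #SBATCH --gres=gpu:8                # GRES is per node: 2 nodes x 8 = 16 GPUs total

    srun ./my_gpu_app                   # hypothetical application launched on both nodes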
Dear Antony,
Thanks a lot for your reply. I tried submitting a job following your
advice, and there are no more sbatch errors.
But because our cluster is under maintenance, I have to wait till
tomorrow to see whether the GPU cards are allocated correctly. I will let you
know as soon as the job is submitted.
Hi Chris,
thanks for following up on this thread.
> First of all, you will want to use cgroups to ensure that processes that do
> not request GPUs cannot access them.
We had a feeling that cgroups might be more optimal. Could you point us
to documentation that suggests cgroups to be a requirement?
On 4/15/19 8:15 AM, Peter Steinbach wrote:
> We had a feeling that cgroups might be more optimal. Could you point us
> to documentation that suggests cgroups to be a requirement?
Oh, it's not a requirement, just that without it there's nothing to stop
a process from using GPUs outside of its allocation.
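For reference, a sketch of the cgroup-related settings that typically provide
this confinement, assuming the task/cgroup plugin is in use (device paths are
illustrative; check them against the actual nodes):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    GresTypes=gpu

    # cgroup.conf
    ConstrainDevices=yes        # limit jobs to the devices (e.g. GPUs) they were allocated

    # gres.conf on a node with 8 GPUs (device files are illustrative)
    Name=gpu File=/dev/nvidia[0-7]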
Hi Chris,
thanks for the detailed feedback. This is Slurm 18.08.5; see also
https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/Dockerfile#L9
Best,
Peter
Hi,
We are doing a senior project involving the creation of a Pi Cluster. We
are using seven Raspberry Pi B+ boards in this cluster.
When we use sinfo to look at the status of the nodes, they appear as
drained. We also encountered a problem while trying to update the state of
the nodes. When trying to u
The "invalid user id" message suggests that you need to be running as
root (or possibly as the slurm user?) to update the node state.
Run "slurmd -Dvv" as root on one of the compute nodes and it will show
you what it thinks is the socket/core/thread configuration.
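A short sketch of both steps, assuming root access (the node name pi01 is
hypothetical):

    # On a compute node, as root: run slurmd in the foreground with verbose
    # logging to see the detected socket/core/thread layout
    slurmd -Dvv

    # On the head node, as root (or the configured SlurmUser): return a
    # drained node to service
    scontrol update NodeName=pi01 State=RESUME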
In addition, you can check why the nodes were set to drain with `scontrol
show node | grep Reason`.
The same information should also appear in the slurm controller logs
(e.g. /var/log/slurm/slurmctld.log).
Colas
On 2019-04-15 18:03, Andy Riebs wrote:
The "invalid user id" message suggests that
On 4/15/19 3:03 PM, Andy Riebs wrote:
Run "slurmd -Dvv" as root on one of the compute nodes and it will show
you what it thinks is the socket/core/thread configuration.
In fact:
slurmd -C
will tell you what it discovers in a way that you can use in the
configuration file.
All the best,
Chris
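A sketch of how that output is typically used, with the detected values copied
into slurm.conf (the node name and numbers below are purely illustrative; use
whatever `slurmd -C` actually prints):

    # On each compute node: print the detected hardware in slurm.conf syntax
    slurmd -C

    # Illustrative slurm.conf node definition built from that output
    NodeName=pi01 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=926 State=UNKNOWN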