Hi Sajesh,
On 10/8/20 4:18 pm, Sajesh Singh wrote:
Thank you for the tip. That works as expected.
No worries, glad it's useful. Do be aware that the core bindings for the
GPUs would likely need to be adjusted for your hardware!
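One way to pick sensible bindings, as a rough sketch (assumes the NVIDIA driver tools are installed; the core ranges below are invented, not taken from this thread):

  # Show which CPUs are local to each GPU, then mirror that in gres.conf
  nvidia-smi topo -m
  # e.g. if GPU0 reports CPU affinity 0-7 and GPU1 reports 8-15, the node's
  # gres.conf could carry:
  #   Name=gpu File=/dev/nvidia0 Cores=0-7
  #   Name=gpu File=/dev/nvidia1 Cores=8-15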
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel
Christopher,
Thank you for the tip. That works as expected.
-SS-
-----Original Message-----
From: slurm-users On Behalf Of Christopher Samuel
Sent: Thursday, October 8, 2020 6:52 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set
On 10/8/20 3:48 pm, Sajesh Singh wrote:
Thank you. Looks like the fix is indeed the missing file
/etc/slurm/cgroup_allowed_devices_file.conf
No, you don't want that; it will allow access to all of the GPUs whether
people have requested them or not.
What you want is in gres.conf and looks like the sketch below.
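A minimal example of that kind of entry (the core ranges and device paths are placeholders, not values confirmed in this thread):

  # /etc/slurm/gres.conf on the GPU node; Cores= must match your hardware
  Name=gpu File=/dev/nvidia0 Cores=0-7
  Name=gpu File=/dev/nvidia1 Cores=8-15

The File= lines are what let slurmd hand the allocated devices to the job and set CUDA_VISIBLE_DEVICES for it.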
Relu,
Thank you. Looks like the fix is indeed the missing file
/etc/slurm/cgroup_allowed_devices_file.conf
-SS-
-----Original Message-----
From: slurm-users On Behalf Of Christopher Samuel
Sent: Thursday, October 8, 2020 6:10 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh,
On 10/8/20 11:57 am, Sajesh Singh wrote:
debug: common_gres_set_env: unable to set env vars, no device files
configured
I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?
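A quick sanity check on the GPU node, as a sketch (assuming the /etc/slurm location mentioned elsewhere in this thread):

  ls -l /dev/nvidia*                    # are the device files actually present?
  grep -i file /etc/slurm/gres.conf     # does gres.conf point at them?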
All the best,
Chris
--
Chris Samuel : http:/
Do you have a line like this in your cgroup_allowed_devices_file.conf
/dev/nvidia*
?
Relu
On 2020-10-08 16:32, Sajesh Singh wrote:
It seems as though the modules are loaded, as when I run lsmod I get
the following:
nvidia_drm 43714 0
nvidia_modeset 1109636 1 nvidia_drm
Yes. It is located in the /etc/slurm directory
--
-SS-
From: slurm-users On Behalf Of Brian Andrus
Sent: Thursday, October 8, 2020 5:02 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set
do you have your gres.conf on the nodes also?
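One quick way to confirm that, as a sketch (the hostname is a placeholder):

  md5sum /etc/slurm/gres.conf                  # on the controller
  ssh gpunode01 md5sum /etc/slurm/gres.conf    # on the GPU node; the sums should match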
Brian Andrus
On 10/8/2020 11:57 AM, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set.
I only get a line returned for “Gres=”, but that is the same output I see on
another cluster that has GPUs, and the variable does get set on that cluster.
-Sajesh-
--
_
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
From any node you can run scontrol from, what does ‘scontrol show node
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and
“CfgTRES=”.
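For illustration, the output on a node where the GRES is picked up has roughly this shape (all values below are invented):

  $ scontrol show node GPUNODENAME | grep -i gres
     Gres=gpu:2
     CfgTRES=cpu=32,mem=192000M,billing=32,gres/gpu=2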
From: slurm-users on behalf of Sajesh Singh
Reply-To: Slurm User Community List
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm U
It seems as though the modules are loaded, as when I run lsmod I get the
following:
nvidia_drm 43714 0
nvidia_modeset 1109636 1 nvidia_drm
nvidia_uvm 935322 0
nvidia 20390295 2 nvidia_modeset,nvidia_uvm
Also the nvidia-smi command returns the following:
That usually means you don't have the nvidia kernel module loaded,
probably because there's no driver installed.
Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job requesting
GPUs the environment variable CUDA_VISIBLE_DEVICES is never set.
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following messages in
the slurmd.log file:
debug: common_gres_set_env: unable to set env vars, no device files configured
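A minimal way to reproduce the check from a login node (the GRES request below is an assumption about how the job asks for GPUs):

  # request one GPU and print the variable this thread is about; with no
  # File= entries in gres.conf this comes back empty and slurmd logs the
  # "no device files configured" debug message quoted above
  srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'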
Thank you very much for your comments. Oddly enough, I came up with the
3-partition model as well once I'd sent my email. So, your comments helped to
confirm that I was thinking on the right lines.
Best regards,
David
From: slurm-users on behalf of Thomas M. P
R is single threaded.
On Thu, 8 Oct 2020, 07:44 Diego Zuccato wrote:
> On 08/10/20 08:19, David Bellot wrote:
>
> > good spot. At least, scontrol show job is now saying that each job only
> > requires one "CPU", so it seems all the cores are treated the same way now.
> > Though I still h