Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 4:18 pm, Sajesh Singh wrote: Thank you for the tip. That works as expected. No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware! Best of luck, Chris -- Chris Samuel : http://www.csamuel

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Christopher, Thank you for the tip. That works as expected. -SS- -Original Message- From: slurm-users On Behalf Of Christopher Samuel Sent: Thursday, October 8, 2020 6:52 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] CUDA environment variable not being set EXTERNA

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
On 10/8/20 3:48 pm, Sajesh Singh wrote: Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is in gres.conf and looks lik
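Chris's reply is cut off at the point where he shows the gres.conf entry; a minimal sketch of what such an entry typically looks like follows. The Type label, device paths, and core ranges here are illustrative assumptions and must be adjusted to the node's actual hardware (as Chris notes later in the thread):

```conf
# /etc/slurm/gres.conf -- illustrative sketch, not the poster's actual file.
# One line per GPU device file; Cores= binds each GPU to nearby CPU cores.
Name=gpu Type=m500 File=/dev/nvidia0 Cores=0-7
Name=gpu Type=m500 File=/dev/nvidia1 Cores=8-15
```

With the device files listed explicitly like this, slurmd knows which /dev entries belong to each GPU and can set CUDA_VISIBLE_DEVICES for jobs that request them, rather than opening up all devices the way cgroup_allowed_devices_file.conf with /dev/nvidia* would.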

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Relu, Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf -SS- -Original Message- From: slurm-users On Behalf Of Christopher Samuel Sent: Thursday, October 8, 2020 6:10 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users]

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: debug:  common_gres_set_env: unable to set env vars, no device files configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs? All the best, Chris -- Chris Samuel : http:/

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
Do you have a line like this in  your cgroup_allowed_devices_file.conf /dev/nvidia* ? Relu On 2020-10-08 16:32, Sajesh Singh wrote: It seems as though the modules are loaded as when I run lsmod I get the following: nvidia_drm 43714  0 nvidia_modeset   1109636  1 nvidia_drm

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Yes. It is located in the /etc/slurm directory -- -SS- From: slurm-users On Behalf Of Brian Andrus Sent: Thursday, October 8, 2020 5:02 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] CUDA environment variable not being set EXTERNAL SENDER do you have your gres.conf on the n

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus
do you have your gres.conf on the nodes also? Brian Andrus On 10/8/2020 11:57 AM, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment vari

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
I only get a line returned for “Gres=”, but this is the same behavior on another cluster that has GPUs and the variable gets set on that cluster. -Sajesh- -- _ Sajesh Singh Manager, Systems and Scientific Computing American Museum of Natural Hi

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and “CfgTRES=”. From: slurm-users on behalf of Sajesh Singh Reply-To: Slurm User Community List Date: Thursday, October 8, 2020 at 3:33 PM To: Slurm U

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
It seems as though the modules are loaded as when I run lsmod I get the following: nvidia_drm 43714 0 nvidia_modeset 1109636 1 nvidia_drm nvidia_uvm 935322 0 nvidia 20390295 2 nvidia_modeset,nvidia_uvm Also the nvidia-smi command returns the followin

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed. Relu On 2020-10-08 14:57, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, b
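Relu's suggestion can be verified directly on the compute node. A minimal sketch (module and device names assume a standard NVIDIA driver install):

```shell
# Sketch: check whether the nvidia kernel module is loaded and, if so,
# list the GPU device files that a gres.conf entry would reference.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    ls -l /dev/nvidia*
else
    echo "nvidia kernel module not loaded"
fi
```

If the module is loaded and /dev/nvidia0, /dev/nvidia1, etc. exist (as turned out to be the case here), the problem lies in the Slurm GRES configuration rather than the driver.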

[slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following messages in the slurmd.log file: debug: co
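For reference, a minimal job script of the kind being described; the GRES request is an assumption based on the two GPUs mentioned, and the comment describes the expected behaviour on a correctly configured node:

```shell
#!/bin/bash
#SBATCH --gres=gpu:2          # request both GPUs (gres name assumed from gres.conf)
#SBATCH --time=00:05:00

# When GRES is configured with device files, slurmd exports the allocated
# device indices, e.g. CUDA_VISIBLE_DEVICES=0,1. If gres.conf lists no
# File= entries, the variable stays unset, matching the reported symptom.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
```

Submitted with sbatch, the echoed value in the job's output shows immediately whether slurmd set the variable for the allocation.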

Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 3-partition model as well once I'd sent my email. So, your comments helped to confirm that I was thinking on the right lines. Best regards, David From: slurm-users on behalf of Thomas M. P

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread William Brown
R is single-threaded. On Thu, 8 Oct 2020, 07:44 Diego Zuccato wrote: > Il 08/10/20 08:19, David Bellot ha scritto: > > > good spot. At least, scontrol show job is now saying that each job only > > requires one "CPU", so it seems all the cores are treated the same way > now. > > Though I still h