[slurm-users] Re: Problems with gres.conf

2024-06-04 Thread Patryk Bełzak via slurm-users
Hi,
I believe that setting Cores explicitly in gres.conf gives you better control 
over the hardware configuration; I wouldn't trust Slurm on that one.

We keep "Cores" in our gres.conf; all you have to do is proper NUMA discovery 
(as long as your hardware has NUMA) and then assign the correct cores to the 
correct GPUs. One simple way to discover the CPU affinity of the GPUs is the 
command `nvidia-smi topo -m`, which displays the hardware topology. You need a 
relatively new NVIDIA driver for it, though.
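
As a sketch (hypothetical node name and GPU type; the core ranges would come 
from the "CPU Affinity" column of `nvidia-smi topo -m` on that node), the 
resulting gres.conf entries would look something like:

# example only: each GPU bound to the cores of its local NUMA node
NodeName=gpunode-1 AutoDetect=off Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-31
NodeName=gpunode-1 AutoDetect=off Name=gpu Type=A100 File=/dev/nvidia1 Cores=32-63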

Also keep in mind that Intel made a mess of NUMA and core bindings on newer 
hardware. We have a system with 2 NUMA nodes where one node holds the even core 
numbers and the other the odd ones. Because of that the cores cannot be merged 
into a range like [1-128]; they have to be comma separated, [1,3,5,7,(..),128], 
and that kind of list does not match the `nvidia-smi` output. I think Slurm's 
affinity discovery may rely on something similar to `nvidia-smi`, because when I 
assigned the correct cores of the NUMA node (all 64) as discovered with hwloc, I 
still got the same error in slurmctld. I haven't investigated the impact of this 
error; as you mentioned, it is still possible to use those resources.
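
If you want to double check what the kernel reports, sysfs and hwloc are handy 
(the PCI address below is only an example; take yours from 
`nvidia-smi -q | grep -i "bus id"`):

# NUMA node and local CPU list of the GPU's PCI device (example address)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
cat /sys/bus/pci/devices/0000:3b:00.0/local_cpulist
# full topology view, with PCI devices listed under their NUMA node
lstopo-no-graphics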

Best regards,
Patryk.

On 24/05/20 04:17, Gestió Servidors via slurm-users wrote:
> Hello,
> 
> I am trying to rewrite my gres.conf file.
> 
> Before changes, this file was just like this:
> NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 
> File=/dev/nvidia0 Cores=0-11
> NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
> File=/dev/nvidia1 Cores=12-23
> NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
> File=/dev/nvidia0 Cores=0-11
> NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 
> File=/dev/nvidia1 Cores=12-23
> NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 
> File=/dev/nvidia0 Cores=0-11
> NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 
> File=/dev/nvidia0 Cores=0-7
> # you can see that nodes node-gpu-1 and node-gpu-2 have two GPUs each, 
> whereas nodes node-gpu-3 and node-gpu-4 have only one GPU each
> 
> 
> And my slurm.conf was this:
> [...]
> AccountingStorageTRES=gres/gpu
> GresTypes=gpu
> NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 
> ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 
> Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
> NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 
> ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 
> Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
> NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 
> ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
> NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 
> ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
> NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 
> ThreadsPerCore=2 RealMemory=47000
> [...]
> 
> With this configuration, everything seems to work fine, except that slurmctld.log reports:
> [...]
> error: _node_config_validate: gres/gpu: invalid GRES core specification 
> (0-11) on node node-gpu-3
> error: _node_config_validate: gres/gpu: invalid GRES core specification 
> (12-23) on node node-gpu-1
> error: _node_config_validate: gres/gpu: invalid GRES core specification 
> (12-23) on node node-gpu-2
> error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) 
> on node node-gpu-4
> [...]
> 
> However, even with these errors, users can submit jobs and request GPU resources.
> 
> 
> 
> Now, I have tried to reconfigure gres.conf and slurmd.conf in this way:
> gres.conf:
> Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
> Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
> Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
> Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
> Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
> Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
> # there is no NodeName attribute
> 
> slurm.conf:
> [...]
> NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
> RealMemory=96000 TmpDisk=47000 
> Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
> NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
> RealMemory=96000 TmpDisk=47000 
> Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
> NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 
> RealMemory=23000 Gres=gpu:GeForceRTX3080:1
> NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 
> RealMemory=7800 Gres=gpu:GeForceRTX3080:1
> NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 
> ThreadsPerCore=2 RealMemory=47000
> # there is no CPUs attribute
> [...]
> 
> 
> With this new configuration, the nodes with GPUs start the slurmd.service 
> daemon correctly, but the nodes without GPUs (node-worker-[0-22]) can't start 
> slurmd.service and return this error:
> [...]
> error: Waiting for gres.conf file /dev/nvidia0
> fatal: can't sta

[slurm-users] diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
At the moment we have 2 nodes that are having long wait times. Generally
this happens when the nodes are fully allocated. What other reasons would
make a job wait so long if there is still enough memory and CPU available?
Slurm version is 23.02.4 via Bright Computing. Note the compute nodes have
hyperthreading enabled, but that should be irrelevant. Is there a way to
determine what else could be holding jobs up?

srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short
/bin/bash
srun: job 672204 queued and waiting for resources

 scontrol show node node001
NodeName=m001 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:A6000:8
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38
EDT 2022
   RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ours,short
   BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
   LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
   CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
   AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204
NodeList=(null) usec=852

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Ryan Novosielski via slurm-users
This is relatively true of my system as well, and I believe it’s that the 
backfill scheduler is slower than the main scheduler.
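
If you want to confirm it is the backfill pass, a couple of stock commands help 
(job ID taken from your example; field names are from a default Slurm install):

# backfill cycle times, depth and queue length since the last restart
sdiag | grep -A 15 "Backfilling stats"
# the scheduler's stated reason and estimated start time for the pending job
squeue -j 672204 -O JobID,StateCompact,ReasonList,StartTime
# priority breakdown, in case it is fair-share rather than backfill depth
sprio -j 672204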

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users wrote:

At the moment we have 2 nodes that are having long wait times. [...]


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
Thanks for the quick response, Ryan!

Are there any recommendations for bf_ options from
https://slurm.schedmd.com/sched_config.html that could help with this?
bf_continue? Decreasing bf_interval= to a value lower than 30?

On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski wrote:

> This is relatively true of my system as well, and I believe it’s that the
> backfill scheduler is slower than the main scheduler.
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Ryan Novosielski via slurm-users
We do have bf_continue set, and also bf_max_job_user=50, because we discovered 
that one user can submit so many jobs that the backfill scheduler hits the limit 
on the number of jobs it will consider and skips some jobs it could otherwise run.
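
For reference, in slurm.conf that corresponds to roughly the following (only 
bf_continue and bf_max_job_user are what I described; bf_interval and 
bf_max_job_test are shown at their defaults purely for illustration):

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_max_job_user=50,bf_interval=30,bf_max_job_test=500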

On Jun 4, 2024, at 16:20, Robert Kudyba wrote:

Thanks for the quick response, Ryan!

Are there any recommendations for bf_ options from 
https://slurm.schedmd.com/sched_config.html that could help with this? 
bf_continue? Decreasing bf_interval= to a value lower than 30?


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com