[slurm-users] Re: Redirect jobs submitted to old partition to new

2024-04-16 Thread Williams, Jenny Avis via slurm-users
For jobs already in default_queue: squeue -t pd -h --Format=jobID | xargs -L1 -I{} scontrol update jobID={} partition=queue1 What version of Slurm are you running? In Slurm 23.02.5, see man slurm.conf under PARTITION CONFIGURATION: Alternate Partition name of alternate parti
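
A minimal sketch of the same per-job update with a dry-run switch, so the commands can be checked before anything is changed. The names move_jobs and DRY_RUN are hypothetical and not part of Slurm. The squeue call in the usage comment adds -p default_queue and -o %i as assumptions, to limit the list to pending jobs in the old partition and print bare job IDs; pending array jobs may need separate handling.

```shell
# Sketch only: move_jobs and DRY_RUN are hypothetical names, not Slurm features.
# Reads one job ID per line on stdin and moves each job to the given partition.
move_jobs() {
    new_partition=$1
    while read -r jobid; do
        # Skip blank lines so scontrol is never called with an empty job ID.
        [ -n "$jobid" ] || continue
        # With DRY_RUN set, print the command instead of running it.
        ${DRY_RUN:+echo} scontrol update jobid="$jobid" partition="$new_partition"
    done
}

# Example (needs a live cluster and sufficient privileges):
#   squeue -h -t PD -p default_queue -o %i | DRY_RUN=1 move_jobs queue1
```

Unsetting DRY_RUN runs the scontrol commands for real.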

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
The end goal is to see two things: jobs under the slurmstepd cgroup path, and at least cpu, cpuset, and memory listed in the cgroup.controllers file within each job's cgroup. The pattern you have would be the processes left after boot, first failed slurmd service start which l

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html There is probably a better way to do this, but this is what we do to deal with that: :: files/slurm-cgrepair.service :: [Unit] Before=slurmd

[slurm-users] Re: Avoiding fragmentation

2024-04-10 Thread Williams, Jenny Avis via slurm-users
Various options that might help reduce job fragmentation. Turn up debugging on slurmctld and add the DebugFlags like TraceJobs, SelectType, and Steps. With debugging set high enough one can see a good bit of the logic in regard to node selection. CR_LLN Schedule
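
One way to raise slurmctld logging at runtime, sketched under assumptions: the helper name slurmctld_debug_cmds is hypothetical, and debug2 is only an example level, since the message does not say which level is "high enough". The helper just prints the scontrol setdebug and scontrol setdebugflags commands so they can be reviewed first.

```shell
# Sketch only: slurmctld_debug_cmds is a hypothetical helper, not a Slurm tool.
# Prints the scontrol commands that set the log level and enable each debug flag.
slurmctld_debug_cmds() {
    level=$1
    shift
    echo "scontrol setdebug $level"
    for flag in "$@"; do
        # A leading "+" adds the flag to the currently active set.
        echo "scontrol setdebugflags +$flag"
    done
}

# Print the commands for the flags named in the message above.
slurmctld_debug_cmds debug2 TraceJobs SelectType Steps
```

Piping the output to sh on the controller as a Slurm administrator would apply the settings; they last until changed again or until slurmctld restarts or is reconfigured.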

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-03 Thread Williams, Jenny Avis via slurm-users
Slurm source code should be downloaded and recompiled with the configure flag --with-nvml. As an example, this is our current method, which uses the rpmbuild mechanism to recompile and generate RPMs. Be aware that the compile works only if it finds the prerequisites needed for a given op

[slurm-users] Re: Slurm suspend preemption not working

2024-03-15 Thread Williams, Jenny Avis via slurm-users
CPUs are released, but memory is not released on suspend. Try looking at this output and compare allocated Memory before and after suspending a job on a node: sinfo -N -n yourNode --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8 From: Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-u

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
Also -- scontrol show nodes -----Original Message----- From: Williams, Jenny Avis Sent: Thursday, March 14, 2024 6:46 PM To: Ole Holm Nielsen; slurm-users@lists.schedmd.com Subject: RE: [slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource I use an alias slist = ` s

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
I use an alias slist = ` sed 's/ /\n/g' |sort|uniq`. Do not copy and paste lines containing "--"; what appears in this message is not the two hyphens intended. The examples below are for Slurm 23.02.7. These commands assume administrator access. This is a generalized set of areas I use to find why things just are not moving

[slurm-users] Re: RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-23.10-1.1.9.0+ OpenMPI-5.0.2

2024-02-20 Thread Williams, Jenny Avis via slurm-users
How was your binary compiled? If it is dynamically linked, please reply with the ldd listing of the binary ( ldd binary ) Jenny From: S L via slurm-users Sent: Tuesday, February 20, 2024 10:55 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-2