[slurm-users] Understanding fairshare factor

2022-01-12 Thread Michał Kadlof
Hello, I'm trying to understand the behavior of the fairshare factor. I set up munin monitoring for several accounts and am observing the changes over time, and they are not clear to me. Some background: my users are split into two groups, sfglab and faculty. In sfglab everyone is equal, and in faculty they a

Re: [slurm-users] Slurm and MPICH

2022-01-12 Thread Roger Mason
Hello, "Mccall, Kurt E. (MSFC-EV41)" writes: > MPICH uses the PMI 1 interface by default, but for our 20.02.3 Slurm > installation, "srun --mpi=list" yields > > > > $ srun --mpi=list > > srun: MPI types are... > > srun: cray_shasta > > srun: pmi2 > > srun: none > > > > PMI 2 is there, but no

[slurm-users] Scheduler does not reserve resources

2022-01-12 Thread Jérémy Lapierre
Hi to all Slurm users, we have the following issue: jobs with the highest priority are pending forever with the "Resources" reason. More specifically, the jobs pending forever ask for 2 full nodes, but all other jobs from other users (running or pending) need only 1/4 of a node, then pending jobs ask

Re: [slurm-users] Scheduler does not reserve resources

2022-01-12 Thread Rodrigo Santibáñez
Hi Jeremy, I had a similar behavior a long time ago, and I decided to set SchedulerType=sched/builtin to empty X nodes of jobs and execute that high-priority job requesting more than one node. It is not ideal, but the cluster has low load, so a user that requests more than one node doesn't delay t

[slurm-users] Building Slurm with UCX support

2022-01-12 Thread Matthias Leopold
Hi, I'm compiling Slurm with ansible playbooks from NVIDIA deepops framework (https://github.com/NVIDIA/deepops). I'm trying to add UCX support. How can I tell if UCX is actually included in the resulting binaries (without actually using Slurm)? I was looking at executables and *so files with

Re: [slurm-users] [EXT] Building Slurm with UCX support

2022-01-12 Thread Ozeryan, Vladimir
I am not sure about the rest of the Slurm world, but since I will most likely update OpenMPI more often than Slurm, I've configured and built OpenMPI with UCX and Slurm support, and I think both are enabled by default unless you specify a "--without" option. Works great so far!

[slurm-users] Questions about default_queue_depth

2022-01-12 Thread David Henkemeyer
Hello, a few weeks ago we tested Slurm against about 50K jobs and observed at least one instance where a node went idle while there were jobs in the queue that could have run on the idle node. The best guess as to why this occurred, at this point, is that the default_queue_depth was set to the
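
The hypothesis in this message can be illustrated with a toy model. The sketch below is not Slurm's scheduler: it assumes one node per job, a hypothetical `partition` field as the only placement constraint, and no backfill or periodic full-queue pass. It only shows how a pass that examines the first `default_queue_depth` jobs in priority order can leave a node idle while a runnable job sits deeper in the queue.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int
    partition: str  # hypothetical constraint: the only partition this job can use


def main_scheduler_pass(queue, idle_nodes, default_queue_depth):
    """Toy pass: consider only the top `default_queue_depth` jobs by priority.

    `idle_nodes` maps partition name -> number of idle nodes; each job is
    assumed to need exactly one node. Returns the names of started jobs.
    """
    idle = dict(idle_nodes)  # copy so the caller's dict is not modified
    started = []
    ordered = sorted(queue, key=lambda j: j.priority, reverse=True)
    for job in ordered[:default_queue_depth]:
        if idle.get(job.partition, 0) > 0:
            idle[job.partition] -= 1
            started.append(job.name)
    return started


# Five high-priority jobs wait for a fully busy "gpu" partition, while one
# lower-priority job could run on the idle "cpu" node.
queue = [Job(f"gpu{i}", 100 - i, "gpu") for i in range(5)] + [Job("cpu0", 10, "cpu")]
idle_nodes = {"gpu": 0, "cpu": 1}

shallow = main_scheduler_pass(queue, idle_nodes, default_queue_depth=3)
deep = main_scheduler_pass(queue, idle_nodes, default_queue_depth=10)
print(shallow)  # [] : cpu0 is outside the examined window, so the cpu node stays idle
print(deep)     # ['cpu0']
```

In this model, raising the depth past the position of the runnable job is enough to start it; on a real cluster the backfill scheduler and other SchedulerParameters settings also affect the outcome.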

Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Renfro, Michael
Not answering every question below, but for (1) we're at 200 on a cluster with a few dozen nodes and around 1k cores, as per https://lists.schedmd.com/pipermail/slurm-users/2021-June/007463.html -- there may be other settings in that email that could be beneficial. We had a lot of idle resource

[slurm-users] big increase of MaxStepCount?

2022-01-12 Thread John R Anderson
Hello, a user has requested that we set MaxStepCount to "unlimited" or 16 million to accommodate some of their desired workflows. I searched around for details about this parameter and don't see a lot, and I reviewed https://bugs.schedmd.com/show_bug.cgi?id=5722. Any thoughts on this? can this suc

Re: [slurm-users] Building Slurm with UCX support

2022-01-12 Thread Matthias Leopold
On 12.01.22 at 17:54, Matthias Leopold wrote: Hi, I'm compiling Slurm with ansible playbooks from NVIDIA deepops framework (https://github.com/NVIDIA/deepops). I'm trying to add UCX support. How can I tell if UCX is actually included in the resulting binaries (without actually using Slurm

[slurm-users] memory limits:: why job is not killed but oom-killer steps up?

2022-01-12 Thread Adrian Sevcenco
Hi! I have a problem with enforcing memory limits. I'm using cgroups to enforce the limits, and I had expected that the job would be killed when the cgroup memory limit is reached. Instead I see a lot of oom-killer reports in the log, and the oom-killer acts only on a certain process from the cgroup. Did I miss

Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Bjørn-Helge Mevik
David Henkemeyer writes: > 3) Is there a way to see the order of the jobs in the queue? Perhaps > squeue lists the jobs in order? "squeue -S -p" sorts jobs in descending priority order. -- B/H
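
For scripted use, the same query can be wrapped as in the minimal sketch below. The helper names are hypothetical; `--sort=-p` is the long form of the `-S -p` option quoted above, and the function returns `None` when `squeue` is not installed.

```python
import shutil
import subprocess


def squeue_priority_command(extra_args=()):
    """Build the squeue argument list; --sort=-p is the long form of `-S -p`."""
    return ["squeue", "--sort=-p", *extra_args]


def jobs_by_priority():
    """Return squeue's text output, or None when squeue is not on PATH."""
    if shutil.which("squeue") is None:
        return None
    result = subprocess.run(
        squeue_priority_command(),
        capture_output=True,
        text=True,
        check=True,  # raise if squeue exits with an error
    )
    return result.stdout
```

Calling `jobs_by_priority()` on a login node should return the pending and running jobs with the highest priority first.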