We're seeing SLURM misbehaving on one of our clusters, which runs
Ubuntu 22.04.
Among other problems, we see an error message about a missing
library version that would have shipped with Ubuntu 20.04, not 22.04.
It's not clear whether the library is being called from a SLURM component
or
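One way I know to narrow that down (just a sketch; "libfoo.so.1" is a
placeholder, not the actual library from our logs, and the plugin path may
differ on your install) is to check what actually links against it:

# check the daemons and client tools
for b in /usr/sbin/slurmd /usr/sbin/slurmctld /usr/bin/srun; do
    echo "== $b"; ldd "$b" | grep -i libfoo
done
# plugins are dlopen()ed at run time, so scan those separately
find /usr/lib/x86_64-linux-gnu/slurm-wlm -name '*.so' \
    -exec sh -c 'ldd "$1" | grep -qi libfoo && echo "$1"' _ {} \;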
The SBATCH_EXCLUSIVE environment variable is supposed to be equivalent
to using the --exclusive flag on the command line or in the sbatch header:
*--exclusive*[={user|mcs}]
The job allocation can not share nodes with other running jobs (or
just other users with the "=user" option, or with the "=mcs" option).
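For reference, the three forms I'm comparing (the script name and the
value given to the environment variable are my guesses, not something
I've confirmed in the documentation):

# on the command line
sbatch --exclusive job.sh
# in the batch script header
#SBATCH --exclusive
# via the environment variable at submit time
SBATCH_EXCLUSIVE=1 sbatch job.sh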
Our cluster has some nodes split off into their own partition for running
interactive sessions, which are required to be short and to use only a few
nodes.
I've always disliked this approach because I see some of the interactive
nodes sitting idle while other jobs are waiting in the batch partition.
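For context, the split looks roughly like this in slurm.conf (node names
and limits here are illustrative, not our real settings):

# small, short, dedicated partition for interactive work
PartitionName=interactive Nodes=node[001-004] MaxNodes=2 MaxTime=02:00:00 State=UP
# everything else
PartitionName=batch Nodes=node[005-128] MaxTime=7-00:00:00 Default=YES State=UP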
Take a look at this extension to SLURM:
https://github.com/NVIDIA/pyxis
You put the container path on the srun command line and each rank runs
inside its own copy of the image.
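A typical command looks like this (the image and program are just
examples; pyxis also needs enroot available on the compute nodes):

# four ranks, each in its own instance of the ubuntu:22.04 image
srun -N 2 --ntasks-per-node=2 \
     --container-image=ubuntu:22.04 \
     grep PRETTY_NAME /etc/os-release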
In our scaling tests, it's normal to expect job run times to go down
as we increase the node count.
Is there a way in SLURM to limit the NODES*TIME product for a partition,
or do we just have to define a different partition (with a different
duration limit) for each job size?
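The workaround I have in mind would look something like this in slurm.conf
(node ranges and limits invented for illustration), with NODES*TIME kept
roughly constant across partitions:

PartitionName=small Nodes=node[001-256] MaxNodes=8   MaxTime=2-00:00:00
PartitionName=large Nodes=node[001-256] MaxNodes=128 MaxTime=03:00:00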
I put this line in my job-control file (written in bash) to capture the
original as part of the run:
cp "$0" "$RUNDIR/$SLURM_JOB_NAME"
The $0 gives the full path to the working copy of the script, so it
expands to something like this:
/fs/slurm/var/spool/job67842/slurm_script
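In context it's just a few lines at the top of the batch script (RUNDIR is
a directory of my own choosing, not a SLURM variable):

#!/bin/bash
#SBATCH --job-name=myrun
# keep one record directory per job so runs don't clobber each other
RUNDIR=$HOME/runs/$SLURM_JOB_ID
mkdir -p "$RUNDIR"
# $0 points at the spooled copy of the script, so this saves it exactly
# as submitted, under the job's name
cp "$0" "$RUNDIR/$SLURM_JOB_NAME"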
It depends on t
One of the things I appreciate about SLURM is that I can write simple
statements like this
squeue -t R -a -O "nodelist:20,jobid:8,username:14,timelimit:14,timeused:12,PARTITION:13,QOS:18,command:0"
that shows the list of running jobs along with stats showing when they
can be expected to finish.
I wrote a script that uses "screen" to create side-by-side windows that
run cooperating processes and show their outputs together.
This looks fine when I run it remotely over an "ssh" connection. (Note
that I don't need to use "ssh -X").
If I run it over an "srun" connection using forms like t