Also consider the --no-kill ("-k") option to sbatch (and srun).
From the sbatch man page:
-k, --no-kill [=off]
Do not automatically terminate a job if one of the nodes
it has been allocated fails. The user will
assume the responsibilities for fault-tolerance should a node fail.
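For example (a minimal sketch; job.sh is a placeholder script name), the option can be given on the command line or inside the batch script:

    # on the command line ...
    sbatch --no-kill job.sh
    # ... or as a batch script directive
    #SBATCH --no-kill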
What about, instead of an (automatic) requeue of the job, using --no-requeue in
the first sbatch, and when something goes wrong with the job (why not
something wrong with the node?), submitting the job again with --no-requeue and
the failed nodes excluded?
Something like: sbatch --no-requeue file.sh, and then sbatch
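Spelled out as a sketch (the node name node042 and the script name file.sh are placeholders):

    # first submission, without automatic requeue
    sbatch --no-requeue file.sh
    # after a failure on node042, resubmit with that node excluded
    sbatch --no-requeue --exclude=node042 file.sh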
Not quite.
The user's job script in question checks the error status of the program
it runs while the job is still running. If the program fails, the running job
wants to exclude the machine it is currently running on and requeue itself, in
case it died due to a local machine issue that the scheduler has
Geoffrey,
A lot depends on what you mean by “failure on the current machine”. If it’s a
failure that Slurm recognizes as a failure, Slurm can be configured to remove
the node from the partition, and you can follow Rodrigo’s suggestions for the
requeue options.
If the user job simply decides it
Hello,
Jobs can be requeued if something goes wrong, and the node with the failure
excluded by the controller.
*--requeue* Specifies that the batch job should be eligible for requeuing.
The job may be requeued explicitly by a system administrator, after node
failure, or upon preemption by a higher priority job.
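For example (a sketch; file.sh and the job ID 12345 are placeholders):

    # submit the job as eligible for requeueing
    sbatch --requeue file.sh
    # later, an administrator can requeue it explicitly
    scontrol requeue 12345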
Hello
We are moving from Univa (SGE) to Slurm, and one of our users has jobs that,
if they detect a failure on the current machine, add that machine to their
exclude list and requeue themselves. The user wants to emulate that behavior in
Slurm.
It seems like "scontrol update job ${SLURM_JO
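A rough sketch of that idea inside the batch script, assuming the job was submitted with --requeue, that users are allowed to update ExcNodeList on their own jobs, and that SLURMD_NODENAME is set in the job environment (./my_program is a placeholder; whether the updated exclude list survives the requeue is worth testing on your Slurm version):

    #!/bin/bash
    #SBATCH --requeue

    ./my_program
    if [ $? -ne 0 ]; then
        # add the node this batch script is running on to the job's exclude list
        scontrol update JobId=$SLURM_JOB_ID ExcNodeList=$SLURMD_NODENAME
        # put the job back in the queue; the current run is terminated
        scontrol requeue $SLURM_JOB_ID
    fi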
Hi Team,
I am seeing a weird issue in my environment.
One of the Gaussian jobs is failing under Slurm within a minute after it
goes for execution, without writing anything, and I am unable to figure out the
reason.
The same job works fine without slurm on the same node.
slurmctld.log
[2020-06-03T19
Corey Keasling writes:
> The documentation only refers to GrpGRESRunMins, but I can't figure
> out what I might substitute for GRES that means Memory in the same way
> that substituting CPU means, well, CPUs. Google turns up precisely
> nothing for GrpMemRunMins... Am I missing something?
GrpT
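If the limit being discussed is GrpTRESRunMins (my assumption), memory can be expressed as a TRES next to cpu. A sketch with sacctmgr, where the QOS name normal and the limit values are placeholders and mem is counted in MB-minutes:

    sacctmgr modify qos normal set GrpTRESRunMins=cpu=1000000,mem=4000000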
Hello Guys,
We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series), and
some of the GPUs seem to have had trouble with the jobs attached to them. To
investigate whether the trouble happened on the same GPUs, I'd like to get the
GPU indices of the completed jobs.
In my understanding `scontrol show job`
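As far as I know the accounting database does not keep the device indices, so one workaround is to record them from inside the job itself and read them back from the job's output later. A sketch, assuming SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES are set by your gres/gpu configuration:

    # early in the batch script: log which GPU devices this run was given
    echo "job=$SLURM_JOB_ID node=$SLURMD_NODENAME gpus=$SLURM_JOB_GPUS cuda=$CUDA_VISIBLE_DEVICES"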