I'm just curious as to what causes a user to decide that a given node has
an issue?
If a node is healthy in all respects, why would a user decide not to use
the node?
Not enough free TMPDIR space, a GPU starts having memory errors, or a machine
with a temporary issue that slurm hea
*Sent:* Thursday, June 4, 2020 4:16 PM
*To:* Slurm User Community List
*Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
> scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList
> scontrol requeue ${SLURM_JOB_ID}
> sleep 10
> fi
situations where this would be a good solution are
rare!)
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Rodrigo Santibáñez
Sent: Thursday, June 4, 2020 4:16 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Change ExcNodeList on a running job
Hello,
Jobs can be requeued if something goes wrong, and the failed node can be
excluded by the controller.
*--requeue* Specifies that the batch job should be eligible for requeuing.
The job may be requeued explicitly by a system administrator, after node
failure, or upon preemption by a higher
Hello
We are moving from Univa (SGE) to Slurm, and one of our users has jobs that,
if they detect a failure on the current machine, add that machine to their
exclude list and requeue themselves. The user wants to emulate that behavior
in Slurm.
It seems like "scontrol update job ${SLURM_JO
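
For reference, the self-exclude-and-requeue pattern discussed in this thread could be sketched roughly as below. This is only an illustration, not a tested site recipe. The helper names `add_to_exclude_list` and `node_is_healthy` are hypothetical, and the health check is a placeholder. The sketch assumes that `scontrol show job` prints a field of the form `ExcNodeList=<list>`, and that the site allows a user to update and requeue their own running job. The `scontrol update job ... ExcNodeList=...` and `scontrol requeue` calls are the ones quoted in the thread.

```shell
#!/bin/bash
#SBATCH --requeue

# Hypothetical helper: add a node to a comma-separated exclude list,
# leaving the list unchanged if the node is already on it.
add_to_exclude_list() {
    local current="$1" node="$2"
    if [ -z "$current" ] || [ "$current" = "(null)" ]; then
        printf '%s\n' "$node"
        return
    fi
    case ",$current," in
        *",$node,"*) printf '%s\n' "$current" ;;
        *)           printf '%s,%s\n' "$current" "$node" ;;
    esac
}

# Placeholder health check (assumption): replace with whatever test the job
# really performs. Here it only checks that the temporary directory is writable.
node_is_healthy() {
    [ -w "${TMPDIR:-/tmp}" ]
}

# Only act when running inside a Slurm job and the check fails.
if [ -n "${SLURM_JOB_ID:-}" ] && ! node_is_healthy; then
    bad_node="${SLURMD_NODENAME:-$(hostname -s)}"
    # Assumes the job record contains a field like ExcNodeList=node01,node02.
    current=$(scontrol show job "$SLURM_JOB_ID" |
        sed -n 's/.*ExcNodeList=\([^ ]*\).*/\1/p')
    NewExcNodeList=$(add_to_exclude_list "$current" "$bad_node")
    scontrol update job "$SLURM_JOB_ID" ExcNodeList="$NewExcNodeList"
    scontrol requeue "$SLURM_JOB_ID"
    sleep 10
fi
```

The helper treats the list as plain comma-separated host names. If Slurm reports the list in compressed form (for example `node[01-03]`), the duplicate check may miss an entry that is already covered, which only results in a redundant name being appended.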