On 5/11/20 9:52 am, Alastair Neil wrote:
[2020-05-10T00:26:05.202] [533900.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
This caught my eye. Googling for it found a single instance, from 2019,
again on the list, about jobs on a node mysteriously dying.
The resolution was (cou
Hmm, works for me. Maybe they added that in more recent versions of slurm.
I'm using version 18+
On Wed, May 13, 2020 at 5:12 PM Alastair Neil wrote:
>
> invalid field requested: "reason"
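For anyone else hitting "invalid field requested" on 18.08: sacct can list the
format fields the local build actually accepts, and State/ExitCode are a
workable fallback when "reason" isn't among them. A sketch (field availability
varies by Slurm version):

    # Show which --format/-o field names this sacct build understands
    sacct --helpformat

    # Fallback without the reason column; ExitCode prints as
    # return_code:signal, so a batch script killed off with SIGKILL
    # shows a 9 after the colon.
    sacct -o jobid,elapsed,state,exitcode -j 533900,533902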
invalid field requested: "reason"
On Tue, 12 May 2020 at 16:47, Steven Dick wrote:
> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
What do you get from
sacct -o jobid,elapsed,reason,exit -j 533900,533902
On Tue, May 12, 2020 at 4:12 PM Alastair Neil wrote:
The log is continuous and has all the messages logged by slurmd on the
node for all the jobs mentioned; below are the entries from the slurmctld
log:

[2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=533898 uid 1224431221
[2020-05-10T00:26:03.098] email msg to sshr...@maso
I see one job cancelled and two jobs failed.
Your slurmd log is incomplete -- it doesn't show the two failed jobs
exiting/failing, so the real error is not here.
It might also be helpful to look through slurmctld's log starting from
when the first job was canceled, looking at any messages mentioni
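A sketch of that slurmctld search, with the controller log path as an
assumption (use whatever SlurmctldLogFile in slurm.conf points at):

    # Every controller message naming the cancelled job or the two that died
    grep -E '533898|533900|533902' /var/log/slurmctld.log

    # And the full minute around the scancel, in case the relevant message
    # (node event, epilog failure, ...) does not name a job ID at all
    grep '2020-05-10T00:26' /var/log/slurmctld.log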
Overzealous node cleanup epilog script?
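To make that question concrete: a minimal sketch, assuming a node Epilog is
configured in slurm.conf, of how a cleanup epilog produces exactly this
symptom, plus a narrower variant. The script below is hypothetical, not taken
from this thread, and it assumes the output of hostname -s matches the node's
Slurm NodeName.

    #!/bin/bash
    # Hypothetical node-cleanup Epilog, for illustration only.
    #
    # The dangerous pattern: slurmd runs the Epilog when a job ends, and a
    # blanket
    #     pkill -9 -u "$SLURM_JOB_USER"
    # kills every process that user owns on the node, including the batch
    # steps of their other jobs still running there.

    # Narrower variant: only sweep the user's processes when they have no
    # other jobs left on this node. SLURM_JOB_USER and SLURM_JOB_ID are set
    # by slurmd in the Epilog environment.
    other_jobs=$(squeue --noheader --format=%i \
                        --user="$SLURM_JOB_USER" \
                        --nodelist="$(hostname -s)" \
                 | grep -vc "^${SLURM_JOB_ID}$")

    if [ "$other_jobs" -eq 0 ]; then
        pkill -9 -u "$SLURM_JOB_USER"
    fi

    exit 0

Even the narrower variant is racy; on clusters using proctrack/cgroup it is
usually safer to let Slurm's own step cleanup deal with stray processes than
to pkill a whole user.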
> On 11 May 2020, at 17:56, Alastair Neil wrote:
Hi there,
We are using slurm 18.08 and had a weird occurrence over the weekend. A
user canceled one of his jobs using scancel, and two additional jobs of the
user running on the same node were killed concurrently. The jobs had no
dependency, but they were all allocated 1 GPU. I am curious to know
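For completeness, one way to confirm after the fact that the three jobs really
did share the node and each had a single GPU allocated (job IDs taken from the
logs above; seeing gres/gpu in AllocTRES assumes gres/gpu is listed in
AccountingStorageTRES):

    # NodeList shows where each job ran; AllocTRES should include gres/gpu=1
    sacct -j 533898,533900,533902 -o jobid,nodelist,alloctres%40,state,exitcode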