Re: [slurm-users] additional jobs killed by scancel.

2020-05-13 Thread Christopher Samuel
On 5/11/20 9:52 am, Alastair Neil wrote:
> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9

This caught my eye. Googling for it found a single instance, from 2019 on the list, again about jobs on a node mysteriously dying. The resolution was (cou...
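
One way to see whether other jobs on that node ended the same way is to look for non-zero completion statuses in the node's slurmd log. A rough sketch only; the log path is site-specific (SlurmdLogFile in slurm.conf), and it assumes a clean batch-script exit is logged as "status 0":

    # batch-script completions on this node that did not report status 0
    grep 'REQUEST_COMPLETE_BATCH_SCRIPT' /var/log/slurm/slurmd.log | grep -v 'status 0'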

Re: [slurm-users] additional jobs killed by scancel.

2020-05-13 Thread Steven Dick
Hmm, works for me. Maybe they added that in more recent versions of slurm. I'm using version 18+.

On Wed, May 13, 2020 at 5:12 PM Alastair Neil wrote:
>
> invalid field requested: "reason"
>
> On Tue, 12 May 2020 at 16:47, Steven Dick wrote:
>>
>> What do you get from
>>
>> sacct -o jobid,elapse...
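
A quick way to check which format fields the installed sacct accepts (the field list varies by Slurm release, which is presumably why "reason" works for some installs and not others):

    sacct --version                        # which Slurm release this sacct came from
    sacct --helpformat                     # list every field name this sacct will accept
    sacct --helpformat | grep -iw reason   # is a "Reason" field available at all?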

Re: [slurm-users] additional jobs killed by scancel.

2020-05-13 Thread Alastair Neil
invalid field requested: "reason"

On Tue, 12 May 2020 at 16:47, Steven Dick wrote:
> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
> On Tue, May 12, 2020 at 4:12 PM Alastair Neil wrote:
> >
> > The log is continuous and has all the messages logged by slurmd o...

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
What do you get from

sacct -o jobid,elapsed,reason,exit -j 533900,533902

On Tue, May 12, 2020 at 4:12 PM Alastair Neil wrote:
>
> The log is continuous and has all the messages logged by slurmd on the node
> for all the jobs mentioned, below are the entries from the slurmctld log:
>> [2020-0...
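
A lightly expanded version of that query, adding State and End so the jobs can be lined up against the slurmctld timestamps; as noted elsewhere in the thread, some sacct releases reject the "reason" field, so drop it if you get "invalid field requested":

    sacct -j 533900,533902 -o jobid,elapsed,reason,exitcode,state,end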

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Alastair Neil
The log is continuous and has all the messages logged by slurmd on the node for all the jobs mentioned; below are the entries from the slurmctld log:

> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=533898 uid 1224431221
> [2020-05-10T00:26:03.098] email msg to sshr...@maso...

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
I see one job cancelled and two jobs failed. Your slurmd log is incomplete -- it doesn't show the two failed jobs exiting/failing, so the real error is not here. It might also be helpful to look through slurmctld's log starting from when the first job was canceled, looking at any messages mentioni...
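
A minimal sketch of that search, assuming the default slurmctld log location (whatever SlurmctldLogFile points at in slurm.conf) and using node01 as a placeholder for the node the three jobs shared:

    # pull every slurmctld message that mentions the cancelled job, the two
    # failed jobs, or the node itself, starting around the cancel time
    grep -E '533898|533900|533902|node01' /var/log/slurm/slurmctld.log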

Re: [slurm-users] additional jobs killed by scancel.

2020-05-11 Thread Nathan Harper
Overzealous node cleanup epilog script?

> On 11 May 2020, at 17:56, Alastair Neil wrote:
>
> Hi there,
>
> We are using slurm 18.08 and had a weird occurrence over the weekend. A user
> canceled one of his jobs using scancel, and two additional jobs of the user
> running on the same nod...
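
If an epilog is the suspect, a quick way to see which prolog/epilog scripts the cluster is actually configured to run (scontrol reads the running configuration, so no file paths need to be guessed):

    scontrol show config | grep -i -E 'prolog|epilog'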

[slurm-users] additional jobs killed by scancel.

2020-05-11 Thread Alastair Neil
Hi there,

We are using slurm 18.08 and had a weird occurrence over the weekend. A user canceled one of his jobs using scancel, and two additional jobs of the user running on the same node were killed concurrently. The jobs had no dependency, but they were all allocated 1 GPU. I am curious to kno...
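
One way to confirm the three jobs only shared the node, and not some specific resource, is to compare what each was actually allocated. A sketch, assuming accounting storage is enabled; the job IDs are the ones quoted elsewhere in the thread:

    # node list and per-job TRES (cpu, mem, gres/gpu) for the cancelled job
    # and the two that died with it
    sacct -j 533898,533900,533902 -o JobID,NodeList,AllocTRES%60,State,End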