That would explain why the job is seen as complete. The step did time out, but
it appears there is no timelimit set on the step. It should be inheriting the
allocation timelimit, no?
sacct -S 071417 -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j
1695151
               JobID                State  Timelimit    Elapsed ExitCode
-------------------- -------------------- ---------- ---------- --------
             1695151            COMPLETED   05:00:00   00:30:36      0:0
       1695151.batch            COMPLETED              00:30:36      0:0
      1695151.extern            COMPLETED              00:30:36      0:0
           1695151.0              TIMEOUT              00:30:25      0:0
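To spot other jobs hit by the same mismatch, the check can be scripted. The following is only a minimal sketch: it assumes the records come from sacct's pipe-delimited output (the --parsable2 option), and the sample text is hand-converted from the sacct table above rather than captured from a real run. The function name timed_out_steps is made up for illustration.

```python
import csv
import io

# Hand-converted from the sacct table above into the pipe-delimited layout
# that `sacct --parsable2` produces. This is NOT captured output.
SAMPLE = """\
JobID|State|Timelimit|Elapsed|ExitCode
1695151|COMPLETED|05:00:00|00:30:36|0:0
1695151.batch|COMPLETED||00:30:36|0:0
1695151.extern|COMPLETED||00:30:36|0:0
1695151.0|TIMEOUT||00:30:25|0:0
"""


def timed_out_steps(text):
    """Return {job_id: [step ids]} for TIMEOUT steps whose job is not TIMEOUT."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="|"))
    # Allocation records have no ".<step>" suffix in the JobID column.
    job_state = {r["JobID"]: r["State"] for r in rows if "." not in r["JobID"]}
    flagged = {}
    for r in rows:
        job_id, dot, _step = r["JobID"].partition(".")
        if dot and r["State"] == "TIMEOUT" and job_state.get(job_id) != "TIMEOUT":
            flagged.setdefault(job_id, []).append(r["JobID"])
    return flagged


print(timed_out_steps(SAMPLE))
```

Here only step 1695151.0 is flagged, since the allocation itself is recorded as COMPLETED.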
-brian
On 08/29/2017 02:33 PM, Brian W. Johanson wrote:
A user noticed that their job was cancelled earlier than expected.
The requested timelimit was not honored. The partition does have a default
timelimit of 30:00; could that have been enforced instead?
We are running slurm 17.02.5. I dug up an old ticket that contained the
same issue while we were running 15.08: the job asked for 12:00:00 and was
killed at 1:00:00. I don't recall a change to the default timelimit (it should
have been 30m then also), so I might be wrong in assuming this is enforcing the
partition default.
The job had a timelimit set to 5:00:00, but it was cancelled after 30:00 due to
time limit. The partition has a default timelimit of 30:00. It is recorded in
the db as COMPLETED, not TIMEOUT.
Any idea on what would cause this?
$ sacct -S 071417 -X -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode
-j 1695151
               JobID                State  Timelimit    Elapsed ExitCode
-------------------- -------------------- ---------- ---------- --------
             1695151            COMPLETED   05:00:00   00:30:36      0:0
slurmctld log
2017-08-29T09:45:30.868553-04:00 host1 slurmctld 15006 - -
_slurm_rpc_submit_batch_job JobId=1695151 usec=1945
2017-08-29T09:46:52.059115-04:00 host1 slurmctld 15006 - - email msg to
[email protected]: SLURM Job_id=1695151 Name=job1 Began, Queued time 00:01:22
2017-08-29T09:46:52.059478-04:00 host1 slurmctld 15006 - - sched: Allocate
JobID=1695151 NodeList=node1 #CPUs=80 Partition=default
2017-08-29T09:46:52.110696-04:00 host1 slurmctld 15006 - -
prolog_running_decr: Configuration for job 169515 is complete
2017-08-29T09:47:03.624387-04:00 host1 slurmctld 15006 - -
_slurm_rpc_update_job complete JobId=1695151 uid=1 usec=1469
2017-08-29T10:17:28.079554-04:00 host1 slurmctld 15006 - -
check_job_step_time_limit: job 1695151 step 0 has timed out (30)
2017-08-29T10:17:28.441852-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x1 NodeCnt=1 WEXITSTATUS 0
2017-08-29T10:17:28.442031-04:00 host1 slurmctld 15006 - - email msg to
[email protected]: SLURM Job_id=1695151 Name=job1 Ended, Run time 00:30:36,
COMPLETED, ExitCode 0
2017-08-29T10:17:28.442308-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x8003 NodeCnt=1 done
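As a quick cross-check of the timestamps above, the gap between the "sched: Allocate" entry and the "has timed out" entry can be computed directly. This is only a sketch: the two timestamps are copied from the slurmctld lines above, and reading the "(30)" in the timeout message as a step limit in minutes is an assumption on my part, not something the log states.

```python
from datetime import datetime

# Timestamps copied from the slurmctld log lines above.
ALLOCATED = "2017-08-29T09:46:52.059478-04:00"       # sched: Allocate
STEP_TIMED_OUT = "2017-08-29T10:17:28.079554-04:00"  # step 0 has timed out


def elapsed_seconds(start, end):
    """Seconds between two ISO 8601 timestamps that carry a UTC offset."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


secs = elapsed_seconds(ALLOCATED, STEP_TIMED_OUT)
minutes, seconds = divmod(int(secs), 60)
print(f"{minutes:02d}:{seconds:02d}")
```

The result agrees with the 00:30:36 Elapsed that sacct reports for the job, which is consistent with the step being cut off at roughly 30 minutes rather than at the requested 5:00:00.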
compute node slurmd log
[2017-08-29T09:47:03.398] _run_prolog: prolog with lock for job 1695151 ran
for 11 seconds
[2017-08-29T09:47:03.398] Launching batch job 1695151 for UID 1
[2017-08-29T09:47:03.538] [1695151] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.538] [1695151] task/cgroup:
/slurm/uid_1/job_1695151/step_batch: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151/step_extern: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.704] launch task 1695151.0 request from 1.15885@localhost
(port 45283)
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup:
/slurm/uid_1/job_1695151/step_0: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T10:17:28.188] [1695151.0] error: *** STEP 1695151.0 ON l020
CANCELLED AT 2017-08-29T10:17:28 DUE TO TIME LIMIT ***
[2017-08-29T10:17:28.413] [1695151.0] done with job
[2017-08-29T10:17:28.439] [1695151] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:0 status 0
[2017-08-29T10:17:28.442] [1695151] done with job
[2017-08-29T10:17:28.477] [1695151.4294967295] done with job