Hi,
I have a federation of two clusters, 'merlin5' and 'merlin6'. For some reason,
two jobs are stuck in a strange state: one shows FedJobLock and the other shows
Priority. Neither ever gets allocated, and I am not able to cancel them:
134225868 gpu bash bliven_s PD 0:00 1 (FedJobLock)
134225867 gpu bash bliven_s PD 0:00 1 (Priority)
I tried to cancel the jobs, but it has no effect. In the Slurm server logs I
see the following:
Merlin5:
[2019-07-02T14:20:14.252] backfill test for JobId=134225868 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.293] backfill: JobId=134225868 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.293] backfill: planned start of JobId=134225868 failed: Job locked by another sibling
[2019-07-02T14:20:14.293] JobId=134225868 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-01 in partition gpu
[2019-07-02T14:20:14.294] backfill test for JobId=134225867 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.374] backfill: JobId=134225867 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.374] backfill: planned start of JobId=134225867 failed: Job locked by another sibling
[2019-07-02T14:20:14.374] JobId=134225867 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-04 in partition gpu
[2019-07-02T14:20:14.374] backfill: reached end of job queue
[2019-07-02T14:20:14.374] backfill: completed testing 2(2) jobs, usec=122038
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225868 uid 0 routed to merlin6
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225867 uid 0 routed to merlin6
Merlin6:
[2019-07-02T14:20:21.755] backfill: beginning
[2019-07-02T14:20:21.756] backfill: no jobs to backfill
[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list
[2019-07-02T14:20:51.756] backfill: beginning
[2019-07-02T14:20:51.756] backfill: no jobs to backfill
[2019-07-02T14:21:09.721] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.537] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.578] error: Didn't find JobId=134225867 in fed_job_list
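As a sanity check, the two log excerpts can be cross-referenced with a small script. This is only a sketch: the regular expressions and helper names are my own (they are not part of Slurm), and the log lines are copied from the excerpts above with the wrapped lines rejoined.

```python
import re

# Lines copied from the slurmctld log excerpts above (wrapped lines rejoined).
MERLIN5_LOG = """\
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225868 uid 0 routed to merlin6
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225867 uid 0 routed to merlin6
"""
MERLIN6_LOG = """\
[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list
"""

# Hypothetical patterns written for these two message formats only.
ROUTED = re.compile(r"REQUEST_KILL_JOB JobId=(\d+) .*routed to (\S+)")
MISSING = re.compile(r"Didn't find JobId=(\d+) in fed_job_list")


def routed_kills(log):
    """Map job ID -> cluster the kill request was forwarded to."""
    return {m.group(1): m.group(2) for m in ROUTED.finditer(log)}


def missing_jobs(log):
    """Set of job IDs the controller could not find in fed_job_list."""
    return set(MISSING.findall(log))


routed = routed_kills(MERLIN5_LOG)
missing = missing_jobs(MERLIN6_LOG)

# Jobs whose cancel request was forwarded to a cluster that does not know them.
stuck = sorted(job for job in routed if job in missing)
print(stuck)
```

Both job IDs show up in both sets: merlin5 forwards the cancel requests to merlin6, and merlin6 reports that it cannot find those jobs in its fed_job_list.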
Meanwhile, the accounting server shows:
bliven_s 134225867 bash gpu PENDING Partition+ Unknown Unknown 00:00:00 1 1 00:00:00 None assigned merlin5
bliven_s 134225868 bash gpu PENDING Partition+ Unknown Unknown 00:00:00 1 1 00:00:00 None assigned merlin5
Any idea how to fix this, and what could have triggered it?
Thanks a lot,
Marc
_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland
Telephone: +41 56 310 46 67
E-Mail: [email protected]