Hi, I have a federation of two clusters, 'merlin5' and 'merlin6'. For some reason, two jobs are stuck in a strange state: one is in FedJobLock and the other in Priority (it never gets allocated and I am not able to cancel it):
134225868 gpu bash bliven_s PD 0:00 1 (FedJobLock)
134225867 gpu bash bliven_s PD 0:00 1 (Priority)

I tried to cancel the jobs, but it has no effect. From the Slurm server logs I see the following:

Merlin5:

[2019-07-02T14:20:14.252] backfill test for JobId=134225868 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.293] backfill: JobId=134225868 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.293] backfill: planned start of JobId=134225868 failed: Job locked by another sibling
[2019-07-02T14:20:14.293] JobId=134225868 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-01 in partition gpu
[2019-07-02T14:20:14.294] backfill test for JobId=134225867 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.374] backfill: JobId=134225867 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.374] backfill: planned start of JobId=134225867 failed: Job locked by another sibling
[2019-07-02T14:20:14.374] JobId=134225867 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-04 in partition gpu
[2019-07-02T14:20:14.374] backfill: reached end of job queue
[2019-07-02T14:20:14.374] backfill: completed testing 2(2) jobs, usec=122038
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225868 uid 0 routed to merlin6
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225867 uid 0 routed to merlin6

Merlin6:

[2019-07-02T14:20:21.755] backfill: beginning
[2019-07-02T14:20:21.756] backfill: no jobs to backfill
[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list
[2019-07-02T14:20:51.756] backfill: beginning
[2019-07-02T14:20:51.756] backfill: no jobs to backfill
[2019-07-02T14:21:09.721] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.537] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.578] error: Didn't find JobId=134225867 in fed_job_list

While from the accounting server:

bliven_s 134225867 bash gpu PENDING Partition+ Unknown Unknown 00:00:00 1 1 00:00:00 None assigned merlin5
bliven_s 134225868 bash gpu PENDING Partition+ Unknown Unknown 00:00:00 1 1 00:00:00 None assigned merlin5

Any idea how to fix this, and what could have triggered it?

Thanks a lot,
Marc
_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland
Telephone: +41 56 310 46 67
E-Mail: marc.cau...@psi.ch