Hi,

I have a federation of two clusters, 'merlin5' and 'merlin6'. For some reason,
two jobs are stuck in a strange state: one is pending with reason FedJobLock
and the other with reason Priority. Neither job ever gets allocated, and I am
not able to cancel them:

         134225868       gpu     bash bliven_s PD       0:00      1 (FedJobLock)
         134225867       gpu     bash bliven_s PD       0:00      1 (Priority)

I tried to cancel the jobs, without success. The Slurm server logs show the
following:

Merlin5:

[2019-07-02T14:20:14.252] backfill test for JobId=134225868 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.293] backfill: JobId=134225868 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.293] backfill: planned start of JobId=134225868 failed: Job locked by another sibling
[2019-07-02T14:20:14.293] JobId=134225868 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-01 in partition gpu
[2019-07-02T14:20:14.294] backfill test for JobId=134225867 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.374] backfill: JobId=134225867 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.374] backfill: planned start of JobId=134225867 failed: Job locked by another sibling
[2019-07-02T14:20:14.374] JobId=134225867 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-04 in partition gpu
[2019-07-02T14:20:14.374] backfill: reached end of job queue
[2019-07-02T14:20:14.374] backfill: completed testing 2(2) jobs, usec=122038
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225868 uid 0 routed to merlin6
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225867 uid 0 routed to merlin6

Merlin6:

[2019-07-02T14:20:21.755] backfill: beginning
[2019-07-02T14:20:21.756] backfill: no jobs to backfill
[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list
[2019-07-02T14:20:51.756] backfill: beginning
[2019-07-02T14:20:51.756] backfill: no jobs to backfill
[2019-07-02T14:21:09.721] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.537] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.578] error: Didn't find JobId=134225867 in fed_job_list
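
To see how often each job triggers this error, the merlin6 excerpt above can
be tallied with a few lines of Python. This is only a minimal sketch: the
function name count_fed_misses and the regular expression are illustrative
(they are not part of Slurm), and it assumes the log lines look exactly like
the ones quoted above.

```python
import re
from collections import Counter

# Matches the slurmctld error quoted in the merlin6 log excerpt.
# Assumption: the message format is exactly as shown in that excerpt.
FED_MISS = re.compile(r"error: Didn't find JobId=(\d+) in fed_job_list")


def count_fed_misses(lines):
    """Return a Counter mapping job ID (as a string) to the number of misses."""
    counts = Counter()
    for line in lines:
        match = FED_MISS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The merlin6 log lines from the excerpt, copied verbatim.
sample = [
    "[2019-07-02T14:20:21.755] backfill: beginning",
    "[2019-07-02T14:20:21.756] backfill: no jobs to backfill",
    "[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list",
    "[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list",
    "[2019-07-02T14:20:51.756] backfill: beginning",
    "[2019-07-02T14:20:51.756] backfill: no jobs to backfill",
    "[2019-07-02T14:21:09.721] error: Didn't find JobId=134225868 in fed_job_list",
    "[2019-07-02T14:21:14.537] error: Didn't find JobId=134225868 in fed_job_list",
    "[2019-07-02T14:21:14.578] error: Didn't find JobId=134225867 in fed_job_list",
]

for job_id, misses in sorted(count_fed_misses(sample).items()):
    print(job_id, misses)
```

In practice the same function could be fed the lines of the real slurmctld
log file instead of the hard-coded sample.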

Meanwhile, the accounting server reports:

  bliven_s  134225867  bash  gpu  PENDING  Partition+  Unknown  Unknown  00:00:00  1  1  00:00:00  None assigned  merlin5
  bliven_s  134225868  bash  gpu  PENDING  Partition+  Unknown  Unknown  00:00:00  1  1  00:00:00  None assigned  merlin5


Any idea how to fix this, and what could have triggered it?

Thanks a lot,
Marc
_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland

Telephone: +41 56 310 46 67
E-Mail: marc.cau...@psi.ch
