Re: Deadlock in SafetyNetCloseableRegistry?

bupt_ljy Tue, 11 Sep 2018 08:50:58 -0700

Hi all,
Sorry for attaching this again. The Flink version is 1.6, and the deadlock 
stack is:



"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00000000aefacb70 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked 0x00000000aefacb70 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)


   This thread is created in the AsyncCheckpointRunnable class and gets stuck, so 
the next checkpoint can’t acquire the lock in the performCheckpoint method and 
times out. How can I avoid this?
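
   For reference, here is a minimal standalone sketch (plain JDK, not Flink 
code) of the thread state shown in the stack above: a daemon thread parked in 
ReferenceQueue.remove() on a queue that has nothing enqueued. The class name 
ReaperStateSketch, the method observeReaperState and the thread name 
SketchReaperThread are made up for illustration, and I am only assuming this 
mirrors what CloseableReaperThread does at the frame in the dump.

```java
import java.lang.ref.ReferenceQueue;

public class ReaperStateSketch {

    // Starts a daemon thread that blocks in ReferenceQueue.remove() and
    // returns the thread state observed once it has parked (or the last
    // state seen if it did not park within about five seconds).
    static Thread.State observeReaperState() throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Thread reaper = new Thread(() -> {
            try {
                // Blocks until some reference is enqueued; nothing ever is here.
                queue.remove();
            } catch (InterruptedException e) {
                // Interrupted by the caller below; just exit.
            }
        }, "SketchReaperThread"); // hypothetical name
        reaper.setDaemon(true);
        reaper.start();

        // Poll until the thread is parked inside remove().
        Thread.State state = reaper.getState();
        long deadline = System.currentTimeMillis() + 5000;
        while (state != Thread.State.WAITING
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
            state = reaper.getState();
        }

        reaper.interrupt(); // let the sketch thread finish
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeReaperState());
    }
}
```

   Running main prints the state observed for the sketch thread, which can be 
compared with the "java.lang.Thread.State" line in the dump.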


   Best, Jiayi Liao


Original Message
Sender: bupt_ljy bupt_...@163.com
Recipient: user u...@flink.apache.org
Date: Tuesday, Sep 11, 2018 22:22
Subject: Deadlock in SafetyNetCloseableRegistry?


Hi all,
 I started a Flink program that runs on YARN. At first it couldn’t acquire 
enough resources, so this exception was thrown:
“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms. Slots 
required: 16, slots allocated: 7”.
 Then the JobManager automatically restarted, but it can no longer complete 
checkpoints: they fail with “expired before completing”. All the TaskManagers 
are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, 
which may be why the whole TaskManager is blocked. Here is the TaskManager’s 
stack:
 
 Best, Jiayi Liao
