Hi all,

Sorry for attaching this again. The Flink version is 1.6 and the deadlock stack is:

"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00000000aefacb70 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked 0x00000000aefacb70 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)

This thread is created in the AsyncCheckpointRunnable class and gets stuck, so the next checkpoint can't acquire the lock in the performCheckpoint method and times out. How can I avoid this?

Best,
Jiayi Liao

Original Message
Sender: bupt_ljybupt_...@163.com
Recipient: useru...@flink.apache.org
Date: Tuesday, Sep 11, 2018 22:22
Subject: Deadlock in SafetyNetCloseableRegistry?

Hi all,

I started a Flink program and it runs on YARN. At first it couldn't acquire enough resources, so this was thrown:

"org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7".

Then the JobManager automatically restarted, but it fails to trigger checkpoints anymore because they "expired before completing". All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry; maybe that's why the whole TaskManager is blocked. Here is the TaskManager's stack:

Best,
Jiayi Liao
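
P.S. To help with reading the CloseableReaperThread frame in the trace, here is a minimal sketch, assuming plain JDK classes only (no Flink code), that puts a daemon thread into the same state: blocked in ReferenceQueue.remove() and reported as WAITING. The class name ReaperStateDemo and the thread name DemoReaperThread are hypothetical, made up for illustration. It only shows what such a thread looks like in a dump; it does not reproduce the checkpoint timeout itself.

```java
import java.lang.ref.ReferenceQueue;

public class ReaperStateDemo {

    // Starts a daemon thread that blocks in ReferenceQueue.remove() and
    // returns the thread state observed once it has parked (or after 5 s).
    static Thread.State observeIdleReaperState() {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Thread reaper = new Thread(() -> {
            try {
                // Blocks until some reference is enqueued; none ever is here.
                queue.remove();
            } catch (InterruptedException e) {
                // Interrupted by the caller below: just exit.
            }
        }, "DemoReaperThread");
        reaper.setDaemon(true);
        reaper.start();

        // Poll until the thread has actually reached the blocking call.
        long deadline = System.currentTimeMillis() + 5000;
        while (reaper.getState() != Thread.State.WAITING
                && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        Thread.State state = reaper.getState();
        reaper.interrupt(); // let the demo thread terminate
        return state;
    }

    public static void main(String[] args) {
        System.out.println(observeIdleReaperState());
    }
}
```

Taking a thread dump (for example with jstack) while the demo thread is parked should show a frame in java.lang.ref.ReferenceQueue.remove, similar to the one above.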