Zakelly Lan created FLINK-17645:
-----------------------------------

             Summary: REAPER_THREAD in SafetyNetCloseableRegistry start() failed, causing the repeated failover.
                 Key: FLINK-17645
                 URL: https://issues.apache.org/jira/browse/FLINK-17645
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.6.3
            Reporter: Zakelly Lan

I'm running a modified version of Flink and encountered the exception below when a task started:
{code:java}
2020-05-12 00:46:19,037 ERROR [***] org.apache.flink.runtime.taskmanager.Task - Encountered an unexpected exception
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:802)
	at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
	at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
	at java.lang.Thread.run(Thread.java:834)
2020-05-12 00:46:19,038 INFO [***] org.apache.flink.runtime.taskmanager.Task
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:802)
	at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
	at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
	at java.lang.Thread.run(Thread.java:834)
{code}
REAPER_THREAD.start() fails because of the OutOfMemoryError, but by then REAPER_THREAD has already been assigned and it is never reset to null. From then on, every SafetyNetCloseableRegistry initialization in this JVM throws an IllegalStateException:
{code:java}
java.lang.IllegalStateException
	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
	at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:71)
	at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
	at java.lang.Thread.run(Thread.java:834)
{code}
This may happen in very old versions of Flink as well.
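
To make the sequence described in this report easier to follow, here is a minimal, self-contained sketch of the failure mode. It is not Flink's code: the class `ReaperInitSketch`, its fields, and the `FailingThread` helper are hypothetical stand-ins, and the OutOfMemoryError is simulated by overriding `start()` rather than by actually exhausting native threads. The sketch assumes that the static field is assigned after the null check and before `start()` is called, which is what the line numbers in the stack traces (71 for the check, 73 for the start) suggest.

```java
public class ReaperInitSketch {
    // Hypothetical stand-ins for the static state described in the report.
    private static final Object LOCK = new Object();
    private static Thread reaperThread = null;
    private static int registryCount = 0;

    /** Simulates a thread whose start() fails the way the report describes. */
    static class FailingThread extends Thread {
        @Override
        public void start() {
            throw new OutOfMemoryError("unable to create new native thread (simulated)");
        }
    }

    /** Assumed shape of the initialization: check, assign, then start. */
    public static void initRegistry() {
        synchronized (LOCK) {
            if (registryCount == 0) {
                // Plays the role of the checkState(...) call in the stack trace.
                if (reaperThread != null) {
                    throw new IllegalStateException();
                }
                reaperThread = new FailingThread();
                // start() throws here: the field stays non-null and the
                // counter below is never incremented.
                reaperThread.start();
            }
            registryCount++;
        }
    }

    /** Returns the simple class name of whatever initRegistry() throws, or "none". */
    public static String tryInit() {
        try {
            initRegistry();
            return "none";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryInit()); // first attempt: OutOfMemoryError
        System.out.println(tryInit()); // every later attempt: IllegalStateException
    }
}
```

In this sketch the counter stays at 0 while the field stays non-null, so every later call re-enters the guarded branch and fails the null check. One possible mitigation, offered as an assumption rather than as the project's fix, would be to assign the field only after `start()` has returned, or to reset it to null when `start()` throws.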