Hi Abhishek,

Do you see something like "Fatal error occurred while executing the TaskManager" in your log? Alternatively, could you share the whole TaskManager log?
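
In case it helps, here is a minimal sketch for searching a log for that message. It is only an illustration: `find_marker` is a hypothetical helper name rather than a Flink tool, and it assumes the log is plain text that can be read line by line.

```python
from typing import Iterable, List, Tuple

# The message quoted above; pass a different marker to search for something else.
FATAL_MARKER = "Fatal error occurred while executing the TaskManager"


def find_marker(lines: Iterable[str], marker: str = FATAL_MARKER) -> List[Tuple[int, str]]:
    """Return (1-based line number, line text) for every line containing marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]
```

To run it against a real file, open the log (the path depends on your deployment) and pass the file object in, for example `find_marker(open(path, encoding="utf-8", errors="replace"))`. An empty result means the message does not appear in that file.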

Best,
Yangze Guo

On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io> wrote:
>
> Hello,
>
> In our production environment, running Flink 1.13 (Scala 2.11), where Flink
> has been working without issues with a dozen or so jobs running for a while,
> Flink taskmanager started crash looping with a period of ~4 minutes per
> crash. The stack trace is not very informative, therefore reaching out for
> help, see below.
>
> The only other thing that's unusual is that due to what might be a product
> issue (custom job code running on Flink), some or all of our tasks are also
> in a crash loop. Still, I wasn't expecting taskmanager itself to die. Does
> taskmanager have some built in feature to crash if all/most tasks are
> crashing?
>
> 2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Terminating TaskManagerRunner with exit code 1.
> org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
> Caused by: java.util.concurrent.TimeoutException: null
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
>     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>     at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
>
> Thanks very much!
>
> Abhishek