Hello,

Our production environment runs Flink 1.13 (Scala 2.11) and has been
working without issues for a while, with a dozen or so jobs running. The
Flink TaskManager has now started crash looping, with a period of ~4
minutes per crash. The stack trace (see below) is not very informative, so
I am reaching out for help.

The only other unusual thing is that, due to what might be a product issue
(custom job code running on Flink), some or all of our tasks are also in a
crash loop. Still, I wasn't expecting the TaskManager itself to die. Does
the TaskManager have a built-in feature that makes it exit if all or most
tasks are crashing?

2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
  at java.base/java.security.AccessController.doPrivileged(Native Method)
  at java.base/javax.security.auth.Subject.doAs(Unknown Source)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
  at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
Caused by: java.util.concurrent.TimeoutException: null
  at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
  at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
  at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
  at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.base/java.lang.Thread.run(Unknown Source)
2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.


Thanks very much!

Abhishek
