Thanks Yangze, indeed, I see the following in the log about 10s before the final crash (masked some sensitive data using `MASKED`):

2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN org.apache.flink.runtime.taskmanager.Task - Task 'MASKED' did not react to cancelling signal for 30 seconds, but is stuck in method:
 java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
 java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
 java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
 java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
 java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
 java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
 app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
 app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
 app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
 app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
 app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
 java.base@11.0.11/java.lang.Thread.run(Unknown Source)
2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
    at java.base/java.lang.Thread.run(Unknown Source)

On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <karma...@gmail.com> wrote:

> Hi, Abhishek,
>
> Do you see something like "Fatal error occurred while executing the
> TaskManager" in your log or would you like to provide the whole task
> manager log?
>
> Best,
> Yangze Guo
>
> On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io> wrote:
> >
> > Hello,
> >
> > In our production environment, running Flink 1.13 (Scala 2.11), where Flink has been working without issues with a dozen or so jobs running for a while, Flink taskmanager started crash looping with a period of ~4 minutes per crash. The stack trace is not very informative, therefore reaching out for help, see below.
> >
> > The only other thing that's unusual is that due to what might be a product issue (custom job code running on Flink), some or all of our tasks are also in a crash loop. Still, I wasn't expecting taskmanager itself to die. Does taskmanager have some built in feature to crash if all/most tasks are crashing?
> >
> > 2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Terminating TaskManagerRunner with exit code 1.
> > org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
> >     at java.base/java.security.AccessController.doPrivileged(Native Method)
> >     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> >     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
> > Caused by: java.util.concurrent.TimeoutException: null
> >     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
> >     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
> >     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
> >     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >     at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> >     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >     at java.base/java.lang.Thread.run(Unknown Source)
> > 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
> >
> >
> > Thanks very much!
> >
> > Abhishek
>
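
Side note on the numbers in the log at the top of this message: the 30-second warning and the 180-second fatal error look like they correspond to Flink's task cancellation watchdog options, `task.cancellation.interval` and `task.cancellation.timeout` (both in milliseconds). A sketch of the relevant `flink-conf.yaml` entries, assuming the defaults are in effect here (I have not checked what this cluster actually sets):

```yaml
# Assumed defaults, not copied from our cluster's configuration.

# Interval between successive attempts to cancel (interrupt) a task, in ms.
# Would explain the "did not react to cancelling signal for 30 seconds" warning.
task.cancellation.interval: 30000

# Time after which a task that is still running after cancellation
# triggers a fatal TaskManager error, in ms.
# Would explain "Task did not exit gracefully within 180 + seconds".
# A value of 0 deactivates the watchdog.
task.cancellation.timeout: 180000
```

If that reading is right, raising or disabling the timeout would only hide the symptom; the task stuck in `CompletableFuture.join()` inside `StreamTask.cleanUpInvoke` is still what needs explaining.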