Before this message, there is the following message in the log:

2021-08-12 23:02:58.015 [Canceler/Interrupts for Source: MASKED]) (1/1)#29103' did not react to cancelling signal for 30 seconds, but is stuck in method:
 java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
 java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
 java.base@11.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
 app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
 app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:341)
 app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
 app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
 app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
 app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
 app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
 app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
 java.base@11.0.11/java.lang.Thread.run(Unknown Source)
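For reference, the 30 seconds in this warning and the 180 seconds in the watchdog error quoted below line up with the defaults of two TaskManager options, task.cancellation.interval and task.cancellation.timeout. Below is a minimal flink-conf.yaml sketch, assuming those defaults are what is in effect here (the thread does not show the actual configuration):

```yaml
# Assumed values: these are the documented defaults, not taken from the poster's config.
# Interval between repeated interrupts sent to a task that is being cancelled (ms).
task.cancellation.interval: 30000
# Time after which a task that has still not exited causes a fatal TaskManager error (ms).
# Per the Flink configuration docs, a value of 0 deactivates the watchdog.
task.cancellation.timeout: 180000
```

Raising the timeout would only postpone the TaskManager shutdown; the task thread that does not react to cancellation would still be stuck.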
On Tue, Aug 17, 2021 at 9:22 AM Abhishek Rai <abhis...@netspring.io> wrote:

> Thanks Yangze, indeed, I see the following in the log about 10s before the final crash (masked some sensitive data using `MASKED`):
>
> 2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN org.apache.flink.runtime.taskmanager.Task - Task 'MASKED' did not react to cancelling signal for 30 seconds, but is stuck in method:
> java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
> java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
> app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.base@11.0.11/java.lang.Thread.run(Unknown Source)
>
> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
>     at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
>     at java.base/java.lang.Thread.run(Unknown Source)
>
> On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <karma...@gmail.com> wrote:
>
>> Hi, Abhishek,
>>
>> Do you see something like "Fatal error occurred while executing the TaskManager" in your log or would you like to provide the whole task manager log?
>>
>> Best,
>> Yangze Guo
>>
>> On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io> wrote:
>> >
>> > Hello,
>> >
>> > In our production environment, running Flink 1.13 (Scala 2.11), where Flink has been working without issues with a dozen or so jobs running for a while, Flink taskmanager started crash looping with a period of ~4 minutes per crash. The stack trace is not very informative, therefore reaching out for help, see below.
>> >
>> > The only other thing that's unusual is that due to what might be a product issue (custom job code running on Flink), some or all of our tasks are also in a crash loop. Still, I wasn't expecting taskmanager itself to die. Does taskmanager have some built in feature to crash if all/most tasks are crashing?
>> >
>> > 2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Terminating TaskManagerRunner with exit code 1.
>> > org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
>> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
>> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
>> >     at java.base/java.security.AccessController.doPrivileged(Native Method)
>> >     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>> >     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
>> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
>> >     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
>> > Caused by: java.util.concurrent.TimeoutException: null
>> >     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
>> >     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>> >     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
>> >     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> >     at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>> >     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>> >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >     at java.base/java.lang.Thread.run(Unknown Source)
>> > 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
>> >
>> > Thanks very much!
>> >
>> > Abhishek
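
A note on the 2021-08-16 warning quoted above: its trace shows the task thread parked in CompletableFuture.join(), called from StreamTask.cleanUpInvoke. join() is not interruptible, so the periodic interrupts from the Canceler thread would not be expected to release it. The following is a minimal standalone Java sketch of that JDK behaviour, not Flink code; the class, method, and thread names are hypothetical and chosen only for illustration:

```java
import java.util.concurrent.CompletableFuture;

class JoinIgnoresInterrupt {

    // Hypothetical helper: blocks a worker thread in join(), interrupts it once,
    // and reports whether the worker is still alive after waitMillis.
    public static boolean stillBlockedAfterInterrupt(CompletableFuture<Void> future, long waitMillis) {
        Thread worker = new Thread(future::join, "hypothetical-task-thread");
        worker.setDaemon(true); // do not keep the JVM alive if the future never completes
        worker.start();
        try {
            Thread.sleep(100);       // give the worker time to park inside join()
            worker.interrupt();      // roughly what a canceler thread does
            worker.join(waitMillis); // wait for the worker to exit, up to waitMillis
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo thread was interrupted", e);
        }
        return worker.isAlive();
    }

    public static void main(String[] args) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        System.out.println("still blocked after interrupt: "
                + stillBlockedAfterInterrupt(future, 500));
        future.complete(null); // completing the future is what releases join()
    }
}
```

If this matches what is happening in the job, the thing to look for is whichever future the cleanup path is waiting on and why it never completes, rather than the interrupts themselves.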