> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error 
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully 
> within 180 + seconds.

It seems that Task 'MASKED' could not be terminated within the 180-second
timeout, and I think that is the root cause of the TaskManager's
termination. We need to find out why Task 'MASKED' was canceled in the
first place. Can you provide some logs related to it? You could search
for "CANCELING" in the JobManager and TaskManager logs.
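
If grepping by hand is awkward, here is a minimal sketch of that search in Python. The function name `find_matches` and the idea of passing log file paths on the command line are my own assumptions for illustration, not anything Flink provides; adjust the paths to wherever your JobManager and TaskManager logs actually live.

```python
import sys
from typing import Iterable, List, Tuple


def find_matches(lines: Iterable[str], needle: str = "CANCELING") -> List[Tuple[int, str]]:
    """Return (1-based line number, line text) for every line containing needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


def main(paths: List[str]) -> None:
    # Each path is assumed to be a plain-text log file (hypothetical paths,
    # supplied by the caller on the command line).
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in find_matches(handle):
                print(f"{path}:{number}: {line}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it with the log files as arguments. The lines just before each match are usually the interesting ones, since they tend to show what triggered the cancellation.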

Best,
Yangze Guo

On Wed, Aug 18, 2021 at 1:20 AM Abhishek Rai <abhis...@netspring.io> wrote:
>
> Before this message, there is the following message in the log:
>
> 2021-08-12 23:02:58.015 [Canceler/Interrupts for Source: MASKED]) 
> (1/1)#29103' did not react to cancelling signal for 30 seconds, but is stuck 
> in method:
>  java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(Unknown 
> Source)
> java.base@11.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown
>  Source)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:341)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.base@11.0.11/java.lang.Thread.run(Unknown Source)
>
> On Tue, Aug 17, 2021 at 9:22 AM Abhishek Rai <abhis...@netspring.io> wrote:
>>
>> Thanks Yangze, indeed, I see the following in the log about 10s before the 
>> final crash (masked some sensitive data using `MASKED`):
>>
>> 2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN 
>> org.apache.flink.runtime.taskmanager.Task  - Task 'MASKED' did not react to 
>> cancelling signal for 30 seconds, but is stuck in method:
>>  java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
>> java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
>> java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown
>>  Source)
>> java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown 
>> Source)
>> java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown 
>> Source)
>> java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
>> app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
>> app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
>> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
>> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
>> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
>> java.base@11.0.11/java.lang.Thread.run(Unknown Source)
>>
>> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR 
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error 
>> occurred while executing the TaskManager. Shutting it down...
>> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully 
>> within 180 + seconds.
>>   at 
>> org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
>>   at java.base/java.lang.Thread.run(Unknown Source)
>>
>>
>>
>> On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <karma...@gmail.com> wrote:
>>>
>>> Hi, Abhishek,
>>>
>>> Do you see something like "Fatal error occurred while executing the
>>> TaskManager" in your log or would you like to provide the whole task
>>> manager log?
>>>
>>> Best,
>>> Yangze Guo
>>>
>>> On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io> wrote:
>>> >
>>> > Hello,
>>> >
>>> > Our production environment runs Flink 1.13 (Scala 2.11) and had been 
>>> > working without issues for a while with a dozen or so jobs. The Flink 
>>> > taskmanager has now started crash looping, with a period of ~4 minutes 
>>> > per crash.  The stack trace is not very informative, so I am reaching 
>>> > out for help; see below.
>>> >
>>> > The only other thing that's unusual is that due to what might be a 
>>> > product issue (custom job code running on Flink), some or all of our 
>>> > tasks are also in a crash loop.  Still, I wasn't expecting taskmanager 
>>> > itself to die.  Does taskmanager have some built in feature to crash if 
>>> > all/most tasks are crashing?
>>> >
>>> > 2021-08-16 15:58:23.984 [main] ERROR 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Terminating 
>>> > TaskManagerRunner with exit code 1.
>>> > org.apache.flink.util.FlinkException: Unexpected failure during runtime 
>>> > of TaskManagerRunner.
>>> >   at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
>>> >   at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
>>> >   at java.base/java.security.AccessController.doPrivileged(Native Method)
>>> >   at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>> >   at 
>>> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>>> >   at 
>>> > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>> >   at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
>>> >   at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
>>> >   at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
>>> > Caused by: java.util.concurrent.TimeoutException: null
>>> >   at 
>>> > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
>>> >   at 
>>> > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>>> >   at 
>>> > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
>>> >   at 
>>> > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown 
>>> > Source)
>>> >   at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>>> >   at 
>>> > java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>>> >  Source)
>>> >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
>>> > Source)
>>> >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
>>> > Source)
>>> >   at java.base/java.lang.Thread.run(Unknown Source)
>>> > 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown 
>>> > hook] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  
>>> > - Shutting down TaskExecutorLocalStateStoresManager.
>>> >
>>> >
>>> > Thanks very much!
>>> >
>>> > Abhishek
