Before these messages, the log contains the following:

2021-08-12 23:02:58.015 [Canceler/Interrupts for Source: MASKED])
(1/1)#29103' did not react to cancelling signal for 30 seconds, but is
stuck in method:
 java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@11.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:341)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base@11.0.11/java.lang.Thread.run(Unknown Source)
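
For anyone reading this thread later, here is a rough sketch of the mechanism these two log lines suggest. A canceler repeatedly interrupts the task thread, and a watchdog gives up once a timeout has passed. This is not Flink's code. The class and method names are made up for illustration, and the timings are shortened. As far as I know, the 30 and 180 seconds in the logs correspond to the defaults of `task.cancellation.interval` and `task.cancellation.timeout`. The trace quoted below is parked in CompletableFuture.join(), which does not throw on interrupt, so the sketch uses that as its "stuck" case.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class CancellationWatchdogSketch {

    /**
     * Interrupts the task thread every intervalMillis until it exits or
     * timeoutMillis has elapsed. Returns true if the thread exited in time.
     */
    static boolean cancelAndAwait(Thread task, long intervalMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (task.isAlive()) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    // A real watchdog would escalate here, e.g. report a fatal error.
                    return false;
                }
                task.interrupt();
                // join(0) would wait forever, so always wait at least 1 ms.
                task.join(Math.max(1L, Math.min(intervalMillis, remainingMillis)));
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return !task.isAlive();
        }
    }

    private static Thread startDaemon(Runnable body) {
        Thread t = new Thread(body, "task-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Task blocked in CompletableFuture.join(), which keeps waiting when interrupted. */
    static boolean stuckTaskExitsInTime() {
        CompletableFuture<Void> neverCompleted = new CompletableFuture<>();
        Thread task = startDaemon(neverCompleted::join);
        boolean exited = cancelAndAwait(task, 20, 200);
        neverCompleted.complete(null); // release the thread so the demo cleans up
        return exited;
    }

    /** Task blocked in Thread.sleep(), which throws InterruptedException on interrupt. */
    static boolean cooperativeTaskExitsInTime() {
        Thread task = startDaemon(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted: fall through and let the thread end.
            }
        });
        return cancelAndAwait(task, 20, 2_000);
    }

    public static void main(String[] args) {
        System.out.println("stuck task exited in time: " + stuckTaskExitsInTime());
        System.out.println("cooperative task exited in time: " + cooperativeTaskExitsInTime());
    }
}
```

The first case never exits within the timeout, however often it is interrupted, while the second exits on the first interrupt. So the interesting question is what the future being joined during cleanup is waiting for.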

On Tue, Aug 17, 2021 at 9:22 AM Abhishek Rai <abhis...@netspring.io> wrote:

> Thanks Yangze, indeed, I see the following in the log about 10s before the
> final crash (masked some sensitive data using `MASKED`):
>
> 2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN
> org.apache.flink.runtime.taskmanager.Task  - Task 'MASKED' did not react to
> cancelling signal for 30 seconds, but is stuck in method:
>  java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
> java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
> app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.base@11.0.11/java.lang.Thread.run(Unknown Source)
>
> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully
> within 180 + seconds.
>   at
> org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
>   at java.base/java.lang.Thread.run(Unknown Source)
>
>
>
> On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <karma...@gmail.com> wrote:
>
>> Hi, Abhishek,
>>
>> Do you see something like "Fatal error occurred while executing the
>> TaskManager" in your log or would you like to provide the whole task
>> manager log?
>>
>> Best,
>> Yangze Guo
>>
>> On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io>
>> wrote:
>> >
>> > Hello,
>> >
>> > In our production environment, running Flink 1.13 (Scala 2.11), where
>> Flink has been working without issues with a dozen or so jobs running for a
>> while, Flink taskmanager started crash looping with a period of ~4 minutes
>> per crash.  The stack trace is not very informative, therefore reaching out
>> for help, see below.
>> >
>> > The only other thing that's unusual is that due to what might be a
>> product issue (custom job code running on Flink), some or all of our tasks
>> are also in a crash loop.  Still, I wasn't expecting taskmanager itself to
>> die.  Does taskmanager have some built in feature to crash if all/most
>> tasks are crashing?
>> >
>> > 2021-08-16 15:58:23.984 [main] ERROR
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Terminating
>> TaskManagerRunner with exit code 1.
>> > org.apache.flink.util.FlinkException: Unexpected failure during runtime
>> of TaskManagerRunner.
>> >   at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
>> >   at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
>> >   at java.base/java.security.AccessController.doPrivileged(Native
>> Method)
>> >   at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>> >   at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>> >   at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> >   at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
>> >   at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
>> >   at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
>> > Caused by: java.util.concurrent.TimeoutException: null
>> >   at
>> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
>> >   at
>> org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>> >   at
>> org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
>> >   at
>> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>> Source)
>> >   at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>> >   at
>> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>> Source)
>> >   at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> >   at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >   at java.base/java.lang.Thread.run(Unknown Source)
>> > 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown
>> hook] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
>> Shutting down TaskExecutorLocalStateStoresManager.
>> >
>> >
>> > Thanks very much!
>> >
>> > Abhishek
>>
>
