[ https://issues.apache.org/jira/browse/FLINK-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053971#comment-17053971 ]
Maximilian Michels commented on FLINK-10850:
--------------------------------------------

Just to update this here: if there is an OOM when starting the watchdog thread, the exception is not propagated correctly. The reason is that the job manager retries the cancelTask() request multiple times. The problem is that the operation is stateful: if we fail to start the watchdog thread, we won't attempt it again, because the task has already switched to the {{CANCELING}} state before the watchdog thread is started. A fix that works is to guard the {{start()}} call of the watchdog thread with a try/catch block and trigger a fatal task manager shutdown if we catch an error there.

> Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-10850
>                 URL: https://issues.apache.org/jira/browse/FLINK-10850
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5
>            Reporter: ouyangzhe
>            Priority: Major
>
> I encountered a job that ran out of memory (OOM) but hung in the FAILING state. It left 3 slots to release, and the corresponding task state is CANCELING.
> I found the following log in the taskmanager; it seems that the taskmanager tried to updateTaskExecutionState from CANCELING to CANCELED, but OOMed.
> {noformat}
> 2018-11-08 18:01:23,250 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING.
> 2018-11-08 18:01:23,257 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
> 2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from CANCELING to CANCELED.
> 2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
> 2018-11-08 18:02:03,097 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
> org.apache.flink.shaded.guava18.com.google.common.collect.Maps$EntryFunction$1.apply(Maps.java:86)
> org.apache.flink.shaded.guava18.com.google.common.collect.Iterators$8.transform(Iterators.java:799)
> org.apache.flink.shaded.guava18.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
> java.util.AbstractCollection.toArray(AbstractCollection.java:141)
> org.apache.flink.shaded.guava18.com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:100)
> org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:275)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:833)
> java.lang.Thread.run(Thread.java:745)
> 2018-11-08 18:02:05,665 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution e9141e20871e530dee904ddce11adca0.
> 2018-11-08 18:02:22,536 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution 7fac76a5d76247d803e1f1c47a6b385f.
> 2018-11-08 18:03:47,210 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
> org.apache.flink.runtime.memory.MemoryManager.releaseAll(MemoryManager.java:497)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:837)
> java.lang.Thread.run(Thread.java:745)
> 2018-11-08 18:03:47,213 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) [CANCELED]
> 2018-11-08 18:03:47,215 WARN org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline - An exception was thrown by a user handler while handling an exception event ([id: 0x397132f7, /11.10.199.197:33286 => /11.9.137.228:40859] EXCEPTION: java.lang.OutOfMemoryError: GC overhead limit exceeded)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
>     at org.apache.flink.shaded.akka.org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
>     at org.apache.flink.shaded.akka.org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
>     at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
>     at org.apache.flink.shaded.akka.org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
>     at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.extractFrame(FrameDecoder.java:566)
>     at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:391)
>     at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
>     at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>     at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>     at org.apache.flink.shaded.akka.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>     at org.apache.flink.shaded.akka.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
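
For illustration, the guard proposed in the comment (wrapping the watchdog thread's {{start()}} call in a try/catch and escalating to a fatal shutdown) might look roughly like the sketch below. This is not the actual Flink patch: {{WatchdogStartGuard}}, {{startGuarded}}, and {{FatalErrorHandler}} are hypothetical names chosen for this example, and the handler merely stands in for whatever triggers the fatal task manager shutdown.

```java
/**
 * Sketch only: all names here are hypothetical and not taken from the Flink code base.
 */
final class WatchdogStartGuard {

    /** Hypothetical callback standing in for the task manager's fatal-error path. */
    interface FatalErrorHandler {
        void onFatalError(String message, Throwable cause);
    }

    private WatchdogStartGuard() {}

    /**
     * Tries to start the watchdog thread.
     *
     * @return true if start() returned normally, false if it threw and the
     *         failure was handed to the fatal-error handler
     */
    static boolean startGuarded(Thread watchdog, FatalErrorHandler handler) {
        try {
            watchdog.start();
            return true;
        } catch (Throwable t) {
            // Thread.start() can fail with an Error such as OutOfMemoryError.
            // The task is already CANCELING at this point, so a retried cancel
            // request would not start the watchdog again. Escalate instead of
            // swallowing the failure.
            handler.onFatalError("Could not start cancellation watchdog thread", t);
            return false;
        }
    }
}
```

Catching {{Throwable}} rather than {{Exception}} is deliberate in this sketch, since {{OutOfMemoryError}} is an {{Error}} and would otherwise escape the guard.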