[ https://issues.apache.org/jira/browse/FLINK-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephan Ewen closed FLINK-8834. ------------------------------- > Job fails to restart due to some tasks stuck in cancelling state > ---------------------------------------------------------------- > > Key: FLINK-8834 > URL: https://issues.apache.org/jira/browse/FLINK-8834 > Project: Flink > Issue Type: Bug > Affects Versions: 1.4.0 > Environment: AWS EMR 5.12 > Flink 1.4.0 > Beam 2.3.0 > Reporter: Daniel Harper > Priority: Major > Fix For: 1.5.0 > > > Our job threw an exception overnight, causing the job to commence attempting > a restart. > However it never managed to restart because 2 tasks on one of the Task > Managers are stuck in "Cancelling" state, with the following exception > {code:java} > 2018-03-02 02:29:31,604 WARN org.apache.flink.runtime.taskmanager.Task > - Task 'PTransformTranslation.UnknownRawPTransform -> > ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> > uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out > -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (24/32)' did not react to > cancelling signal, but is stuck in method: > java.lang.Thread.blockedOn(Thread.java:239) > java.lang.System$2.blockedOn(System.java:1252) > java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211) > java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170) > java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457) > java.nio.channels.Channels.writeFullyImpl(Channels.java:78) > java.nio.channels.Channels.writeFully(Channels.java:101) > java.nio.channels.Channels.access$000(Channels.java:61) > java.nio.channels.Channels$1.write(Channels.java:174) > java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) > java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) > java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) > java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458) > java.nio.channels.Channels.writeFullyImpl(Channels.java:78) > java.nio.channels.Channels.writeFully(Channels.java:101) > java.nio.channels.Channels.access$000(Channels.java:61) > java.nio.channels.Channels$1.write(Channels.java:174) > sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) > sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) > sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135) > java.io.OutputStreamWriter.write(OutputStreamWriter.java:220) > java.io.Writer.write(Writer.java:157) > org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102) > org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118) > org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76) > org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550) > org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112) > org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718) > org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) > org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138) > org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504) > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831) > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87) > org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040) > org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown > Source) > org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433) > org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127) > org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043) > org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911) > org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) > org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141) > org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74) > org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) > org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767) > org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501) > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) > java.lang.Thread.run(Thread.java:748) > 2018-03-02 02:29:32,332 WARN org.apache.flink.runtime.taskmanager.Task > - Task 'PTransformTranslation.UnknownRawPTransform -> > ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> > uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord2/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out > -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (22/32)' did not react to > cancelling signal, but is stuck in method: > java.lang.Thread.blockedOn(Thread.java:239) > java.lang.System$2.blockedOn(System.java:1252) > java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211) > java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170) > java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457) > java.nio.channels.Channels.writeFullyImpl(Channels.java:78) > java.nio.channels.Channels.writeFully(Channels.java:101) > java.nio.channels.Channels.access$000(Channels.java:61) > java.nio.channels.Channels$1.write(Channels.java:174) > java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) > java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) > java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) > java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458) > java.nio.channels.Channels.writeFullyImpl(Channels.java:78) > java.nio.channels.Channels.writeFully(Channels.java:101) > java.nio.channels.Channels.access$000(Channels.java:61) > java.nio.channels.Channels$1.write(Channels.java:174) > sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) > sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) > sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135) > java.io.OutputStreamWriter.write(OutputStreamWriter.java:220) > java.io.Writer.write(Writer.java:157) > org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102) > org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118) > org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76) > org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550) > org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112) > org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718) > org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) > org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138) > org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524) > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504) > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831) > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87) > org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040) > org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown > Source) > org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433) > org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127) > org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043) > org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911) > org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134) > org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) > org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141) > org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74) > org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) > org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767) > org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532) > org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501) > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) > java.lang.Thread.run(Thread.java:748) > {code} > I can see a bit further up in the logs the following exceptions too (although > not sure if they are related) - this exception looks similar to FLINK-8751 > {code:java} > 2018-03-02 02:29:07,094 ERROR > org.apache.flink.streaming.runtime.tasks.StreamTask - Could not > shut down timer service > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067) > at > java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) > at > org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718){code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)