Daniel Harper created FLINK-8834: ------------------------------------ Summary: Job fails to restart due to some tasks stuck in cancelling state Key: FLINK-8834 URL: https://issues.apache.org/jira/browse/FLINK-8834 Project: Flink Issue Type: Bug Affects Versions: 1.4.0 Environment: EMR 5.12
Flink 1.4.0 Beam 2.3.0 Reporter: Daniel Harper Our job threw an exception overnight, causing the job to commence attempting a restart. However it never managed to restart because 2 tasks are stuck in "Cancelling" state, with the following exception {code:java} 2018-03-02 02:29:31,604 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (24/32)' did not react to cancelling signal, but is stuck in method: java.lang.Thread.blockedOn(Thread.java:239) java.lang.System$2.blockedOn(System.java:1252) java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211) java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170) java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457) java.nio.channels.Channels.writeFullyImpl(Channels.java:78) java.nio.channels.Channels.writeFully(Channels.java:101) java.nio.channels.Channels.access$000(Channels.java:61) java.nio.channels.Channels$1.write(Channels.java:174) java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458) java.nio.channels.Channels.writeFullyImpl(Channels.java:78) java.nio.channels.Channels.writeFully(Channels.java:101) java.nio.channels.Channels.access$000(Channels.java:61) java.nio.channels.Channels$1.write(Channels.java:174) sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135) java.io.OutputStreamWriter.write(OutputStreamWriter.java:220) java.io.Writer.write(Writer.java:157) org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102) org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118) org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76) org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550) org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112) org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718) org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138) org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504) org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831) org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87) org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040) org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source) org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433) org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127) org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043) org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911) org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141) org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74) org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767) org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501) org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) java.lang.Thread.run(Thread.java:748) 2018-03-02 02:29:32,332 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord2/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (22/32)' did not react to cancelling signal, but is stuck in method: java.lang.Thread.blockedOn(Thread.java:239) java.lang.System$2.blockedOn(System.java:1252) java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211) java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170) java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457) java.nio.channels.Channels.writeFullyImpl(Channels.java:78) java.nio.channels.Channels.writeFully(Channels.java:101) java.nio.channels.Channels.access$000(Channels.java:61) java.nio.channels.Channels$1.write(Channels.java:174) java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458) java.nio.channels.Channels.writeFullyImpl(Channels.java:78) java.nio.channels.Channels.writeFully(Channels.java:101) java.nio.channels.Channels.access$000(Channels.java:61) java.nio.channels.Channels$1.write(Channels.java:174) sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135) java.io.OutputStreamWriter.write(OutputStreamWriter.java:220) java.io.Writer.write(Writer.java:157) org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102) org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118) org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76) org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550) org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112) org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718) org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138) org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524) org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504) org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831) org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87) org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040) org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source) org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433) org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127) org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043) org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911) org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134) org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177) org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141) org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74) org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65) org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767) org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532) org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501) org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) java.lang.Thread.run(Thread.java:748) {code} I can see a bit further up in the logs the following exceptions too (although not sure if they are related) - this exception looks similar to FLINK-8751 {code:java} 2018-03-02 02:29:07,094 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Could not shut down timer service java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)