Hi Flink help,

I am new to Flink.
I am investigating a Flink app that cannot restart when we lose YARN node
managers (tc.yarn.rm.cluster.NumActiveNMs=0), while our other Flink apps
restart automatically.

*Here is the job's restart strategy setting:*

*env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000,
org.apache.flink.api.common.time.Time.seconds(30)));*
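
As a rough sanity check on what this setting should allow, here is a minimal sketch of the arithmetic. It assumes the strategy simply waits the fixed delay before each of the allowed restart attempts; `RestartBudget` and `minimumRetryWindow` are hypothetical names used only for illustration and are not part of Flink.

```java
import java.time.Duration;

public class RestartBudget {
    // Hypothetical helper (not Flink API): the minimum time a fixed-delay
    // strategy spends waiting before it runs out of restart attempts,
    // assuming one full delay is waited before each attempt.
    public static Duration minimumRetryWindow(int restartAttempts, Duration delayBetweenAttempts) {
        return delayBetweenAttempts.multipliedBy(restartAttempts);
    }

    public static void main(String[] args) {
        // Values taken from the setting above: 1000 attempts, 30 seconds apart.
        Duration window = minimumRetryWindow(1000, Duration.ofSeconds(30));
        System.out.println(window.getSeconds() + " s (" + window.toMinutes() + " min)");
    }
}
```

Under that assumption the job should keep retrying for at least 30000 seconds (500 minutes) before the attempts are used up.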

*Here is the JobManager log:*

2020-07-15 20:26:27,831 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job switched from state RUNNING to FAILING.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
    at java.lang.Thread.run(Thread.java:748)


*Here is part of the YARN node manager log:*

2020-07-15 20:57:11.927858: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from
2020-07-15 20:57:11.928419: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags
2020-07-15 20:57:11.928923: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-07-15 20:57:11.935924: I tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2020-07-15 20:57:11.939271: I tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key saved_model_main_op on SavedModel bundle.
2020-07-15 20:57:11.944583: I tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags; Status: success. Took 16732 microseconds.
2020-07-15 20:58:13.356004: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count


Any idea why this app's restart strategy doesn't take effect?
Thanks
Best regards
Rainie
