Could you check whether the JobManager was also running on the lost Yarn NodeManager? If that is the case, you need to configure "yarn.application-attempts" to a value bigger than 1. The job's restart strategy is executed by the JobManager itself, so it cannot recover the job when the JobManager goes down with the NodeManager; with more application attempts, Yarn can restart the application master (and with it the JobManager).
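For example, in flink-conf.yaml (the attempt count below is only an illustrative value, pick what fits your cluster):

    # allow Yarn to restart the Flink ApplicationMaster/JobManager after a failure
    yarn.application-attempts: 4

Note that the effective number of attempts is also capped by the Yarn cluster setting "yarn.resourcemanager.am.max-attempts".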
BTW, the logs you provided are not Yarn NodeManager logs. Also, if you could provide the full JobManager log, it would help a lot.

Best,
Yang

Rainie Li <raini...@pinterest.com> wrote on Wed, Jul 22, 2020 at 3:54 PM:

> Hi Flink help,
>
> I am new to Flink.
> I am investigating one Flink app that cannot restart when we lose the Yarn
> node manager (tc.yarn.rm.cluster.NumActiveNMs=0), while other Flink apps
> can restart automatically.
>
> *Here is the job's restartPolicy setting:*
>
> *env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000,
> org.apache.flink.api.common.time.Time.seconds(30)));*
>
> *Here is the Job Manager log:*
>
> 2020-07-15 20:26:27,831 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job switched from state RUNNING to FAILING.
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
>     at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
>     at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>     at java.lang.Thread.run(Thread.java:748)
>
> *Here is some Yarn node manager log:*
>
> 2020-07-15 20:57:11.927858: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from
> 2020-07-15 20:57:11.928419: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags
> 2020-07-15 20:57:11.928923: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
> 2020-07-15 20:57:11.935924: I tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
> 2020-07-15 20:57:11.939271: I tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key saved_model_main_op on SavedModel bundle.
> 2020-07-15 20:57:11.944583: I tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags; Status: success. Took 16732 microseconds.
> 2020-07-15 20:58:13.356004: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count
>
> Any idea why this app's restartPolicy doesn't work?
> Thanks
> Best regards
> Rainie