Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Hi Nicolaus, I double-checked our HDFS config again, and it is set to 1 instead of 2. I will try the solution you provided. Thanks again. Best regards Rainie On Wed, Aug 4, 2021 at 10:40 AM Rainie Li wrote: > Thanks for the context, Nicolaus. > We are using S3 instead of HDFS. > > Best regards > R

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Thanks for the context, Nicolaus. We are using S3 instead of HDFS. Best regards Rainie On Wed, Aug 4, 2021 at 12:39 AM Nicolaus Weidner <nicolaus.weid...@ververica.com> wrote: > Hi Rainie, > > I found a similar issue on Stack Overflow, though with a quite different > stack trace: > https://stackoverflow.

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Thanks, Till. We terminated one of the worker nodes. We enabled HA using ZooKeeper. Sure, we will try upgrading the job to a newer version. Best regards Rainie On Tue, Aug 3, 2021 at 11:57 PM Till Rohrmann wrote: > Hi Rainie, > > It looks to me as if Yarn is causing this problem. Which Yarn node are

Re: Flink job failure during yarn node termination

2021-08-03 Thread Till Rohrmann
Hi Rainie, It looks to me as if Yarn is causing this problem. Which Yarn node are you terminating? Have you configured your Yarn cluster to be highly available in case you are terminating the ResourceManager? Flink should retry the operation of starting a new container in case it fails. If this i

Flink job failure during yarn node termination

2021-08-03 Thread Rainie Li
Hi Flink Community, My Flink application runs version 1.9, and it failed to recover (the application kept running, but checkpoints failed and the job stopped processing data) during a Hadoop YARN node termination. *Here is the job manager error log:* *2021-07-26 18:02:58,605 INFO org.apache.hadoop.io.retry.