Thanks Piotr. This is helpful.

Thomas
On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi,
>
> You should still be able to get the Flink logs via:
>
> > yarn logs -applicationId application_1623861596410_0010
>
> And it should give you more answers about what has happened.
>
> About the Flink and YARN behaviour, have you seen the documentation? [1]
> Especially this part:
>
> > Failed containers (including the JobManager) are replaced by YARN. The
> > maximum number of JobManager container restarts is configured via
> > yarn.application-attempts (default 1). The YARN Application will fail once
> > all attempts are exhausted.
>
> ?
>
> Best,
> Piotrek
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference
>
> pon., 28 cze 2021 o 02:26 Thomas Wang <w...@datability.io> napisał(a):
>
>> Just found some additional info. It looks like one of the EC2 instances
>> got terminated at the time the crash happened, and this job had 7 Task
>> Managers running on that EC2 instance. Now I suspect it's possible that
>> when YARN tried to migrate the Task Managers, there were no idle
>> containers, as this job was using roughly 99% of the entire cluster.
>> However, in that case shouldn't YARN wait for containers to become
>> available? I'm not quite sure how Flink would behave in this case. Could
>> someone provide some insights here? Thanks.
>>
>> Thomas
>>
>> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>>
>>> Hi,
>>>
>>> I recently experienced a job crash due to the underlying YARN
>>> application failing for some reason. Here is the only error message I
>>> saw. It seems I can no longer see any of the Flink job logs.
>>>
>>> Application application_1623861596410_0010 failed 1 times (global limit
>>> =2; local limit is =1) due to ApplicationMaster for attempt
>>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>>
>>> I was running the Flink job using the YARN session mode with the
>>> following command:
>>>
>>> export HADOOP_CLASSPATH=`hadoop classpath` && \
>>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>>
>>> I didn't have HA set up, but I believe the underlying YARN application
>>> caused the crash, because if the Flink job itself had failed for some
>>> reason, the YARN application should still survive. Please correct me if
>>> this is not the right assumption.
>>>
>>> My question is how I should find the root cause in this case, and what's
>>> the recommended way to avoid this going forward?
>>>
>>> Thanks.
>>>
>>> Thomas
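
[Editor's note: as a sketch of the fix Piotrek's documentation quote points at, the JobManager restart limit can be raised when starting the session. This assumes the `-D` dynamic-property syntax of yarn-session.sh in Flink 1.13, and note YARN's own cap (`yarn.resourcemanager.am.max-attempts` in yarn-site.xml) must be at least as large, or YARN will fail the application first.]

    # Sketch: allow up to 3 ApplicationMaster (JobManager) attempts before
    # YARN gives up on the application, instead of the default of 1.
    # Requires yarn.resourcemanager.am.max-attempts >= 3 on the cluster.
    export HADOOP_CLASSPATH=`hadoop classpath`
    /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached \
      -Dyarn.application-attempts=3

[This is a config fragment for a running YARN cluster, not a standalone script; with high availability enabled, the attempt counter can additionally be reset over time via yarn.application-attempt-failures-validity-interval.]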