Hi Averell,

> Is there any way to avoid this? As if I run this as an AWS EMR job, the job
> would be considered failed, while it is actually be restored automatically by
> YARN after 10 minutes).

You write that it takes YARN 10 minutes to restart the application master (AM).
However, in my experiments the AM container is restarted within a few seconds
after the process is killed. If in your setup YARN actually needs 10 minutes to
restart the AM, then you could try increasing the number of retry attempts by
the client [2].

> Regarding logging, could you please help explain about the source of the
> error messages show in "Exception" tab on Flink Job GUI (as per the
> screenshot below).

The REST API that is queried by the Web UI returns the root cause from the
ExecutionGraph [3]. All job status transitions should be logged together with
the exception that caused the transition [4]. Check for INFO level log messages
that start with "Job [...] switched from state" followed by a stacktrace. If you
cannot find the exception, the problem might be rooted in your log4j or logback
configuration.

Best,
Gary

[1] https://github.com/apache/flink/blob/81acd0a490f3ac40cbb2736189796138ac109dd0/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L767
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#rest-retry-max-attempts
[3] https://github.com/apache/flink/blob/81acd0a490f3ac40cbb2736189796138ac109dd0/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L87
[4] https://github.com/apache/flink/blob/81acd0a490f3ac40cbb2736189796138ac109dd0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java#L1363

On Fri, Jan 25, 2019 at 12:42 PM Averell <lvhu...@gmail.com> wrote:

> Hi Gary,
>
> Yes, my problem mentioned in the original post had been resolved by
> correcting the zookeeper connection string.
>
> I have two other relevant questions, if you have time, please help:
>
> 1.
> Regarding JM high availability, when I shut down the host having JM
> running, YARN would detect that missing JM and start a new one after 10
> minutes, and the Flink job would be restored. However, on the console
> screen that I submitted the job, I got the following error messages:
> "/The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException/" (full stack
> trace in the attached file flink_console_timeout.log
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/flink_console_timeout.log>)
> Is there any way to avoid this? As if I run this as an AWS EMR job, the job
> would be considered failed, while it is actually be restored automatically
> by YARN after 10 minutes).
>
> 2. Regarding logging, could you please help explain about the source of the
> error messages show in "Exception" tab on Flink Job GUI (as per the
> screenshot below). I could not find any log files has that message (not in
> jobmanager.log or in taskmanager.log in EMR's hadoop-yarn logs folder).
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Screen_Shot_2019-01-25_at_22.png>
>
> Thanks and regards,
> Averell
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
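
A note on the retry suggestion in the reply: the fragment below is a minimal sketch of how the client-side retry settings might look in flink-conf.yaml on the machine that submits the job. The option rest.retry.max-attempts is the one linked as [2]. rest.retry.delay is, as far as I know, documented in the same table of that page. The values are illustrative assumptions only, picked so that attempts multiplied by delay roughly covers a 10-minute AM restart if the client waits a fixed delay between retries. Please check the defaults and semantics for your Flink version before relying on them.

```yaml
# flink-conf.yaml used by the submitting client (illustrative values, not recommendations)

# Number of times the REST client retries a retryable operation.
rest.retry.max-attempts: 250

# Time to wait between retries, in milliseconds.
rest.retry.delay: 3000
```

With these example values the client would keep retrying for roughly 250 x 3 s = 12.5 minutes, not counting the time each failed attempt itself takes.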