Hi Dongwon, I think the root cause is that the GenericCLI does not override "high-availability.cluster-id" with the specified application id. The GenericCLI is activated by "--target yarn-per-job". In the FlinkYarnSessionCli we already do this, so the following command should work with or without ZooKeeper HA configured.
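To make the missing override concrete, it amounts to something like the sketch below. This is hypothetical illustration code, not the actual GenericCLI source: a plain Map stands in for Flink's Configuration, and the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the fix: when "high-availability.cluster-id" is not
 * set explicitly, derive it from "yarn.application.id" so that ZooKeeper
 * leader retrieval looks under the znode the per-job cluster registered,
 * instead of the "default" cluster-id.
 */
public class ClusterIdOverride {
    static final String HA_CLUSTER_ID = "high-availability.cluster-id";
    static final String YARN_APP_ID = "yarn.application.id";

    static Map<String, String> applyOverride(Map<String, String> conf) {
        Map<String, String> effective = new HashMap<>(conf);
        // Only fill in the cluster-id if the user did not set one explicitly.
        if (!effective.containsKey(HA_CLUSTER_ID) && effective.containsKey(YARN_APP_ID)) {
            effective.put(HA_CLUSTER_ID, effective.get(YARN_APP_ID));
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(YARN_APP_ID, "application_1600163418174_0127");
        Map<String, String> fixed = applyOverride(conf);
        System.out.println(HA_CLUSTER_ID + " = " + fixed.get(HA_CLUSTER_ID));
    }
}
```

With such a default in place, leader retrieval would query the right path, which is also why passing -Dhigh-availability.cluster-id=$application_id manually works as a workaround.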
*./bin/flink list -m yarn-cluster -yid $applicationId*

You could also specify "high-availability.cluster-id" so that leader retrieval can find the correct JobManager address:

*flink list --target yarn-per-job -Dyarn.application.id=$application_id -Dhigh-availability.cluster-id=$application_id*

BTW, this is not a newly introduced behavior change in Flink 1.12; I believe it did not work in 1.11 and 1.10 either.

Best,
Yang

Dongwon Kim <eastcirc...@gmail.com> wrote on Tue, Jan 5, 2021 at 11:22 PM:

> Hi,
>
> I'm using Flink-1.12.0 and running on Hadoop YARN.
>
> After setting HA-related properties in flink-conf.yaml,
>
> high-availability: zookeeper
> high-availability.zookeeper.path.root: /recovery
> high-availability.zookeeper.quorum: nm01:2181,nm02:2181,nm03:2181
> high-availability.storageDir: hdfs:///flink/recovery
>
> the following command hangs and fails:
>
> $ flink list --target yarn-per-job -Dyarn.application.id=$application_id
>
> Before setting the properties, I can see the following lines after executing the above command:
>
> 2021-01-06 00:11:48,961 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Hadoop user set to deploy (auth:SIMPLE)
> 2021-01-06 00:11:48,968 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-8522045433029410483.conf.
> 2021-01-06 00:11:48,976 INFO org.apache.flink.client.cli.CliFrontend [] - Running 'list' command.
> 2021-01-06 00:11:49,316 INFO org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at nm02/10.93.0.91:10200
> 2021-01-06 00:11:49,324 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed.
> Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2021-01-06 00:11:49,333 WARN org.apache.flink.yarn.YarnClusterDescriptor [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set.The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
> 2021-01-06 00:11:49,404 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface dn03:37098 of application 'application_1600163418174_0127'.
> 2021-01-06 00:11:49,758 INFO org.apache.flink.client.cli.CliFrontend [] - Waiting for response...
>
> Waiting for response...
>
> 2021-01-06 00:11:49,863 INFO org.apache.flink.client.cli.CliFrontend [] - Successfully retrieved list of jobs
>
> ------------------ Running/Restarting Jobs -------------------
> 31.12.2020 01:22:34 : 76fc265c44ef44ae343ab15868155de6 : stream calculator (RUNNING)
> --------------------------------------------------------------
> No scheduled jobs.
>
> After:
>
> 2021-01-06 00:06:38,971 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Hadoop user set to deploy (auth:SIMPLE)
> 2021-01-06 00:06:38,976 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-3613274701724362777.conf.
> 2021-01-06 00:06:38,982 INFO org.apache.flink.client.cli.CliFrontend [] - Running 'list' command.
> 2021-01-06 00:06:39,304 INFO org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at nm02/10.93.0.91:10200
> 2021-01-06 00:06:39,312 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed.
> Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2021-01-06 00:06:39,320 WARN org.apache.flink.yarn.YarnClusterDescriptor [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set.The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
> 2021-01-06 00:06:39,388 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface dn03:37098 of application 'application_1600163418174_0127'.
> 2021-01-06 00:06:39,399 INFO org.apache.flink.runtime.util.ZooKeeperUtils [] - Enforcing default ACL for ZK connections
> 2021-01-06 00:06:39,399 INFO org.apache.flink.runtime.util.ZooKeeperUtils [] - Using '/recovery/default' as Zookeeper namespace.
> 2021-01-06 00:06:39,425 INFO org.apache.flink.shaded.curator4.org.apache.curator.utils.Compatibility [] - Running in ZooKeeper 3.4.x compatibility mode
> 2021-01-06 00:06:39,425 INFO org.apache.flink.shaded.curator4.org.apache.curator.utils.Compatibility [] - Using emulated InjectSessionExpiration
> 2021-01-06 00:06:39,447 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
> 2021-01-06 00:06:39,455 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Initiating client connection, connectString=nm01:2181,nm02:2181,nm03:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState@7668d560
> 2021-01-06 00:06:39,466 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
> 2021-01-06 00:06:39,466 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file:
> '/tmp/jaas-3613274701724362777.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2021-01-06 00:06:39,467 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server nm01/10.93.0.32:2181
> 2021-01-06 00:06:39,467 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to nm01/10.93.0.32:2181, initiating session
> 2021-01-06 00:06:39,467 ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - Authentication failed
> 2021-01-06 00:06:39,477 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server nm01/10.93.0.32:2181, sessionid = 0x176d1f2c2280016, negotiated timeout = 60000
> 2021-01-06 00:06:39,478 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
> 2021-01-06 00:06:39,658 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/rest_server_lock'}.
> 2021-01-06 00:06:39,667 INFO org.apache.flink.client.cli.CliFrontend [] - Waiting for response...
>
> Waiting for response...
>
> # here it took almost 30 seconds
>
> 2021-01-06 00:07:09,670 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
> 2021-01-06 00:07:09,670 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/rest_server_lock'}.
> 2021-01-06 00:07:09,671 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
> 2021-01-06 00:07:09,679 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x176d1f2c2280016 closed
> 2021-01-06 00:07:09,679 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x176d1f2c2280016
> 2021-01-06 00:07:09,680 ERROR org.apache.flink.client.cli.CliFrontend [] - Error while running the command.
> org.apache.flink.util.FlinkException: Failed to retrieve job list.
> at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:436) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:418) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:919) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:415) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:977) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1047) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_222]
> at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_222]
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) [hadoop-common-3.1.1.3.1.4.0-315.jar:?]
> at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) [flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1047) [flink-dist_2.11-1.12.0.jar:1.12.0]
> Caused by: java.util.concurrent.TimeoutException
> at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1168) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:549) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_222]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_222]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_222]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_222]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_222]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_222]
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_222]
>
> ------------------------------------------------------------
> The program finished with the following exception:
>
> org.apache.flink.util.FlinkException: Failed to retrieve job list.
> at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:436)
> at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:418)
> at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:919)
> at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:415)
> at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:977)
> at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1047)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1047)
> Caused by: java.util.concurrent.TimeoutException
> at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1168)
> at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
> at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:549)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Why is the zookeeper specified for HA used in this process?
>
> No way to avoid such behavior?
>
> Best,
>
> Dongwon