Hi,

I'm using Flink 1.12.0 on Hadoop YARN.

After I set the following HA-related properties in flink-conf.yaml,

    high-availability: zookeeper
    high-availability.zookeeper.path.root: /recovery
    high-availability.zookeeper.quorum: nm01:2181,nm02:2181,nm03:2181
    high-availability.storageDir: hdfs:///flink/recovery

the following command hangs for about 30 seconds and then fails:

    $ flink list --target yarn-per-job -Dyarn.application.id=$application_id

Before I set these properties, the command succeeded and printed the following:

2021-01-06 00:11:48,961 INFO  org.apache.flink.runtime.security.modules.HadoopModule [] - Hadoop user set to deploy (auth:SIMPLE)
2021-01-06 00:11:48,968 INFO  org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-8522045433029410483.conf.
2021-01-06 00:11:48,976 INFO  org.apache.flink.client.cli.CliFrontend [] - Running 'list' command.
2021-01-06 00:11:49,316 INFO  org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at nm02/10.93.0.91:10200
2021-01-06 00:11:49,324 INFO  org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2021-01-06 00:11:49,333 WARN  org.apache.flink.yarn.YarnClusterDescriptor [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set.The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
2021-01-06 00:11:49,404 INFO  org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface dn03:37098 of application 'application_1600163418174_0127'.
2021-01-06 00:11:49,758 INFO  org.apache.flink.client.cli.CliFrontend [] - Waiting for response...
Waiting for response...
2021-01-06 00:11:49,863 INFO  org.apache.flink.client.cli.CliFrontend [] - Successfully retrieved list of jobs
------------------ Running/Restarting Jobs -------------------
31.12.2020 01:22:34 : 76fc265c44ef44ae343ab15868155de6 : stream calculator (RUNNING)
--------------------------------------------------------------
No scheduled jobs.

After I set them, the command hangs and then fails with a TimeoutException:

2021-01-06 00:06:38,971 INFO  org.apache.flink.runtime.security.modules.HadoopModule [] - Hadoop user set to deploy (auth:SIMPLE)
2021-01-06 00:06:38,976 INFO  org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-3613274701724362777.conf.
2021-01-06 00:06:38,982 INFO  org.apache.flink.client.cli.CliFrontend [] - Running 'list' command.
2021-01-06 00:06:39,304 INFO  org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at nm02/10.93.0.91:10200
2021-01-06 00:06:39,312 INFO  org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2021-01-06 00:06:39,320 WARN  org.apache.flink.yarn.YarnClusterDescriptor [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set.The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
2021-01-06 00:06:39,388 INFO  org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface dn03:37098 of application 'application_1600163418174_0127'.
2021-01-06 00:06:39,399 INFO  org.apache.flink.runtime.util.ZooKeeperUtils [] - Enforcing default ACL for ZK connections
2021-01-06 00:06:39,399 INFO  org.apache.flink.runtime.util.ZooKeeperUtils [] - Using '/recovery/default' as Zookeeper namespace.
2021-01-06 00:06:39,425 INFO  org.apache.flink.shaded.curator4.org.apache.curator.utils.Compatibility [] - Running in ZooKeeper 3.4.x compatibility mode
2021-01-06 00:06:39,425 INFO  org.apache.flink.shaded.curator4.org.apache.curator.utils.Compatibility [] - Using emulated InjectSessionExpiration
2021-01-06 00:06:39,447 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
2021-01-06 00:06:39,455 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Initiating client connection, connectString=nm01:2181,nm02:2181,nm03:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState@7668d560
2021-01-06 00:06:39,466 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
2021-01-06 00:06:39,466 WARN  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3613274701724362777.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2021-01-06 00:06:39,467 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server nm01/10.93.0.32:2181
2021-01-06 00:06:39,467 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to nm01/10.93.0.32:2181, initiating session
2021-01-06 00:06:39,467 ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - Authentication failed
2021-01-06 00:06:39,477 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server nm01/10.93.0.32:2181, sessionid = 0x176d1f2c2280016, negotiated timeout = 60000
2021-01-06 00:06:39,478 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
2021-01-06 00:06:39,658 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/rest_server_lock'}.
2021-01-06 00:06:39,667 INFO  org.apache.flink.client.cli.CliFrontend [] - Waiting for response...
Waiting for response...


# (the command hangs here for almost 30 seconds)


2021-01-06 00:07:09,670 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2021-01-06 00:07:09,670 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/rest_server_lock'}.
2021-01-06 00:07:09,671 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
2021-01-06 00:07:09,679 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x176d1f2c2280016 closed
2021-01-06 00:07:09,679 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x176d1f2c2280016
2021-01-06 00:07:09,680 ERROR org.apache.flink.client.cli.CliFrontend [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
    at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:436) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:418) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:919) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:415) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:977) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1047) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_222]
    at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_222]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) [hadoop-common-3.1.1.3.1.4.0-315.jar:?]
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) [flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1047) [flink-dist_2.11-1.12.0.jar:1.12.0]
Caused by: java.util.concurrent.TimeoutException
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1168) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:549) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_222]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_222]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_222]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_222]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_222]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_222]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_222]



Why does the client use the ZooKeeper quorum configured for HA when it only needs to list jobs?

Is there a way to avoid this behavior?

Best,

Dongwon
