Hi Till, I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment; I hope they help with debugging:
*Starting Jobmanager-1:*
Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
2018-08-29 11:41:49,134 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 2018-08-29 11:41:49,210 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: flink 2018-08-29 11:41:49,210 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13 2018-08-29 11:41:49,210 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 6702 MiBytes 2018-08-29 11:41:49,210 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /docker-java-home/jre 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.7.5 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options: 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink/conf 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --executionMode 2018-08-29 11:41:49,213 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster 2018-08-29 11:41:49,214 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 2018-08-29 11:41:49,214 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster 2018-08-29 11:41:49,214 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: 
/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: 2018-08-29 11:41:49,214 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -------------------------------------------------------------------------------- 2018-08-29 11:41:49,215 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT] 2018-08-29 11:41:49,221 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1 2018-08-29 11:41:49,221 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2018-08-29 11:41:49,221 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 8192 2018-08-29 11:41:49,221 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 8192 2018-08-29 11:41:49,221 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4 2018-08-29 11:41:49,222 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability, zookeeper 2018-08-29 11:41:49,222 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability 2018-08-29 11:41:49,222 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181 2018-08-29 11:41:49,222 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.path.root, /flink 2018-08-29 11:41:49,223 INFO 
org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.jobmanager.port, 50010 2018-08-29 11:41:49,223 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2018-08-29 11:41:49,223 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints 2018-08-29 11:41:49,223 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints 2018-08-29 11:41:49,223 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.incremental, false 2018-08-29 11:41:49,224 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020 2018-08-29 11:41:49,224 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081 2018-08-29 11:41:49,224 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.upload.dir, /opt/flink/upload 2018-08-29 11:41:49,224 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125 2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4 2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces. 
2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.storage.directory, /opt/flink/blob-server 2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124 2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124 2018-08-29 11:41:49,225 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125 2018-08-29 11:41:49,239 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint. 2018-08-29 11:41:49,239 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem. 2018-08-29 11:41:49,250 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install security context. 2018-08-29 11:41:49,282 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE) 2018-08-29 11:41:49,298 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services. 
2018-08-29 11:41:49,309 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at flink-jobmanager-1:50010 2018-08-29 11:41:49,768 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2018-08-29 11:41:49,823 INFO akka.remote.Remoting - Starting remoting 2018-08-29 11:41:49,974 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010] 2018-08-29 11:41:49,981 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010 2018-08-29 11:41:50,444 INFO org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob 2018-08-29 11:41:50,509 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections 2018-08-29 11:41:50,509 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Using '/flink/default' as Zookeeper namespace. 
2018-08-29 11:41:50,568 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_181 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA> 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux 2018-08-29 11:41:50,577 INFO 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64 2018-08-29 11:41:50,577 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.4.0-1027-gke 2018-08-29 11:41:50,578 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.name=flink 2018-08-29 11:41:50,578 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.home=/opt/flink 2018-08-29 11:41:50,578 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/opt/flink 2018-08-29 11:41:50,578 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628 2018-08-29 11:41:50,605 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a 2018-08-29 11:41:50,605 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 
2018-08-29 11:41:50,607 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2018-08-29 11:41:50,609 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2018-08-29 11:41:50,627 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184 2018-08-29 11:41:50,674 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address' 2018-08-29 11:41:50,675 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available. 2018-08-29 11:41:50,676 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /opt/flink/upload/flink-web-upload for file uploads. 2018-08-29 11:41:50,679 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint. 2018-08-29 11:41:50,995 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set. 2018-08-29 11:41:50,995 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'. 2018-08-29 11:41:51,071 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at flink-jobmanager-1:8081 2018-08-29 11:41:51,071 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}. 2018-08-29 11:41:51,091 WARN org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead. 
2018-08-29 11:41:51,101 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://flink-jobmanager-1:8081. 2018-08-29 11:41:51,114 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager . 2018-08-29 11:41:51,141 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd 2018-08-29 11:41:51,214 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher . 2018-08-29 11:41:51,230 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}. 2018-08-29 11:41:51,232 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. 2018-08-29 11:41:51,234 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}. 2018-08-29 11:41:51,235 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock. 2018-08-29 11:41:51,253 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49 2018-08-29 11:41:51,254 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager. 
2018-08-29 11:41:51,263 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.

*Starting Jobmanager-2:*
Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server blob.server.port: 6124 blob.server.port: 6124 query.server.port: 6125 Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9. 2018-08-29 11:41:51,688 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -------------------------------------------------------------------------------- 2018-08-29 11:41:51,690 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT) 2018-08-29 11:41:51,690 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink 2018-08-29 11:41:52,018 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: flink 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 6702 MiBytes 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /docker-java-home/jre 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.7.5 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options: 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: 2018-08-29 11:41:52,091 
INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink/conf 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --executionMode 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -------------------------------------------------------------------------------- 2018-08-29 11:41:52,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT] 2018-08-29 11:41:52,103 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2 2018-08-29 11:41:52,103 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2018-08-29 11:41:52,103 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 8192 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 8192 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration 
property: high-availability, zookeeper 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181 2018-08-29 11:41:52,104 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.path.root, /flink 2018-08-29 11:41:52,105 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.jobmanager.port, 50010 2018-08-29 11:41:52,105 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2018-08-29 11:41:52,105 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints 2018-08-29 11:41:52,105 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints 2018-08-29 11:41:52,105 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.incremental, false 2018-08-29 11:41:52,106 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020 2018-08-29 11:41:52,106 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081 2018-08-29 11:41:52,106 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.upload.dir, /opt/flink/upload 2018-08-29 11:41:52,106 INFO 
org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125 2018-08-29 11:41:52,106 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4 2018-08-29 11:41:52,107 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces. 2018-08-29 11:41:52,107 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.storage.directory, /opt/flink/blob-server 2018-08-29 11:41:52,107 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124 2018-08-29 11:41:52,107 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124 2018-08-29 11:41:52,107 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125 2018-08-29 11:41:52,122 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint. 2018-08-29 11:41:52,123 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem. 2018-08-29 11:41:52,133 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install security context. 2018-08-29 11:41:52,173 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE) 2018-08-29 11:41:52,188 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services. 
2018-08-29 11:41:52,198 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at flink-jobmanager-2:50010 2018-08-29 11:41:52,753 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2018-08-29 11:41:52,822 INFO akka.remote.Remoting - Starting remoting 2018-08-29 11:41:53,038 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010] 2018-08-29 11:41:53,046 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010 2018-08-29 11:41:53,500 INFO org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob 2018-08-29 11:41:53,558 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections 2018-08-29 11:41:53,559 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Using '/flink/default' as Zookeeper namespace. 
2018-08-29 11:41:53,616 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting 2018-08-29 11:41:53,624 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_181 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA> 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux 2018-08-29 11:41:53,625 INFO 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.4.0-1027-gke 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.name=flink 2018-08-29 11:41:53,625 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.home=/opt/flink 2018-08-29 11:41:53,626 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/opt/flink 2018-08-29 11:41:53,626 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628 2018-08-29 11:41:53,644 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 
2018-08-29 11:41:53,646 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2018-08-29 11:41:53,647 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2018-08-29 11:41:53,667 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

*Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2*

2018-08-29 11:47:06,249 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

*Meanwhile Jobmanager-1 gets:*

*-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)*
2018-08-29 11:47:06,006 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.
*-loads of job execution info-*
2018-08-29 11:49:20,272 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a3dab0a0883c5f0f37943358d9104d79@akka.tcp://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.

-------------------

The result is:
HDFS has only a jobgraph and an empty default folder - everything else is cleared.
ZooKeeper still has the jobgraph that Jobmanager-1 claims to have removed in the last log.
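In case it helps with reading the two logs side by side, here is a rough sketch (Python, standard library only) of how the job graph messages can be cross-checked. The function names are my own invention, and the regexes simply assume the ZooKeeperSubmittedJobGraphStore message formats exactly as they appear in the logs above; other Flink versions may word them differently. It flags any job ID that one JobManager added and a different JobManager recovered, which is what Jobmanager-2 does here.

```python
import re
from collections import defaultdict

# Message formats assumed from the ZooKeeperSubmittedJobGraphStore lines above.
EVENT_PATTERNS = {
    "added": re.compile(r"Added SubmittedJobGraph\((\w+), "),
    "recovered": re.compile(r"Recovered SubmittedJobGraph\((\w+), "),
    "removed": re.compile(r"Removed job graph (\w+) from ZooKeeper"),
}


def job_graph_events(logs_by_jm):
    """Map job ID -> list of (jobmanager name, event), in no particular order."""
    events = defaultdict(list)
    for jm_name, log_text in logs_by_jm.items():
        for event, pattern in EVENT_PATTERNS.items():
            # finditer works whether the log is one entry per line or flattened.
            for match in pattern.finditer(log_text):
                events[match.group(1)].append((jm_name, event))
    return dict(events)


def suspicious_recoveries(events):
    """Job IDs that were recovered by a JobManager other than the one that added them."""
    flagged = []
    for job_id, seen in events.items():
        adders = {jm for jm, ev in seen if ev == "added"}
        recoverers = {jm for jm, ev in seen if ev == "recovered"}
        if adders and recoverers - adders:
            flagged.append(job_id)
    return sorted(flagged)


# Sample input: the relevant lines from the two logs in this mail.
JM1_LOG = (
    "2018-08-29 11:47:06,090 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - "
    "Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.\n"
    "2018-08-29 11:49:20,304 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - "
    "Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.\n"
)
JM2_LOG = (
    "2018-08-29 11:47:06,249 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - "
    "Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).\n"
)

if __name__ == "__main__":
    all_events = job_graph_events({"jobmanager-1": JM1_LOG, "jobmanager-2": JM2_LOG})
    for job_id in suspicious_recoveries(all_events):
        print("recovered by a JobManager that did not add it:", job_id)
```

This only inspects the log text; it does not query ZooKeeper or HDFS, so whether the job graph node is really gone still has to be checked there.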
On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Encho,
>
> it sounds strange that the standby JobManager tries to recover a submitted
> job graph. This should only happen if it has been granted leadership. Thus,
> it seems as if the standby JobManager thinks that it is also the leader.
> Could you maybe share the logs of the two JobManagers/ClusterEntrypoints
> with us?
>
> Running only a single JobManager/ClusterEntrypoint in HA mode via a
> Kubernetes Deployment should do the trick and there is nothing wrong with
> it.
>
> Cheers,
> Till
>
> On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <encho.mishi...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Since two job managers don't seem to be working for me I was thinking of
>> just using a single job manager in Kubernetes in HA mode with a deployment
>> ensuring its restart whenever it fails. Is this approach viable? The
>> High-Availability page mentions that you use only one job manager in a
>> YARN cluster but does not specify such an option for Kubernetes. Is there
>> anything that can go wrong with this approach?
>>
>> Thanks
>>
>> On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <encho.mishi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Unfortunately the thing I described does indeed happen every time. As
>>> mentioned in the first email, I am running on Kubernetes so certain things
>>> could be different compared to just a standalone cluster.
>>>
>>> Any ideas for workarounds are welcome, as this problem basically
>>> prevents me from using HA.
>>>
>>> Thanks,
>>> Encho
>>>
>>> On Wed, Aug 29, 2018 at 5:15 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Encho,
>>>>
>>>> From your description, I suspect that there are additional bugs.
>>>>
>>>> About your description:
>>>>
>>>> *- Start both job managers*
>>>> *- Start a batch job in JobManager 1 and let it finish*
>>>> *The jobgraphs in both Zookeeper and HDFS remained.*
>>>>
>>>> Does this necessarily happen every time?
>>>>
>>>> In the Standalone cluster, the problems we encountered were sporadic.
>>>>
>>>> Thanks, vino.
>>>>
>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 8:07 PM:
>>>>
>>>>> Hello Till,
>>>>>
>>>>> I spent a few more hours testing and looking at the logs and it seems
>>>>> like there's a more general problem here. While the two job managers are
>>>>> active neither of them can properly delete jobgraphs. The problem I
>>>>> described above comes from the fact that Kubernetes brings JobManager 1
>>>>> back up quickly after I manually kill it, so when I stop the job on
>>>>> JobManager 2 both are alive.
>>>>>
>>>>> I did a very simple test:
>>>>>
>>>>> - Start both job managers
>>>>> - Start a batch job in JobManager 1 and let it finish
>>>>> The jobgraphs in both Zookeeper and HDFS remained.
>>>>>
>>>>> On the other hand if we do:
>>>>>
>>>>> - Start only JobManager 1 (again in HA mode)
>>>>> - Start a batch job and let it finish
>>>>> The jobgraphs in both Zookeeper and HDFS are deleted fine.
>>>>>
>>>>> It seems like the standby manager still leaves some kind of lock on
>>>>> the jobgraphs. Do you think that's possible? Have you seen a similar
>>>>> problem?
>>>>> The only logs that appear on the standby manager while waiting are of
>>>>> the type:
>>>>>
>>>>> 2018-08-28 11:54:10,789 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).
>>>>>
>>>>> Note that this log appears on the standby jobmanager immediately when
>>>>> a new job is submitted to the active jobmanager.
>>>>> Also note that the blobs and checkpoints are cleared fine. The problem
>>>>> is only for jobgraphs both in ZooKeeper and HDFS.
>>>>>
>>>>> Trying to access the UI of the standby manager redirects to the active
>>>>> one, so it is not a problem of them not knowing who the leader is. Do you
>>>>> have any ideas?
>>>>>
>>>>> Thanks a lot,
>>>>> Encho
>>>>>
>>>>> On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Encho,
>>>>>>
>>>>>> thanks a lot for reporting this issue. The problem arises whenever
>>>>>> the old leader maintains the connection to ZooKeeper. If this is the
>>>>>> case, then ephemeral nodes which we create to protect against faulty
>>>>>> delete operations are not removed and consequently the new leader is
>>>>>> not able to delete the persisted job graph. So one thing to check is
>>>>>> whether the old JM still has an open connection to ZooKeeper. The next
>>>>>> thing to check is the session timeout of your ZooKeeper cluster. If you
>>>>>> stop the job within the session timeout, then it is also not guaranteed
>>>>>> that ZooKeeper has detected that the ephemeral nodes of the old JM must
>>>>>> be deleted. In order to understand this better it would be helpful if
>>>>>> you could tell us the timing of the different actions.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:17 AM vino yang <yanghua1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Encho,
>>>>>>>
>>>>>>> As a temporary workaround, you can determine whether a job has been
>>>>>>> cleaned up by monitoring its JobID under Zookeeper's "/jobgraph".
>>>>>>> Another option is to modify the source code and crudely switch the
>>>>>>> cleanup to a synchronous mode, but Flink needs to obtain the
>>>>>>> corresponding lock when it operates on the Zookeeper path, so this is
>>>>>>> dangerous and not recommended.
>>>>>>> I think maybe this problem can be solved in the next version. It
>>>>>>> depends on Till.
>>>>>>>
>>>>>>> Thanks, vino.
>>>>>>>
>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 1:17 PM:
>>>>>>>
>>>>>>>> Thank you very much for the info! Will keep track of the progress.
>>>>>>>>
>>>>>>>> In the meantime is there any viable workaround? It seems like HA
>>>>>>>> doesn't really work due to this bug.
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2018 at 4:52 AM vino yang <yanghua1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Some background on the implementation:
>>>>>>>>> Flink uses Zookeeper to store the JobGraph (the job's description
>>>>>>>>> and metadata) as a basis for job recovery.
>>>>>>>>> However, in previous implementations this information may not be
>>>>>>>>> properly cleaned up because it is deleted asynchronously by a
>>>>>>>>> background thread.
>>>>>>>>>
>>>>>>>>> Thanks, vino.
>>>>>>>>>
>>>>>>>>> vino yang <yanghua1...@gmail.com> wrote on Tue, Aug 28, 2018 at 9:49 AM:
>>>>>>>>>
>>>>>>>>>> Hi Encho,
>>>>>>>>>>
>>>>>>>>>> This is a problem already known to the Flink community. You can
>>>>>>>>>> track its progress through FLINK-10011 [1]; Till is currently
>>>>>>>>>> fixing this issue.
>>>>>>>>>>
>>>>>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-10011
>>>>>>>>>>
>>>>>>>>>> Thanks, vino.
>>>>>>>>>>
>>>>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Mon, Aug 27, 2018 at 10:13 PM:
>>>>>>>>>>
>>>>>>>>>>> I am running Flink 1.5.3 with two job managers and two task
>>>>>>>>>>> managers in Kubernetes along with HDFS and Zookeeper in
>>>>>>>>>>> high-availability mode.
>>>>>>>>>>>
>>>>>>>>>>> My problem occurs after the following actions:
>>>>>>>>>>> - Upload a .jar file to jobmanager-1
>>>>>>>>>>> - Run a streaming job from the jar on jobmanager-1
>>>>>>>>>>> - Wait for 1 or 2 checkpoints to succeed
>>>>>>>>>>> - Kill pod of jobmanager-1
>>>>>>>>>>> After a short delay, jobmanager-2 takes leadership and correctly
>>>>>>>>>>> restores the job and continues it
>>>>>>>>>>> - Stop job from jobmanager-2
>>>>>>>>>>>
>>>>>>>>>>> At this point all seems well, but the problem is that
>>>>>>>>>>> jobmanager-2 does not clean up anything that was left from
>>>>>>>>>>> jobmanager-1.
>>>>>>>>>>> This means that job graphs remain in both HDFS and Zookeeper,
>>>>>>>>>>> which later obstruct any work of both managers: after any reset
>>>>>>>>>>> they unsuccessfully try to restore a non-existent job and fail
>>>>>>>>>>> over and over again.
>>>>>>>>>>>
>>>>>>>>>>> I am quite certain that jobmanager-2 does not know about any of
>>>>>>>>>>> jobmanager-1’s files since the Zookeeper logs reveal that it tries
>>>>>>>>>>> to duplicate job folders:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:11:00,038 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:11:02,296 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>>
>>>>>>>>>>> Also jobmanager-2 attempts to delete the jobgraphs folder in
>>>>>>>>>>> Zookeeper when the job is stopped, but fails since there are
>>>>>>>>>>> leftover files in it from jobmanager-1:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:12:13,406 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>>
>>>>>>>>>>> I’ve noticed that when restoring the job, it seems like
>>>>>>>>>>> jobmanager-2 does not get anything more than the jobID, while it
>>>>>>>>>>> perhaps needs some metadata? Here is the log that seems suspicious
>>>>>>>>>>> to me:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:09:18,113 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).
>>>>>>>>>>>
>>>>>>>>>>> All other logs seem fine in jobmanager-2; it doesn’t seem to be
>>>>>>>>>>> aware that it’s overwriting anything or not deleting properly.
>>>>>>>>>>>
>>>>>>>>>>> My question is - what is the intended way for the job managers
>>>>>>>>>>> to correctly exchange metadata in HA mode and why is it not working
>>>>>>>>>>> for me?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>
>>>>>>>>>>