Re: JobGraphs not cleaned up in HA mode

Encho Mishinev Wed, 29 Aug 2018 05:00:12 -0700

Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job
manager. Nonetheless, here are the logs I just produced by repeating the
aforementioned experiment, hope they help in debugging:


*- Starting Jobmanager-1:*

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host
flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
--------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting
StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216,
Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current
user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader
                 - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current
Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM:
OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum
heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME:
/docker-java-home/jre
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop
version: 2.7.5
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
Options:
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program
Arguments:
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 --configDir
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 /opt/flink/conf
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 --executionMode
2018-08-29 11:41:49,213 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath:
/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
--------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered
UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.storageDir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.checkpoints.dir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.savepoints.dir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: fs.default-scheme,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: classloader.parent-first-patterns.additional,
org.apache.xerces.
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting
StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install
default filesystem.
2018-08-29 11:41:49,250 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install
security context.
2018-08-29 11:41:49,282 INFO
org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user
set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
Initializing cluster services.
2018-08-29 11:41:49,309 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to
start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger
                - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting
                - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting
                - Remoting started; listening on addresses
:[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor
system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO
org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating
highly available BLOB storage directory at
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils
                - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils
                - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
- Starting
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f,
built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.name=Linux
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.name=flink
2018-08-29 11:41:50,578 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer
                - Created BLOB server storage directory
/opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer
                - Started BLOB server at 0.0.0.0:6124 - max concurrent
requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server zk-cs.default.svc.cluster.local/
10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2018-08-29 11:41:50,609 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181,
initiating session
2018-08-29 11:41:50,618 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server zk-cs.default.svc.cluster.local/
10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout =
40000
2018-08-29 11:41:50,619 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: CONNECTED
2018-08-29 11:41:50,627 INFO
org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics
reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO
org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  -
Initializing FileArchivedExecutionGraphStore: Storage directory
/tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration
time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO
org.apache.flink.runtime.blob.TransientBlobCache              - Created
BLOB cache storage directory
/opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration
                - Config uses deprecated configuration key
'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload
directory /opt/flink/upload/flink-web-upload does not exist, or has been
deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created
directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting
rest endpoint.
2018-08-29 11:41:50,995 WARN
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file
environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager
log files are unavailable in the web dashboard. Log file location not found
in environment variable 'log.file' or configuration key 'Key:
'web.log.path' , default: null (deprecated keys:
[jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN
org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The
version of ZooKeeper being used doesn't support Container nodes.
CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
RPC endpoint for
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    -
http://flink-jobmanager-1:8081 was granted leadership with
leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager
was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
Starting the SlotManager.
2018-08-29 11:41:51,263 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher
akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted
leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering
all persisted jobs.
2018-08-29 11:41:51,468 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under
1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
Registering TaskManager 104d18b72fed054620e58e120a1ea083 under
e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.

*Starting Jobmanager-2:*

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme:
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host
flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
--------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting
StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216,
Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current
user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader
                 - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current
Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM:
OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum
heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME:
/docker-java-home/jre
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop
version: 2.7.5
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
Options:
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program
Arguments:
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 --configDir
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 /opt/flink/conf
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
 --executionMode
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath:
/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
--------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered
UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.storageDir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.checkpoints.dir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.savepoints.dir,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: fs.default-scheme,
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: classloader.parent-first-patterns.additional,
org.apache.xerces.
2018-08-29 11:41:52,107 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO
org.apache.flink.configuration.GlobalConfiguration            - Loading
configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting
StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install
default filesystem.
2018-08-29 11:41:52,133 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install
security context.
2018-08-29 11:41:52,173 INFO
org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user
set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
Initializing cluster services.
2018-08-29 11:41:52,198 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to
start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger
                - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting
                - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting
                - Remoting started; listening on addresses
:[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor
system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO
org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating
highly available BLOB storage directory at
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils
                - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils
                - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
- Starting
2018-08-29 11:41:53,624 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f,
built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.name=Linux
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.name=flink
2018-08-29 11:41:53,625 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
2018-08-29 11:41:53,646 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server zk-cs.default.svc.cluster.local/
10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer
                - Created BLOB server storage directory
/opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2018-08-29 11:41:53,647 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181,
initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer
                - Started BLOB server at 0.0.0.0:6124 - max concurrent
requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server zk-cs.default.svc.cluster.local/
10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout =
40000
2018-08-29 11:41:53,656 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: CONNECTED
2018-08-29 11:41:53,667 INFO
org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics
reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO
org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  -
Initializing FileArchivedExecutionGraphStore: Storage directory
/tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration
time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO
org.apache.flink.runtime.blob.TransientBlobCache              - Created
BLOB cache storage directory
/opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration
                - Config uses deprecated configuration key
'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload
directory /opt/flink/upload/flink-web-upload does not exist, or has been
deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created
directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting
rest endpoint.
2018-08-29 11:41:54,084 WARN
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file
environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager
log files are unavailable in the web dashboard. Log file location not found
in environment variable 'log.file' or configuration key 'Key:
'web.log.path' , default: null (deprecated keys:
[jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
RPC endpoint for
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

*Upon submitting a batch job on Jobmanager-1, we immediately get this log
on Jobmanager-2*
2018-08-29 11:47:06,249 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

*Meanwhile Jobmanager-1 gets:*
*-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)*

2018-08-29 11:47:06,006 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting
job d69b67e4d28a2d244b06d3f6d661bca1
(sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to
ZooKeeper.

*-loads of job execution info-*

2018-08-29 11:49:20,272 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job
d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Stopping the JobMaster for job
sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Close ResourceManager connection
827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
SlotPool.
2018-08-29 11:49:20,293 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
SlotPool.
2018-08-29 11:49:20,293 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Disconnect job manager a3dab0a0883c5f0f37943358d9104d79
@akka.tcp://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job
d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS has only a jobgraph and an empty default folder - everything else is
cleared
ZooKeeper has the jobgraph that Jobmanager-1 claims to have removed in the
last log still there.

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Encho,
>
> it sounds strange that the standby JobManager tries to recover a submitted
> job graph. This should only happen if it has been granted leadership. Thus,
> it seems as if the standby JobManager thinks that it is also the leader.
> Could you maybe share the logs of the two JobManagers/ClusterEntrypoints
> with us?
>
> Running only a single JobManager/ClusterEntrypoint in HA mode via a
> Kubernetes Deployment should do the trick and there is nothing wrong with
> it.
>
> Cheers,
> Till
>
> On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <encho.mishi...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Since two job managers don't seem to be working for me I was thinking of
>> just using a single job manager in Kubernetes in HA mode with a deployment
>> ensuring its restart whenever it fails. Is this approach viable? The
>> High-Availability page mentions that you use only one job manager in an
>> YARN cluster but does not specify such option for Kubernetes. Is there
>> anything that can go wrong with this approach?
>>
>> Thanks
>>
>> On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <encho.mishi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Unfortunately the thing I described does indeed happen every time. As
>>> mentioned in the first email, I am running on Kubernetes so certain things
>>> could be different compared to just a standalone cluster.
>>>
>>> Any ideas for workarounds are welcome, as this problem basically
>>> prevents me from using HA.
>>>
>>> Thanks,
>>> Encho
>>>
>>> On Wed, Aug 29, 2018 at 5:15 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Encho,
>>>>
>>>> From your description, I feel that there are extra bugs.
>>>>
>>>> About your description:
>>>>
>>>> *- Start both job managers*
>>>> *- Start a batch job in JobManager 1 and let it finish*
>>>> *The jobgraphs in both Zookeeper and HDFS remained.*
>>>>
>>>> Is it necessarily happening every time?
>>>>
>>>> In the Standalone cluster, the problems we encountered were sporadic.
>>>>
>>>> Thanks, vino.
>>>>
>>>> Encho Mishinev <encho.mishi...@gmail.com> 于2018年8月28日周二 下午8:07写道：
>>>>
>>>>> Hello Till,
>>>>>
>>>>> I spend a few more hours testing and looking at the logs and it seems
>>>>> like there's a more general problem here. While the two job managers are
>>>>> active neither of them can properly delete jobgraphs. The above problem I
>>>>> described comes from the fact that Kubernetes gets JobManager 1 quickly
>>>>> after I manually kill it, so when I stop the job on JobManager 2 both are
>>>>> alive.
>>>>>
>>>>> I did a very simple test:
>>>>>
>>>>> - Start both job managers
>>>>> - Start a batch job in JobManager 1 and let it finish
>>>>> The jobgraphs in both Zookeeper and HDFS remained.
>>>>>
>>>>> On the other hand if we do:
>>>>>
>>>>> - Start only JobManager 1 (again in HA mode)
>>>>> - Start a batch job and let it finish
>>>>> The jobgraphs in both Zookeeper and HDFS are deleted fine.
>>>>>
>>>>> It seems like the standby manager still leaves some kind of lock on
>>>>> the jobgraphs. Do you think that's possible? Have you seen a similar
>>>>> problem?
>>>>> The only logs that appear on the standby manager while waiting are of
>>>>> the type:
>>>>>
>>>>> 2018-08-28 11:54:10,789 INFO
>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>>> Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).
>>>>>
>>>>> Note that this log appears on the standby jobmanager immediately when
>>>>> a new job is submitted to the active jobmanager.
>>>>> Also note that the blobs and checkpoints are cleared fine. The problem
>>>>> is only for jobgraphs both in ZooKeeper and HDFS.
>>>>>
>>>>> Trying to access the UI of the standby manager redirects to the active
>>>>> one, so it is not a problem of them not knowing who the leader is. Do you
>>>>> have any ideas?
>>>>>
>>>>> Thanks a lot,
>>>>> Encho
>>>>>
>>>>> On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Encho,
>>>>>>
>>>>>> thanks a lot for reporting this issue. The problem arises whenever
>>>>>> the old leader maintains the connection to ZooKeeper. If this is the 
>>>>>> case,
>>>>>> then ephemeral nodes which we create to protect against faulty delete
>>>>>> operations are not removed and consequently the new leader is not able to
>>>>>> delete the persisted job graph. So one thing to check is whether the old 
>>>>>> JM
>>>>>> still has an open connection to ZooKeeper. The next thing to check is the
>>>>>> session timeout of your ZooKeeper cluster. If you stop the job within the
>>>>>> session timeout, then it is also not guaranteed that ZooKeeper has 
>>>>>> detected
>>>>>> that the ephemeral nodes of the old JM must be deleted. In order to
>>>>>> understand this better it would be helpful if you could tell us the 
>>>>>> timing
>>>>>> of the different actions.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:17 AM vino yang <yanghua1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Encho,
>>>>>>>
>>>>>>> A temporary solution can be used to determine if it has been cleaned
>>>>>>> up by monitoring the specific JobID under Zookeeper's "/jobgraph".
>>>>>>> Another solution, modify the source code, rudely modify the cleanup
>>>>>>> mode to the synchronous form, but the flink operation Zookeeper's path
>>>>>>> needs to obtain the corresponding lock, so it is dangerous to do so, 
>>>>>>> and it
>>>>>>> is not recommended.
>>>>>>> I think maybe this problem can be solved in the next version. It
>>>>>>> depends on Till.
>>>>>>>
>>>>>>> Thanks, vino.
>>>>>>>
>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> 于2018年8月28日周二 下午1:17写道：
>>>>>>>
>>>>>>>> Thank you very much for the info! Will keep track of the progress.
>>>>>>>>
>>>>>>>> In the meantime is there any viable workaround? It seems like HA
>>>>>>>> doesn't really work due to this bug.
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2018 at 4:52 AM vino yang <yanghua1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> About some implementation mechanisms.
>>>>>>>>> Flink uses Zookeeper to store JobGraph (Job's description
>>>>>>>>> information and metadata) as a basis for Job recovery.
>>>>>>>>> However, previous implementations may cause this information to
>>>>>>>>> not be properly cleaned up because it is asynchronously deleted by a
>>>>>>>>> background thread.
>>>>>>>>>
>>>>>>>>> Thanks, vino.
>>>>>>>>>
>>>>>>>>> vino yang <yanghua1...@gmail.com> 于2018年8月28日周二 上午9:49写道：
>>>>>>>>>
>>>>>>>>>> Hi Encho,
>>>>>>>>>>
>>>>>>>>>> This is a problem already known to the Flink community, you can
>>>>>>>>>> track its progress through FLINK-10011[1], and currently Till is 
>>>>>>>>>> fixing
>>>>>>>>>> this issue.
>>>>>>>>>>
>>>>>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-10011
>>>>>>>>>>
>>>>>>>>>> Thanks, vino.
>>>>>>>>>>
>>>>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> 于2018年8月27日周一
>>>>>>>>>> 下午10:13写道：
>>>>>>>>>>
>>>>>>>>>>> I am running Flink 1.5.3 with two job managers and two task
>>>>>>>>>>> managers in Kubernetes along with HDFS and Zookeeper in 
>>>>>>>>>>> high-availability
>>>>>>>>>>> mode.
>>>>>>>>>>>
>>>>>>>>>>> My problem occurs after the following actions:
>>>>>>>>>>> - Upload a .jar file to jobmanager-1
>>>>>>>>>>> - Run a streaming job from the jar on jobmanager-1
>>>>>>>>>>> - Wait for 1 or 2 checkpoints to succeed
>>>>>>>>>>> - Kill pod of jobmanager-1
>>>>>>>>>>> After a short delay, jobmanager-2 takes leadership and correctly
>>>>>>>>>>> restores the job and continues it
>>>>>>>>>>> - Stop job from jobmanager-2
>>>>>>>>>>>
>>>>>>>>>>> At this point all seems well, but the problem is that
>>>>>>>>>>> jobmanager-2 does not clean up anything that was left from 
>>>>>>>>>>> jobmanager-1.
>>>>>>>>>>> This means that both in HDFS and in Zookeeper remain job graphs, 
>>>>>>>>>>> which
>>>>>>>>>>> later on obstruct any work of both managers as after any reset they
>>>>>>>>>>> unsuccessfully try to restore a non-existent job and fail over and 
>>>>>>>>>>> over
>>>>>>>>>>> again.
>>>>>>>>>>>
>>>>>>>>>>> I am quite certain that jobmanager-2 does not know about any of
>>>>>>>>>>> jobmanager-1’s files since the Zookeeper logs reveal that it tries 
>>>>>>>>>>> to
>>>>>>>>>>> duplicate job folders:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 
>>>>>>>>>>> type:create
>>>>>>>>>>> cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error
>>>>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 
>>>>>>>>>>> type:create
>>>>>>>>>>> cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error
>>>>>>>>>>> Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>>>>>> /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>>
>>>>>>>>>>> Also jobmanager-2 attempts to delete the jobgraphs folder in
>>>>>>>>>>> Zookeeper when the job is stopped, but fails since there are 
>>>>>>>>>>> leftover files
>>>>>>>>>>> in it from jobmanager-1:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 
>>>>>>>>>>> type:delete
>>>>>>>>>>> cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error
>>>>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>> Error:KeeperErrorCode = Directory not empty for
>>>>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>>>>
>>>>>>>>>>> I’ve noticed that when restoring the job, it seems like
>>>>>>>>>>> jobmanager-2 does not get anything more than jobID, while it 
>>>>>>>>>>> perhaps needs
>>>>>>>>>>> some metadata? Here is the log that seems suspicious to me:
>>>>>>>>>>>
>>>>>>>>>>> 2018-08-27 13:09:18,113 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore 
>>>>>>>>>>>  -
>>>>>>>>>>> Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).
>>>>>>>>>>>
>>>>>>>>>>> All other logs seem fine in jobmanager-2, it doesn’t seem to be
>>>>>>>>>>> aware that it’s overwriting anything or not deleting properly.
>>>>>>>>>>>
>>>>>>>>>>> My question is - what is the intended way for the job managers
>>>>>>>>>>> to correctly exchange metadata in HA mode and why is it not working 
>>>>>>>>>>> for me?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>
>>>>>>>>>>

Re: JobGraphs not cleaned up in HA mode

Reply via email to