Hi Encho,

thanks for sending the first part of the logs. What I would actually be interested in are the complete logs, because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens, but to debug it the complete logs are necessary. It would be awesome if you could send them to me. Thanks a lot!
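In case it helps, the statements I am looking for can be grepped out of the full files once you have them. A minimal sketch — the two sample lines below are copied verbatim from the jobmanager-1 log you already posted, and on Kubernetes the real files would come from `kubectl logs <pod>`:

```shell
# Recreate a tiny jobmanager-1 log from two lines quoted earlier in this
# thread; in the real setup the file would come from something like
#   kubectl logs flink-jobmanager-1-f76fd4df8-ftwt9 > jobmanager-1.log
cat > jobmanager-1.log <<'EOF'
2018-08-29 11:41:51,253 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,263 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
EOF

# The interesting statements are the ones mentioning a leadership grant.
grep "granted leadership" jobmanager-1.log
```

Running the same grep over the complete jobmanager-2 log should show when (and how often) its dispatcher gained leadership.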
Cheers,
Till

On Wed, Aug 29, 2018 at 2:00 PM Encho Mishinev <encho.mishi...@gmail.com> wrote:

> Hi Till,
>
> I will use the approach with a k8s deployment and HA mode with a single
> job manager. Nonetheless, here are the logs I just produced by repeating
> the aforementioned experiment, hope they help in debugging:
>
> *- Starting Jobmanager-1:*
>
> Starting Job Manager
> sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
> config file:
> jobmanager.rpc.address: flink-jobmanager-1
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 8192
> taskmanager.heap.size: 8192
> taskmanager.numberOfTaskSlots: 4
> high-availability: zookeeper
> high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
> high-availability.zookeeper.quorum: zk-cs:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.jobmanager.port: 50010
> state.backend: filesystem
> state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
> state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
> state.backend.incremental: false
> fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
> rest.port: 8081
> web.upload.dir: /opt/flink/upload
> query.server.port: 6125
> taskmanager.numberOfTaskSlots: 4
> classloader.parent-first-patterns.additional: org.apache.xerces.
> blob.storage.directory: /opt/flink/blob-server
> blob.server.port: 6124
> blob.server.port: 6124
> query.server.port: 6125
> Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
> 2018-08-29 11:41:48,806 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > -------------------------------------------------------------------------------- > 2018-08-29 11:41:48,807 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting > StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, > Date:16.08.2018 @ 06:39:50 GMT) > 2018-08-29 11:41:48,807 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current > user: flink > 2018-08-29 11:41:49,134 WARN org.apache.hadoop.util.NativeCodeLoader > - Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable > 2018-08-29 11:41:49,210 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current > Hadoop/Kerberos user: flink > 2018-08-29 11:41:49,210 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: > OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13 > 2018-08-29 11:41:49,210 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum > heap size: 6702 MiBytes > 2018-08-29 11:41:49,210 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: > /docker-java-home/jre > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop > version: 2.7.5 > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM > Options: > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program > Arguments: > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > --configDir > 2018-08-29 11:41:49,213 INFO > 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > /opt/flink/conf > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > --executionMode > 2018-08-29 11:41:49,213 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster > 2018-08-29 11:41:49,214 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host > 2018-08-29 11:41:49,214 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster > 2018-08-29 11:41:49,214 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: > /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: > 2018-08-29 11:41:49,214 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > -------------------------------------------------------------------------------- > 2018-08-29 11:41:49,215 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered > UNIX signal handlers for [TERM, HUP, INT] > 2018-08-29 11:41:49,221 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.address, flink-jobmanager-1 > 2018-08-29 11:41:49,221 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2018-08-29 11:41:49,221 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.size, 8192 > 2018-08-29 11:41:49,221 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.heap.size, 8192 > 2018-08-29 11:41:49,221 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 4 > 2018-08-29 11:41:49,222 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability, 
zookeeper > 2018-08-29 11:41:49,222 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.storageDir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability > 2018-08-29 11:41:49,222 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.zookeeper.quorum, zk-cs:2181 > 2018-08-29 11:41:49,222 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.zookeeper.path.root, /flink > 2018-08-29 11:41:49,223 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.jobmanager.port, 50010 > 2018-08-29 11:41:49,223 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2018-08-29 11:41:49,223 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.checkpoints.dir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints > 2018-08-29 11:41:49,223 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.savepoints.dir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints > 2018-08-29 11:41:49,223 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.incremental, false > 2018-08-29 11:41:49,224 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: fs.default-scheme, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020 > 2018-08-29 11:41:49,224 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: rest.port, 8081 > 2018-08-29 11:41:49,224 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: web.upload.dir, 
/opt/flink/upload > 2018-08-29 11:41:49,224 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: query.server.port, 6125 > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 4 > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: classloader.parent-first-patterns.additional, > org.apache.xerces. > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.storage.directory, /opt/flink/blob-server > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.server.port, 6124 > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.server.port, 6124 > 2018-08-29 11:41:49,225 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: query.server.port, 6125 > 2018-08-29 11:41:49,239 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting > StandaloneSessionClusterEntrypoint. > 2018-08-29 11:41:49,239 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install > default filesystem. > 2018-08-29 11:41:49,250 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install > security context. > 2018-08-29 11:41:49,282 INFO > org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user > set to flink (auth:SIMPLE) > 2018-08-29 11:41:49,298 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > Initializing cluster services. 
> 2018-08-29 11:41:49,309 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to > start actor system at flink-jobmanager-1:50010 > 2018-08-29 11:41:49,768 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 2018-08-29 11:41:49,823 INFO akka.remote.Remoting > - Starting remoting > 2018-08-29 11:41:49,974 INFO akka.remote.Remoting > - Remoting started; listening on addresses > :[akka.tcp://flink@flink-jobmanager-1:50010] > 2018-08-29 11:41:49,981 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor > system started at akka.tcp://flink@flink-jobmanager-1:50010 > 2018-08-29 11:41:50,444 INFO > org.apache.flink.runtime.blob.FileSystemBlobStore - Creating > highly available BLOB storage directory at > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob > 2018-08-29 11:41:50,509 INFO > org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing > default ACL for ZK connections > 2018-08-29 11:41:50,509 INFO > org.apache.flink.runtime.util.ZooKeeperUtils - Using > '/flink/default' as Zookeeper namespace. 
> 2018-08-29 11:41:50,568 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Starting > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, > built on 03/23/2017 10:13 GMT > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9 > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.version=1.8.0_181 > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.vendor=Oracle Corporation > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.io.tmpdir=/tmp > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.compiler=<NA> > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > 
environment:os.name=Linux > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.arch=amd64 > 2018-08-29 11:41:50,577 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.version=4.4.0-1027-gke > 2018-08-29 11:41:50,578 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.name=flink > 2018-08-29 11:41:50,578 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.home=/opt/flink > 2018-08-29 11:41:50,578 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.dir=/opt/flink > 2018-08-29 11:41:50,578 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - > Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628 > 2018-08-29 11:41:50,605 INFO org.apache.flink.runtime.blob.BlobServer > - Created BLOB server storage directory > /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a > 2018-08-29 11:41:50,605 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. 
> 2018-08-29 11:41:50,607 INFO org.apache.flink.runtime.blob.BlobServer > - Started BLOB server at 0.0.0.0:6124 - max concurrent > requests: 50 - max backlog: 1000 > 2018-08-29 11:41:50,607 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Opening socket connection to server zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181 > 2018-08-29 11:41:50,608 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > 2018-08-29 11:41:50,609 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181, initiating session > 2018-08-29 11:41:50,618 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Session establishment complete on server zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = > 40000 > 2018-08-29 11:41:50,619 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: CONNECTED > 2018-08-29 11:41:50,627 INFO > org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics > reporter configured, no metrics will be exposed/reported. > 2018-08-29 11:41:50,633 INFO > org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - > Initializing FileArchivedExecutionGraphStore: Storage directory > /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration > time 3600000, maximum cache size 52428800 bytes. 
> 2018-08-29 11:41:50,659 INFO > org.apache.flink.runtime.blob.TransientBlobCache - Created > BLOB cache storage directory > /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184 > 2018-08-29 11:41:50,674 WARN > org.apache.flink.configuration.Configuration - Config uses > deprecated configuration key 'jobmanager.rpc.address' instead of proper key > 'rest.address' > 2018-08-29 11:41:50,675 WARN > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload > directory /opt/flink/upload/flink-web-upload does not exist, or has been > deleted externally. Previously uploaded files are no longer available. > 2018-08-29 11:41:50,676 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created > directory /opt/flink/upload/flink-web-upload for file uploads. > 2018-08-29 11:41:50,679 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting > rest endpoint. > 2018-08-29 11:41:50,995 WARN > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file > environment variable 'log.file' is not set. > 2018-08-29 11:41:50,995 WARN > org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager > log files are unavailable in the web dashboard. Log file location not found > in environment variable 'log.file' or configuration key 'Key: > 'web.log.path' , default: null (deprecated keys: > [jobmanager.web.log.path])'. > 2018-08-29 11:41:51,071 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest > endpoint listening at flink-jobmanager-1:8081 > 2018-08-29 11:41:51,071 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}. > 2018-08-29 11:41:51,091 WARN > org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths - The > version of ZooKeeper being used doesn't support Container nodes. > CreateMode.PERSISTENT will be used instead. 
> 2018-08-29 11:41:51,101 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web > frontend listening at http://flink-jobmanager-1:8081. > 2018-08-29 11:41:51,114 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting > RPC endpoint for > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at > akka://flink/user/resourcemanager . > 2018-08-29 11:41:51,141 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - > http://flink-jobmanager-1:8081 was granted leadership with > leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd > 2018-08-29 11:41:51,214 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting > RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher > at akka://flink/user/dispatcher . > 2018-08-29 11:41:51,230 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}. > 2018-08-29 11:41:51,232 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. > 2018-08-29 11:41:51,234 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}. > 2018-08-29 11:41:51,235 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock. > 2018-08-29 11:41:51,253 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - > ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager > was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49 > 2018-08-29 11:41:51,254 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - > Starting the SlotManager. 
> 2018-08-29 11:41:51,263 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
> 2018-08-29 11:41:51,263 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
> 2018-08-29 11:41:51,468 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
> 2018-08-29 11:41:51,471 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.
>
> *Starting Jobmanager-2:*
>
> Starting Job Manager
> sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
> config file:
> jobmanager.rpc.address: flink-jobmanager-2
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 8192
> taskmanager.heap.size: 8192
> taskmanager.numberOfTaskSlots: 4
> high-availability: zookeeper
> high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
> high-availability.zookeeper.quorum: zk-cs:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.jobmanager.port: 50010
> state.backend: filesystem
> state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
> state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
> state.backend.incremental: false
> fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
> rest.port: 8081
> web.upload.dir: /opt/flink/upload
> query.server.port: 6125
> taskmanager.numberOfTaskSlots: 4
> classloader.parent-first-patterns.additional: org.apache.xerces.
> blob.storage.directory: /opt/flink/blob-server
> blob.server.port: 6124
> blob.server.port: 6124
> query.server.port: 6125
> Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
> 2018-08-29 11:41:51,688 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
> 2018-08-29 11:41:51,690 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
> 2018-08-29 11:41:51,690 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
> 2018-08-29 11:41:52,018 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: flink
> 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
> 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 6702 MiBytes
> 2018-08-29 11:41:52,088 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /docker-java-home/jre
> 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.7.5
> 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
> 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> 2018-08-29 11:41:52,091 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> 2018-08-29 11:41:52,091 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program > Arguments: > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > --configDir > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > /opt/flink/conf > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > --executionMode > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: > /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: > 2018-08-29 11:41:52,091 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > -------------------------------------------------------------------------------- > 2018-08-29 11:41:52,092 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered > UNIX signal handlers for [TERM, HUP, INT] > 2018-08-29 11:41:52,103 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.address, flink-jobmanager-2 > 2018-08-29 11:41:52,103 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2018-08-29 11:41:52,103 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.size, 8192 > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.heap.size, 8192 > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration 
- Loading > configuration property: taskmanager.numberOfTaskSlots, 4 > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability, zookeeper > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.storageDir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.zookeeper.quorum, zk-cs:2181 > 2018-08-29 11:41:52,104 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.zookeeper.path.root, /flink > 2018-08-29 11:41:52,105 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: high-availability.jobmanager.port, 50010 > 2018-08-29 11:41:52,105 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2018-08-29 11:41:52,105 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.checkpoints.dir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints > 2018-08-29 11:41:52,105 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.savepoints.dir, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints > 2018-08-29 11:41:52,105 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.incremental, false > 2018-08-29 11:41:52,106 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: fs.default-scheme, > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020 > 2018-08-29 11:41:52,106 INFO > 
org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: rest.port, 8081 > 2018-08-29 11:41:52,106 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: web.upload.dir, /opt/flink/upload > 2018-08-29 11:41:52,106 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: query.server.port, 6125 > 2018-08-29 11:41:52,106 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 4 > 2018-08-29 11:41:52,107 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: classloader.parent-first-patterns.additional, > org.apache.xerces. > 2018-08-29 11:41:52,107 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.storage.directory, /opt/flink/blob-server > 2018-08-29 11:41:52,107 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.server.port, 6124 > 2018-08-29 11:41:52,107 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: blob.server.port, 6124 > 2018-08-29 11:41:52,107 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: query.server.port, 6125 > 2018-08-29 11:41:52,122 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting > StandaloneSessionClusterEntrypoint. > 2018-08-29 11:41:52,123 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install > default filesystem. > 2018-08-29 11:41:52,133 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install > security context. > 2018-08-29 11:41:52,173 INFO > org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user > set to flink (auth:SIMPLE) > 2018-08-29 11:41:52,188 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - > Initializing cluster services. 
> 2018-08-29 11:41:52,198 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to > start actor system at flink-jobmanager-2:50010 > 2018-08-29 11:41:52,753 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 2018-08-29 11:41:52,822 INFO akka.remote.Remoting > - Starting remoting > 2018-08-29 11:41:53,038 INFO akka.remote.Remoting > - Remoting started; listening on addresses > :[akka.tcp://flink@flink-jobmanager-2:50010] > 2018-08-29 11:41:53,046 INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor > system started at akka.tcp://flink@flink-jobmanager-2:50010 > 2018-08-29 11:41:53,500 INFO > org.apache.flink.runtime.blob.FileSystemBlobStore - Creating > highly available BLOB storage directory at > hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob > 2018-08-29 11:41:53,558 INFO > org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing > default ACL for ZK connections > 2018-08-29 11:41:53,559 INFO > org.apache.flink.runtime.util.ZooKeeperUtils - Using > '/flink/default' as Zookeeper namespace. 
> 2018-08-29 11:41:53,616 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Starting > 2018-08-29 11:41:53,624 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, > built on 03/23/2017 10:13 GMT > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9 > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.version=1.8.0_181 > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.vendor=Oracle Corporation > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar::: > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.io.tmpdir=/tmp > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.compiler=<NA> > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > 
environment:os.name=Linux > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.arch=amd64 > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.version=4.4.0-1027-gke > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.name=flink > 2018-08-29 11:41:53,625 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.home=/opt/flink > 2018-08-29 11:41:53,626 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.dir=/opt/flink > 2018-08-29 11:41:53,626 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - > Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628 > 2018-08-29 11:41:53,644 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. 
> 2018-08-29 11:41:53,646 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Opening socket connection to server zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181 > 2018-08-29 11:41:53,646 INFO org.apache.flink.runtime.blob.BlobServer > - Created BLOB server storage directory > /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade > 2018-08-29 11:41:53,646 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > 2018-08-29 11:41:53,647 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181, initiating session > 2018-08-29 11:41:53,649 INFO org.apache.flink.runtime.blob.BlobServer > - Started BLOB server at 0.0.0.0:6124 - max concurrent > requests: 50 - max backlog: 1000 > 2018-08-29 11:41:53,655 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Session establishment complete on server zk-cs.default.svc.cluster.local/ > 10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = > 40000 > 2018-08-29 11:41:53,656 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: CONNECTED > 2018-08-29 11:41:53,667 INFO > org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics > reporter configured, no metrics will be exposed/reported. > 2018-08-29 11:41:53,673 INFO > org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - > Initializing FileArchivedExecutionGraphStore: Storage directory > /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration > time 3600000, maximum cache size 52428800 bytes. 
> 2018-08-29 11:41:53,699 INFO > org.apache.flink.runtime.blob.TransientBlobCache - Created > BLOB cache storage directory > /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0 > 2018-08-29 11:41:53,717 WARN > org.apache.flink.configuration.Configuration - Config uses > deprecated configuration key 'jobmanager.rpc.address' instead of proper key > 'rest.address' > 2018-08-29 11:41:53,718 WARN > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload > directory /opt/flink/upload/flink-web-upload does not exist, or has been > deleted externally. Previously uploaded files are no longer available. > 2018-08-29 11:41:53,719 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created > directory /opt/flink/upload/flink-web-upload for file uploads. > 2018-08-29 11:41:53,722 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting > rest endpoint. > 2018-08-29 11:41:54,084 WARN > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file > environment variable 'log.file' is not set. > 2018-08-29 11:41:54,084 WARN > org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager > log files are unavailable in the web dashboard. Log file location not found > in environment variable 'log.file' or configuration key 'Key: > 'web.log.path' , default: null (deprecated keys: > [jobmanager.web.log.path])'. > 2018-08-29 11:41:54,160 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest > endpoint listening at flink-jobmanager-2:8081 > 2018-08-29 11:41:54,160 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}. > 2018-08-29 11:41:54,180 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web > frontend listening at http://flink-jobmanager-2:8081. 
> 2018-08-29 11:41:54,192 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting > RPC endpoint for > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at > akka://flink/user/resourcemanager . > 2018-08-29 11:41:54,273 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting > RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher > at akka://flink/user/dispatcher . > 2018-08-29 11:41:54,286 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}. > 2018-08-29 11:41:54,287 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. > 2018-08-29 11:41:54,289 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Starting ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}. > 2018-08-29 11:41:54,289 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock. > > *Upon submitting a batch job on Jobmanager-1, we immediately get this log > on Jobmanager-2* > 2018-08-29 11:47:06,249 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null). > > *Meanwhile Jobmanager-1 gets:* > *-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)* > > 2018-08-29 11:47:06,006 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Submitting > job d69b67e4d28a2d244b06d3f6d661bca1 > (sicassandrawriterbeam-flink-0829114703-7d95fabd). > 2018-08-29 11:47:06,090 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to > ZooKeeper. 
> > *-loads of job execution info-* > > 2018-08-29 11:49:20,272 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job > d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED. > 2018-08-29 11:49:20,286 INFO > org.apache.flink.runtime.jobmaster.JobMaster - Stopping > the JobMaster for job > sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1). > 2018-08-29 11:49:20,290 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. > 2018-08-29 11:49:20,292 INFO > org.apache.flink.runtime.jobmaster.JobMaster - Close > ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is > shutting down.. > 2018-08-29 11:49:20,292 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending > SlotPool. > 2018-08-29 11:49:20,293 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping > SlotPool. > 2018-08-29 11:49:20,293 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - > Disconnect job manager a3dab0a0883c5f0f37943358d9104d79 > @akka.tcp://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job > d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager. > 2018-08-29 11:49:20,293 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Stopping ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}. > 2018-08-29 11:49:20,304 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper. > > > ------------------- > > The result is: > HDFS has only a jobgraph and an empty default folder - everything else is > cleared > ZooKeeper has the jobgraph that Jobmanager-1 claims to have removed in the > last log still there. 
>
> On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Encho,
>>
>> It sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?
>>
>> Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick, and there is nothing wrong with it.
>>
>> Cheers,
>> Till
>>
>> On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <encho.mishi...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions using only one job manager in a YARN cluster but does not mention such an option for Kubernetes. Is there anything that can go wrong with this approach?
>>>
>>> Thanks
>>>
>>> On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <encho.mishi...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Unfortunately, the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes, so certain things could be different compared to a standalone cluster.
>>>>
>>>> Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.
>>>>
>>>> Thanks,
>>>> Encho
>>>>
>>>> On Wed, Aug 29, 2018 at 5:15 AM vino yang <yanghua1...@gmail.com> wrote:
>>>>
>>>>> Hi Encho,
>>>>>
>>>>> From your description, it sounds like there may be additional bugs.
>>>>>
>>>>> About your description:
>>>>>
>>>>> *- Start both job managers*
>>>>> *- Start a batch job in JobManager 1 and let it finish*
>>>>> *The jobgraphs in both Zookeeper and HDFS remained.*
>>>>>
>>>>> Does this necessarily happen every time?
>>>>>
>>>>> In the standalone cluster, the problems we encountered were sporadic.
>>>>>
>>>>> Thanks, vino.
>>>>>
>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 8:07 PM:
>>>>>
>>>>>> Hello Till,
>>>>>>
>>>>>> I spent a few more hours testing and looking at the logs, and it seems like there's a more general problem here. While the two job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes brings JobManager 1 back up quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.
>>>>>>
>>>>>> I did a very simple test:
>>>>>>
>>>>>> - Start both job managers
>>>>>> - Start a batch job in JobManager 1 and let it finish
>>>>>> The jobgraphs in both Zookeeper and HDFS remained.
>>>>>>
>>>>>> On the other hand, if we do:
>>>>>>
>>>>>> - Start only JobManager 1 (again in HA mode)
>>>>>> - Start a batch job and let it finish
>>>>>> The jobgraphs in both Zookeeper and HDFS are deleted fine.
>>>>>>
>>>>>> It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
>>>>>> The only logs that appear on the standby manager while waiting are of the type:
>>>>>>
>>>>>> 2018-08-28 11:54:10,789 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).
>>>>>>
>>>>>> Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
>>>>>> Also note that the blobs and checkpoints are cleared fine.
>>>>>> The problem is only for jobgraphs, both in ZooKeeper and HDFS.
>>>>>>
>>>>>> Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Encho
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Encho,
>>>>>>>
>>>>>>> Thanks a lot for reporting this issue. The problem arises whenever the old leader maintains its connection to ZooKeeper. If that is the case, then the ephemeral nodes which we create to protect against faulty delete operations are not removed, and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better, it would be helpful if you could tell us the timing of the different actions.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 8:17 AM vino yang <yanghua1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Encho,
>>>>>>>>
>>>>>>>> As a temporary workaround, you can determine whether a job has been cleaned up by monitoring its JobID under ZooKeeper's "/jobgraphs" path.
>>>>>>>> Another option is to modify the source code and change the cleanup from asynchronous to synchronous, but since Flink's operations on ZooKeeper paths need to acquire the corresponding lock, doing so is dangerous and not recommended.
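[Editor's note: Till's session-timeout point can be made concrete with a tiny sketch. This is illustrative only, not Flink or ZooKeeper code; the 40000 ms value is the negotiated session timeout visible in the jobmanager-2 log earlier in the thread, and the function name is invented.]

```python
# Illustrative sketch: ZooKeeper only removes a dead client's ephemeral
# nodes once its session has expired, i.e. after the session timeout.
# 40000 ms is the negotiated timeout from the jobmanager-2 log above.

NEGOTIATED_SESSION_TIMEOUT_MS = 40000

def ephemeral_nodes_guaranteed_gone(ms_since_old_jm_lost: int) -> bool:
    """The old JM's ephemeral lock nodes are only guaranteed to be gone
    once its ZooKeeper session has expired."""
    return ms_since_old_jm_lost > NEGOTIATED_SESSION_TIMEOUT_MS

# Stopping the job ~10 s after killing jobmanager-1 falls within the
# timeout, so the old lock nodes may still block job-graph deletion:
print(ephemeral_nodes_guaranteed_gone(10_000))  # False
print(ephemeral_nodes_guaranteed_gone(60_000))  # True
```

This is why the timing of the kill/stop actions matters for reproducing the leftover-jobgraph behaviour.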
>>>>>>>> I think this problem may be solved in the next version; it depends on Till.
>>>>>>>>
>>>>>>>> Thanks, vino.
>>>>>>>>
>>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 1:17 PM:
>>>>>>>>
>>>>>>>>> Thank you very much for the info! Will keep track of the progress.
>>>>>>>>>
>>>>>>>>> In the meantime, is there any viable workaround? It seems like HA doesn't really work due to this bug.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 28, 2018 at 4:52 AM vino yang <yanghua1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Regarding the implementation mechanism:
>>>>>>>>>> Flink uses ZooKeeper to store the JobGraph (the job's description and metadata) as the basis for job recovery.
>>>>>>>>>> However, the current implementation may fail to clean this information up properly because it is deleted asynchronously by a background thread.
>>>>>>>>>>
>>>>>>>>>> Thanks, vino.
>>>>>>>>>>
>>>>>>>>>> vino yang <yanghua1...@gmail.com> wrote on Tue, Aug 28, 2018 at 9:49 AM:
>>>>>>>>>>
>>>>>>>>>>> Hi Encho,
>>>>>>>>>>>
>>>>>>>>>>> This is a problem already known to the Flink community; you can track its progress through FLINK-10011 [1], and Till is currently fixing it.
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-10011
>>>>>>>>>>>
>>>>>>>>>>> Thanks, vino.
>>>>>>>>>>>
>>>>>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Mon, Aug 27, 2018 at 10:13 PM:
>>>>>>>>>>>
>>>>>>>>>>>> I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes, along with HDFS and Zookeeper, in high-availability mode.
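[Editor's note: the "deleted asynchronously by a background thread" point above can be sketched as follows. This is a toy model, not Flink's actual cleanup code; all names are invented. A fire-and-forget background delete swallows the "Directory not empty" failure, while a synchronous delete would surface it to the caller.]

```python
import queue
import threading

class NotEmptyError(Exception):
    """Stands in for ZooKeeper's 'Directory not empty' KeeperException."""

def delete_job_graph(lock_children):
    # Mimics ZooKeeper semantics: a znode that still has (lock) children
    # cannot be deleted.
    if lock_children:
        raise NotEmptyError("Directory not empty")

def cleanup_async(lock_children, failures):
    # Background-thread cleanup: the caller returns immediately and never
    # observes a failure, so a job graph blocked by a stale lock silently
    # remains in ZooKeeper.
    def run():
        try:
            delete_job_graph(lock_children)
        except NotEmptyError as err:
            failures.put(err)  # in the real system this only ends up in a log
    worker = threading.Thread(target=run)
    worker.start()
    return worker

def cleanup_sync(lock_children):
    # A synchronous variant propagates the error to the caller, which
    # could then retry or fail loudly instead of leaving the graph behind.
    delete_job_graph(lock_children)

failures = queue.Queue()
cleanup_async(["lock-from-old-jm"], failures).join()
print(failures.qsize())  # 1 -- the failure was swallowed, not raised
```

The synchronous variant is exactly what vino warns against changing carelessly, since the real delete path must also hold the corresponding ZooKeeper lock.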
>>>>>>>>>>>> >>>>>>>>>>>> My problem occurs after the following actions: >>>>>>>>>>>> - Upload a .jar file to jobmanager-1 >>>>>>>>>>>> - Run a streaming job from the jar on jobmanager-1 >>>>>>>>>>>> - Wait for 1 or 2 checkpoints to succeed >>>>>>>>>>>> - Kill pod of jobmanager-1 >>>>>>>>>>>> After a short delay, jobmanager-2 takes leadership and >>>>>>>>>>>> correctly restores the job and continues it >>>>>>>>>>>> - Stop job from jobmanager-2 >>>>>>>>>>>> >>>>>>>>>>>> At this point all seems well, but the problem is that >>>>>>>>>>>> jobmanager-2 does not clean up anything that was left from >>>>>>>>>>>> jobmanager-1. >>>>>>>>>>>> This means that both in HDFS and in Zookeeper remain job graphs, >>>>>>>>>>>> which >>>>>>>>>>>> later on obstruct any work of both managers as after any reset they >>>>>>>>>>>> unsuccessfully try to restore a non-existent job and fail over and >>>>>>>>>>>> over >>>>>>>>>>>> again. >>>>>>>>>>>> >>>>>>>>>>>> I am quite certain that jobmanager-2 does not know about any of >>>>>>>>>>>> jobmanager-1’s files since the Zookeeper logs reveal that it tries >>>>>>>>>>>> to >>>>>>>>>>>> duplicate job folders: >>>>>>>>>>>> >>>>>>>>>>>> 2018-08-27 13:11:00,038 [myid:] - INFO [ProcessThread(sid:0 >>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level >>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 >>>>>>>>>>>> type:create >>>>>>>>>>>> cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error >>>>>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 >>>>>>>>>>>> Error:KeeperErrorCode = NodeExists for >>>>>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 >>>>>>>>>>>> >>>>>>>>>>>> 2018-08-27 13:11:02,296 [myid:] - INFO [ProcessThread(sid:0 >>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level >>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 >>>>>>>>>>>> type:create >>>>>>>>>>>> 
cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error >>>>>>>>>>>> Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 >>>>>>>>>>>> Error:KeeperErrorCode = NodeExists for >>>>>>>>>>>> /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 >>>>>>>>>>>> >>>>>>>>>>>> Also jobmanager-2 attempts to delete the jobgraphs folder in >>>>>>>>>>>> Zookeeper when the job is stopped, but fails since there are >>>>>>>>>>>> leftover files >>>>>>>>>>>> in it from jobmanager-1: >>>>>>>>>>>> >>>>>>>>>>>> 2018-08-27 13:12:13,406 [myid:] - INFO [ProcessThread(sid:0 >>>>>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level >>>>>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 >>>>>>>>>>>> type:delete >>>>>>>>>>>> cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error >>>>>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 >>>>>>>>>>>> Error:KeeperErrorCode = Directory not empty for >>>>>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 >>>>>>>>>>>> >>>>>>>>>>>> I’ve noticed that when restoring the job, it seems like >>>>>>>>>>>> jobmanager-2 does not get anything more than jobID, while it >>>>>>>>>>>> perhaps needs >>>>>>>>>>>> some metadata? Here is the log that seems suspicious to me: >>>>>>>>>>>> >>>>>>>>>>>> 2018-08-27 13:09:18,113 INFO >>>>>>>>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore >>>>>>>>>>>> - >>>>>>>>>>>> Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, >>>>>>>>>>>> null). >>>>>>>>>>>> >>>>>>>>>>>> All other logs seem fine in jobmanager-2, it doesn’t seem to be >>>>>>>>>>>> aware that it’s overwriting anything or not deleting properly. >>>>>>>>>>>> >>>>>>>>>>>> My question is - what is the intended way for the job managers >>>>>>>>>>>> to correctly exchange metadata in HA mode and why is it not >>>>>>>>>>>> working for me? >>>>>>>>>>>> >>>>>>>>>>>> Thanks in advance! >>>>>>>>>>> >>>>>>>>>>>
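[Editor's note: the NodeExists and "Directory not empty" errors in the ZooKeeper logs quoted above can be reproduced with a toy in-memory model. This is a sketch only, not Flink's or ZooKeeper's actual code; the lock-node name is invented, while the jobgraph path is taken from the logs.]

```python
class NodeExistsError(Exception):
    pass

class NotEmptyError(Exception):
    pass

class MiniZk:
    """Tiny in-memory stand-in for a ZooKeeper namespace, just enough to
    reproduce the two KeeperExceptions seen in the logs above."""
    def __init__(self):
        self.paths = set()

    def create(self, path):
        if path in self.paths:
            raise NodeExistsError(path)       # "NodeExists" in the ZK log
        self.paths.add(path)

    def delete(self, path):
        if any(p.startswith(path + "/") for p in self.paths):
            raise NotEmptyError(path)         # "Directory not empty"
        self.paths.discard(path)

zk = MiniZk()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"

zk.create(job)                # jobmanager-1 stores the job graph
zk.create(job + "/lock-jm1")  # jobmanager-1's ephemeral lock (name made up)

# jobmanager-2 recovers the job and tries to register the same path:
try:
    zk.create(job)
except NodeExistsError:
    print("NodeExists")       # matches the first ZK log error

# jobmanager-2 stops the job and tries to delete the graph, but the stale
# lock from jobmanager-1 is still present:
try:
    zk.delete(job)
except NotEmptyError:
    print("Directory not empty")  # matches the second ZK log error

# Once the old session expires and the lock disappears, deletion succeeds:
zk.delete(job + "/lock-jm1")
zk.delete(job)
print(job in zk.paths)        # False
```

Under this reading, the metadata exchange Encho asks about is not the issue: the new leader knows the jobgraph path, but cannot delete it while the old leader's ephemeral lock child still exists.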