Hi Piotr,

Jobmanager logs are attached to this email. The only thing that jumps out to me is this:

09/08/2021 09:02:26.240 -0400 ERROR org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
java.io.IOException: File already exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb

This happened days after the Flink update, and not just once: across all our Flink clusters I’ve seen this 3 times. The cause of the jobmanager leadership loss in this case was a deployment of our ZooKeeper cluster that led to a brief connection loss. The new leader election is expected.

Thanks,
Peter

From: Piotr Nowojski <pnowoj...@apache.org>
Date: Thursday, September 9, 2021 at 12:39 AM
To: Peter Westermann <no.westerm...@genesys.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Duplicate copies of job in Flink UI/API

Hi Peter,

Can you provide the relevant JobManager logs? And can you write down what steps you took before the failure happened? Did this failure occur while upgrading Flink, or after the upgrade?

Best,
Piotrek

On Wed, Sep 8, 2021 at 16:11, Peter Westermann <no.westerm...@genesys.com> wrote:

We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird behavior after a change in jobmanager leadership: we’re seeing two copies of the same job, one of which is in SUSPENDED state and has a start time of zero.
Here’s the output from the /jobs/overview endpoint:

{
  "jobs": [
    {
      "jid": "2db4ee6397151a1109d1ca05188a4cbb",
      "name": "analytics-flink-v1",
      "state": "RUNNING",
      "start-time": 1631106146284,
      "end-time": -1,
      "duration": 2954642,
      "last-modification": 1631106152322,
      "tasks": {
        "total": 112,
        "created": 0,
        "scheduled": 0,
        "deploying": 0,
        "running": 112,
        "finished": 0,
        "canceling": 0,
        "canceled": 0,
        "failed": 0,
        "reconciling": 0
      }
    },
    {
      "jid": "2db4ee6397151a1109d1ca05188a4cbb",
      "name": "analytics-flink-v1",
      "state": "SUSPENDED",
      "start-time": 0,
      "end-time": -1,
      "duration": 1631105900760,
      "last-modification": 0,
      "tasks": {
        "total": 0,
        "created": 0,
        "scheduled": 0,
        "deploying": 0,
        "running": 0,
        "finished": 0,
        "canceling": 0,
        "canceled": 0,
        "failed": 0,
        "reconciling": 0
      }
    }
  ]
}

Has anyone seen this behavior before?

Thanks,
Peter
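For anyone who wants to check their own clusters for the same symptom, here is a minimal sketch that flags job IDs listed more than once in a /jobs/overview payload. The function name find_duplicate_jobs and the trimmed sample payload are illustrative assumptions, not part of Flink; in practice the payload would be fetched from the JobManager REST endpoint.

```python
import json
from collections import defaultdict


def find_duplicate_jobs(overview):
    """Group /jobs/overview entries by job ID; return IDs listed more than once.

    The result maps each duplicated job ID to the list of states reported
    for it, in the order the entries appear in the payload.
    """
    states_by_jid = defaultdict(list)
    for job in overview.get("jobs", []):
        states_by_jid[job["jid"]].append(job["state"])
    return {jid: states for jid, states in states_by_jid.items() if len(states) > 1}


# Trimmed copy of the payload from the email above (only the fields used here).
sample = json.loads("""
{"jobs": [
  {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "RUNNING", "start-time": 1631106146284},
  {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "SUSPENDED", "start-time": 0}
]}
""")

duplicates = find_duplicate_jobs(sample)
print(duplicates)
```

A non-empty result means the same job ID is reported more than once, which is the condition described in this thread.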
09/08/2021 09:02:31.015 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id fbddb90b669081bd9907c835f1906a79. 09/08/2021 09:02:31.015 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id ee62d44923180b0ac66e10ed170f0af3. 09/08/2021 09:02:31.015 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id 13d0e72b41883dfb84e866645b07dc92. 09/08/2021 09:02:31.015 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id ea4064056de27327a2037f4d71aa9e5c. 09/08/2021 09:02:31.015 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id cfcc6014e93f09884ad5f61e4a108e8d. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id 3dd4d17bf50232c95f178d6a235f2dc1. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id 13a7257a9aedd2ce2ca5267f93586763. 
09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{f9b28f7479a60f286846fd9d5f1f4e8e}] and profile ResourceProfile{UNKNOWN} with allocation id fbddb90b669081bd9907c835f1906a79 from resource manager. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{1df0f0709f439fc647c995a36d8c60a7}] and profile ResourceProfile{UNKNOWN} with allocation id ee62d44923180b0ac66e10ed170f0af3 from resource manager. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{2a441ce02b709ffacdcc319d715239d8}] and profile ResourceProfile{UNKNOWN} with allocation id 13d0e72b41883dfb84e866645b07dc92 from resource manager. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{9e3606665fc9b08e63a1cb399863534b}] and profile ResourceProfile{UNKNOWN} with allocation id ea4064056de27327a2037f4d71aa9e5c from resource manager. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{802408aafe4e015591902c107cecef02}] and profile ResourceProfile{UNKNOWN} with allocation id cfcc6014e93f09884ad5f61e4a108e8d from resource manager. 09/08/2021 09:02:31.014 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{e0a0f561e532d36e64feafee9eae8100}] and profile ResourceProfile{UNKNOWN} with allocation id 3dd4d17bf50232c95f178d6a235f2dc1 from resource manager. 09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{4cfc987782b4aed47bb5d98394601db6}] and profile ResourceProfile{UNKNOWN} with allocation id 13a7257a9aedd2ce2ca5267f93586763 from resource manager. 
09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb with allocation id 2e742af203d1be847d06c643f3984b54. 09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot [SlotRequestId{0f25b099a7dd972a69c31532798e19ab}] and profile ResourceProfile{UNKNOWN} with allocation id 2e742af203d1be847d06c643f3984b54 from resource manager. 09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster JobManager successfully registered at ResourceManager, leader id: b0f998b265508bb4cd715749318a4ce4. 09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registered job manager 038f7d4cea297118551aed586a338a49://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4 for job 2db4ee6397151a1109d1ca05188a4cbb. 09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering job manager 038f7d4cea297118551aed586a338a49://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4 for job 2db4ee6397151a1109d1ca05188a4cbb. 09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}. 09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Resolved ResourceManager address, beginning registration 09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Connecting to ResourceManager akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/resourcemanager_0(b0f998b265508bb4cd715749318a4ce4) 09/08/2021 09:02:31.002 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. 
Adding as pending request [SlotRequestId{f9b28f7479a60f286846fd9d5f1f4e8e}] 09/08/2021 09:02:31.001 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{1df0f0709f439fc647c995a36d8c60a7}] 09/08/2021 09:02:31.001 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{2a441ce02b709ffacdcc319d715239d8}] 09/08/2021 09:02:31.001 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{9e3606665fc9b08e63a1cb399863534b}] 09/08/2021 09:02:31.001 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{802408aafe4e015591902c107cecef02}] 09/08/2021 09:02:31.000 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e0a0f561e532d36e64feafee9eae8100}] 09/08/2021 09:02:31.000 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4cfc987782b4aed47bb5d98394601db6}] 09/08/2021 09:02:31.000 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot request, no ResourceManager connected. 
Adding as pending request [SlotRequestId{0f25b099a7dd972a69c31532798e19ab}] 09/08/2021 09:02:30.997 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.PipelinedRegionSchedulingStrategy] 09/08/2021 09:02:30.996 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Starting execution of job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb) under job master id 9ed35c1fb07f037fed21d31d35cc4abf. 09/08/2021 09:02:30.996 -0400 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}. 09/08/2021 09:02:30.992 -0400 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl JobManager runner for job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb) was granted leadership with session id ed21d31d-35cc-4abf-9ed3-5c1fb07f037f at akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4. 09/08/2021 09:02:30.990 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartAllFailoverStrategy@7e7f0b53 for analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:30.346 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to retrieve checkpoint 318603. 09/08/2021 09:02:29.697 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to retrieve checkpoint 318602. 09/08/2021 09:02:28.861 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to retrieve checkpoint 318601. 09/08/2021 09:02:27.469 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to retrieve checkpoint 318600. 
09/08/2021 09:02:27.469 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to fetch 4 checkpoints from storage. 09/08/2021 09:02:27.469 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Found 4 checkpoints in ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}. 09/08/2021 09:02:27.454 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Recovering checkpoints from ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}. 09/08/2021 09:02:27.452 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Using application-defined state backend: RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 's3p://inin-prod-aps1-analytics/analytics-flink/analytics-flink-v1/3069/checkpoints/HASH', savepoints: 's3p://inin-prod-aps1-analytics/analytics-flink/savepoints/analytics-flink-v1', asynchronous: TRUE, fileStateThreshold: 1048576), localRocksDbDirectories=null, enableIncrementalCheckpointing=TRUE, numberOfTransferThreads=2, writeBatchSize=2097152} 09/08/2021 09:02:27.451 -0400 INFO org.apache.flink.contrib.streaming.state.RocksDBStateBackend Using application-defined options factory: AnalyticsRocksOptionsFactory [baseline=FLASH_SSD_OPTIMIZED, compressionType=ZSTD_COMPRESSION]. 09/08/2021 09:02:27.451 -0400 INFO org.apache.flink.contrib.streaming.state.RocksDBStateBackend Using predefined options: FLASH_SSD_OPTIMIZED. 
09/08/2021 09:02:27.451 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Using job/cluster config to configure application-defined state backend: RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 's3p://inin-prod-aps1-analytics/analytics-flink/analytics-flink-v1/3069/checkpoints/HASH', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), localRocksDbDirectories=null, enableIncrementalCheckpointing=UNDEFINED, numberOfTransferThreads=-1, writeBatchSize=-1} 09/08/2021 09:02:27.449 -0400 INFO org.apache.flink.runtime.util.ZooKeeperUtils Initialized DefaultCompletedCheckpointStore in 'ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}' with /checkpoints/2db4ee6397151a1109d1ca05188a4cbb. 09/08/2021 09:02:27.446 -0400 INFO org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology Built 1 pipelined regions in 0 ms 09/08/2021 09:02:27.441 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Successfully ran initialization on master in 0 ms. 09/08/2021 09:02:27.441 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Running initialization on master for job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:27.440 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:27.430 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Initializing job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:27.430 -0400 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_4 . 
09/08/2021 09:02:27.429 -0400 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService Starting DefaultLeaderElectionService with ZooKeeperLeaderElectionDriver{leaderPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}. 09/08/2021 09:02:26.535 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering TaskManager with ResourceID 10.105.236.109:50004-1e912c (akka.ssl.tcp://f32954a85e3a78e12ad552dafb6935b7:50004/user/rpc/taskmanager_0) at ResourceManager 09/08/2021 09:02:26.331 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering TaskManager with ResourceID 10.105.244.116:50004-6095c5 (akka.ssl.tcp://40949cf374b33869a2d6b1e0fd532b7c:50004/user/rpc/taskmanager_0) at ResourceManager 09/08/2021 09:02:26.283 -0400 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_3 . 09/08/2021 09:02:26.281 -0400 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Successfully recovered 1 persisted job graphs. 09/08/2021 09:02:26.281 -0400 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Recovered JobGraph(jobId: 2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:26.241 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher Could not archive completed job analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb) to the history server. 
java.io.IOException: File already exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
    at c.f.p.h.s.PrestoS3FileSystem.create(PrestoS3FileSystem.java:357)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1169)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1149)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1038)
    at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:154)
    at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at o.a.f.c.f.PluginFileSystemFactory$ClassLoaderFixingFileSystem.create(PluginFileSystemFactory.java:170)
    at o.a.f.r.h.FsJobArchivist.archiveJob(FsJobArchivist.java:73)
    at o.a.f.r.d.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:57)
    at o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
    ... 4 common frames omitted
Wrapped by: j.l.RuntimeException: java.io.IOException: File already exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
    at o.a.f.u.ExceptionUtils.rethrow(ExceptionUtils.java:316)
    at o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:51)
    at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    ... 3 common frames omitted
Wrapped by: j.u.c.CompletionException: java.lang.RuntimeException: java.io.IOException: File already exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
    at j.u.c.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at j.u.c.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
    at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
09/08/2021 09:02:26.240 -0400 ERROR org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
java.io.IOException: File already exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
    at c.f.p.h.s.PrestoS3FileSystem.create(PrestoS3FileSystem.java:357)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1169)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1149)
    at o.a.h.fs.FileSystem.create(FileSystem.java:1038)
    at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:154)
    at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at o.a.f.c.f.PluginFileSystemFactory$ClassLoaderFixingFileSystem.create(PluginFileSystemFactory.java:170)
    at o.a.f.r.h.FsJobArchivist.archiveJob(FsJobArchivist.java:73)
    at o.a.f.r.d.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:57)
    at o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
    at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
09/08/2021 09:02:26.227 -0400 INFO org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher Stopping ZooKeeperJobGraphStoreWatcher
09/08/2021 09:02:26.212 -0400 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Stopping DefaultJobGraphStore.
09/08/2021 09:02:26.211 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopped dispatcher akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1.
09/08/2021 09:02:26.211 -0400 INFO org.apache.flink.runtime.rest.handler.legacy.backpressure.BackPressureRequestCoordinator Shutting down back pressure request coordinator.
09/08/2021 09:02:26.201 -0400 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Released job graph 2db4ee6397151a1109d1ca05188a4cbb from ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/jobgraphs'}.
09/08/2021 09:02:26.192 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher Job 2db4ee6397151a1109d1ca05188a4cbb reached terminal state SUSPENDED. 09/08/2021 09:02:26.191 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'} 09/08/2021 09:02:26.191 -0400 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService Stopping DefaultLeaderElectionService. 09/08/2021 09:02:26.190 -0400 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Trying to recover job with job id 2db4ee6397151a1109d1ca05188a4cbb. 09/08/2021 09:02:26.190 -0400 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Retrieved job ids [2db4ee6397151a1109d1ca05188a4cbb] from ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/jobgraphs'} 09/08/2021 09:02:26.189 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering TaskManager with ResourceID 10.105.217.197:50004-6176ad (akka.ssl.tcp://0a11cfda1c06f5500298d05e94d88c13:50004/user/rpc/taskmanager_0) at ResourceManager 09/08/2021 09:02:26.182 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Stopping SlotPool. 09/08/2021 09:02:26.182 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Close ResourceManager connection fb0bf924810087592ad931aecb0387b1: Stopping JobMaster for job analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb).. 09/08/2021 09:02:26.182 -0400 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Recover all persisted job graphs. 09/08/2021 09:02:26.182 -0400 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Start SessionDispatcherLeaderProcess. 09/08/2021 09:02:26.181 -0400 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Suspending SlotPool. 
09/08/2021 09:02:26.178 -0400 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter Shutting down. 09/08/2021 09:02:26.164 -0400 WARN org.apache.flink.metrics.MetricGroup Name collision: Group already contains a Metric with the name 'taskSlotsTotal'. Metric will not be reported.[jobmanager, 10.105.221.188] 09/08/2021 09:02:26.164 -0400 WARN org.apache.flink.metrics.MetricGroup Name collision: Group already contains a Metric with the name 'taskSlotsAvailable'. Metric will not be reported.[jobmanager, 10.105.221.188] 09/08/2021 09:02:26.163 -0400 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl Starting the SlotManager. 09/08/2021 09:02:26.163 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager ResourceManager akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/resourcemanager_0 was granted leadership with fencing token b0f998b265508bb4cd715749318a4ce4 09/08/2021 09:02:26.161 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 09/08/2021 09:02:26.161 -0400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 09/08/2021 09:02:26.160 -0400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 09/08/2021 09:02:26.160 -0400 INFO org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again. 09/08/2021 09:02:26.160 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 
09/08/2021 09:02:26.160 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 09/08/2021 09:02:26.159 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 09/08/2021 09:02:26.159 -0400 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager State change: RECONNECTED 09/08/2021 09:02:26.159 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Session establishment complete on server zkeeper-2/10.105.219.52:2181, sessionid = 0x30000016c7d978a, negotiated timeout = 10000 09/08/2021 09:02:26.158 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-2/10.105.219.52:2181, initiating session 09/08/2021 09:02:26.157 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-2/10.105.219.52:2181 09/08/2021 09:02:25.274 -0400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper was reconnected. Leader election can be restarted. 
09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager State change: RECONNECTED 09/08/2021 09:02:25.273 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Session establishment complete on server zkeeper-1/10.105.253.30:2181, sessionid = 0x30000016c7d9789, negotiated timeout = 10000 09/08/2021 09:02:25.271 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-1/10.105.253.30:2181, initiating session 09/08/2021 09:02:25.271 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-1/10.105.253.30:2181 09/08/2021 09:02:25.120 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket error occurred: zkeeper-3/10.105.233.83:2181: Connection refused 09/08/2021 09:02:25.119 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-3/10.105.233.83:2181 09/08/2021 09:02:25.027 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket error occurred: zkeeper-3/10.105.233.83:2181: Connection refused 09/08/2021 09:02:25.026 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-3/10.105.233.83:2181 09/08/2021 09:02:23.738 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d978a, likely server has closed socket, closing socket connection and attempting reconnect 09/08/2021 09:02:23.737 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-1/10.105.253.30:2181, initiating session 09/08/2021 09:02:23.737 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server 
zkeeper-1/10.105.253.30:2181 09/08/2021 09:02:23.646 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d9789, likely server has closed socket, closing socket connection and attempting reconnect 09/08/2021 09:02:23.645 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-2/10.105.219.52:2181, initiating session 09/08/2021 09:02:23.644 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-2/10.105.219.52:2181 09/08/2021 09:02:23.507 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d9789, likely server has closed socket, closing socket connection and attempting reconnect 09/08/2021 09:02:23.506 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-1/10.105.253.30:2181, initiating session 09/08/2021 09:02:23.505 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-1/10.105.253.30:2181 09/08/2021 09:02:22.879 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d978a, likely server has closed socket, closing socket connection and attempting reconnect 09/08/2021 09:02:22.878 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket connection established to zkeeper-2/10.105.219.52:2181, initiating session 09/08/2021 09:02:22.877 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening socket connection to server zkeeper-2/10.105.219.52:2181 09/08/2021 09:02:22.742 -0400 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Suspending 09/08/2021 09:02:22.725 -0400 
INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}. 09/08/2021 09:02:22.725 -0400 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Stopping DefaultLeaderRetrievalService. 09/08/2021 09:02:22.724 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster Stopping the JobMaster for job analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb). 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender https://10.105.245.207:8081 no longer participates in the leader election. 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender LeaderContender: JobManagerRunnerImpl no longer participates in the leader election. 09/08/2021 09:02:22.679 -0400 WARN org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily). 09/08/2021 09:02:22.677 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopping all currently running jobs of dispatcher akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1. 
09/08/2021 09:02:22.677 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopping dispatcher akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1. 09/08/2021 09:02:22.677 -0400 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Stopping SessionDispatcherLeaderProcess. 09/08/2021 09:02:22.677 -0400 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl Suspending the SlotManager. 09/08/2021 09:02:22.676 -0400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}. 09/08/2021 09:02:22.676 -0400 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Stopping DefaultLeaderRetrievalService. 09/08/2021 09:02:22.676 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager ResourceManager akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/resourcemanager_0 was revoked leadership. Clearing fencing token. 09/08/2021 09:02:22.676 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender LeaderContender: DefaultDispatcherRunner no longer participates in the leader election. 09/08/2021 09:02:22.676 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Closing TaskExecutor connection 10.105.217.197:50004-6176ad because: ResourceManager leader changed to new address null 09/08/2021 09:02:22.675 -0400 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Closing TaskExecutor connection 10.105.236.109:50004-1e912c because: ResourceManager leader changed to new address null 09/08/2021 09:02:22.675 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. 
The contender LeaderContender: StandaloneResourceManager no longer participates in the leader election. 09/08/2021 09:02:22.675 -0400 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager State change: SUSPENDED 09/08/2021 09:02:22.673 -0400 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 09/08/2021 09:02:22.673 -0400 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 09/08/2021 09:02:22.673 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender LeaderContender: StandaloneResourceManager no longer participates in the leader election. 09/08/2021 09:02:22.673 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender https://10.105.221.188:8081 no longer participates in the leader election. 09/08/2021 09:02:22.673 -0400 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Connection to ZooKeeper suspended. The contender LeaderContender: DefaultDispatcherRunner no longer participates in the leader election. 
09/08/2021 09:02:22.673 -0400 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager State change: SUSPENDED 09/08/2021 09:02:22.574 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d978a, likely server has closed socket, closing socket connection and attempting reconnect 09/08/2021 09:02:22.573 -0400 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to read additional data from server sessionid 0x30000016c7d9789, likely server has closed socket, closing socket connection and attempting reconnect
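The attached log is listed newest entry first, with many entries run together on one line, which makes the sequence (ZooKeeper connection suspended, job reaching SUSPENDED, archive failure, job recovery) hard to follow. Below is a minimal sketch for splitting such a dump into entries and putting them in chronological order. It assumes every entry starts with a timestamp of the form MM/DD/YYYY HH:MM:SS.mmm followed by a UTC offset, as in the log above; parse_entries and chronological are hypothetical helper names, and the sample is three entries copied from the log.

```python
import re
from datetime import datetime

# Zero-width split point in front of every timestamp that starts an entry.
ENTRY_START = re.compile(r"(?=\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} )")
# Timestamp, level, logger, then the rest of the entry (message and any stack trace).
ENTRY = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}) (\w+) (\S+) (.*)", re.S
)


def parse_entries(text):
    """Split a run-together log dump into (timestamp, level, logger, message) tuples."""
    entries = []
    for chunk in ENTRY_START.split(text):
        match = ENTRY.match(chunk.strip())
        if match:
            timestamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S.%f %z")
            entries.append((timestamp, match.group(2), match.group(3), match.group(4).strip()))
    return entries


def chronological(entries):
    """Return entries oldest first.

    The input is newest first, so it is reversed before the stable sort;
    entries that share a timestamp then keep their original emission order.
    """
    return sorted(reversed(entries), key=lambda entry: entry[0])


sample = (
    "09/08/2021 09:02:26.240 -0400 ERROR org.apache.flink.runtime.history.FsJobArchivist "
    "Failed to archive job. "
    "09/08/2021 09:02:26.192 -0400 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher "
    "Job 2db4ee6397151a1109d1ca05188a4cbb reached terminal state SUSPENDED. "
    "09/08/2021 09:02:22.675 -0400 INFO "
    "org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager "
    "State change: SUSPENDED"
)

timeline = chronological(parse_entries(sample))
for timestamp, level, logger, message in timeline:
    print(timestamp.isoformat(), level, logger.rsplit(".", 1)[-1], message)
```

Filtering the resulting timeline on the job ID or on the ERROR level is then a one-line list comprehension.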