Hi all! I'm using Flink 1.14.4 with Kubernetes operator version 1.1.0. Sometimes, when the operator restarts the cluster after a change to the FlinkDeployment object (taking a savepoint first), the newly created JobManager exits right after it starts. Here is its log:
2022-08-18T06:47:52.627825838Z DEBUG org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Successfully wrote leader information: Leader=http://10.109.0.42:8081, session ID=4083cf37-1f54-4777-87af-a7c032ba1a3e.
2022-08-18T06:47:52.719469481Z INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Starting the resource manager.
2022-08-18T06:47:52.724988536Z DEBUG org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Confirm leader session ID 9e88e696-1cd7-41a1-bbcd-16c7205a2490 for leader akka.tcp://flink@10.109.0.42:6123/user/rpc/dispatcher_0.
2022-08-18T06:47:52.725003235Z INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
2022-08-18T06:47:52.819319769Z DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Starting the slot manager.
2022-08-18T06:47:52.927062157Z DEBUG org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Successfully wrote leader information: Leader=akka.tcp://flink@10.109.0.42:6123/user/rpc/dispatcher_0, session ID=9e88e696-1cd7-41a1-bbcd-16c7205a2490.
2022-08-18T06:47:53.019417156Z DEBUG org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - New leader information: Leader=akka.tcp://flink@10.109.0.42:6123/user/rpc/dispatcher_0, session ID=9e88e696-1cd7-41a1-bbcd-16c7205a2490.
2022-08-18T06:47:53.019454163Z DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@10.109.0.42:6123/user/rpc/dispatcher_0. Returning a org.apache.flink.runtime.dispatcher.DispatcherGateway gateway.
2022-08-18T06:47:53.828514180Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using predefined options: DEFAULT.
2022-08-18T06:47:53.828748188Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using default options factory: DefaultConfigurableOptionsFactory{configuredOptions={state.backend.rocksdb.thread.num=4}}.
2022-08-18T06:47:54.323813798Z INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Recovered 0 pods from previous attempts, current attempt id is 2.
2022-08-18T06:47:54.323857346Z INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
2022-08-18T06:47:54.324174540Z DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2022-08-18T06:47:54.324249551Z DEBUG org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Confirm leader session ID a1d34254-4a1b-4679-8ef5-736b68ed2d18 for leader akka.tcp://flink@10.109.0.42:6123/user/rpc/resourcemanager_1.
2022-08-18T06:47:54.324298408Z DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2022-08-18T06:47:54.426306683Z DEBUG org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - New leader information: Leader=akka.tcp://flink@10.109.0.42:6123/user/rpc/resourcemanager_1, session ID=a1d34254-4a1b-4679-8ef5-736b68ed2d18.
2022-08-18T06:47:54.426327318Z DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@10.109.0.42:6123/user/rpc/resourcemanager_1. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
2022-08-18T06:47:54.426559526Z DEBUG org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Successfully wrote leader information: Leader=akka.tcp://flink@10.109.0.42:6123/user/rpc/resourcemanager_1, session ID=a1d34254-4a1b-4679-8ef5-736b68ed2d18.
2022-08-18T06:47:56.028429878Z INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Job 00000000000000000000000000000000 is submitted.
2022-08-18T06:47:56.028642950Z INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Submitting Job with JobId=00000000000000000000000000000000.
2022-08-18T06:47:56.119384473Z DEBUG org.apache.flink.runtime.blob.BlobClient [] - PUT BLOB stream to /127.0.0.1:60946.
2022-08-18T06:47:56.119744634Z DEBUG org.apache.flink.runtime.blob.BlobServerConnection [] - Received PUT request for BLOB of job 00000000000000000000000000000000 with from /127.0.0.1.
2022-08-18T06:47:57.121306489Z DEBUG org.apache.flink.runtime.blob.FileSystemBlobStore [] - Copying from /opt/flink/tmp/blobStore-d964fcd9-46bf-4b3e-8795-bf4de9018f74/job_00000000000000000000000000000000/blob_p-208ad2c1532579afee6fba04add90e38b2c65bf0-c6776ad9431526449faaf8732e0e3a69 to s3p://flink-checkpoints/k8s-ha-my-job/my-job/blob/job_00000000000000000000000000000000/blob_p-208ad2c1532579afee6fba04add90e38b2c65bf0-c6776ad9431526449faaf8732e0e3a69.
2022-08-18T06:47:57.220780565Z DEBUG org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetectorFactory [] - Loaded default ResourceLeakDetector: org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetector@4c77dc87
2022-08-18T06:47:58.146460631Z WARN com.amazonaws.services.s3.internal.Mimetypes [] - Unable to find 'mime.types' file in classpath
2022-08-18T06:48:02.061555069Z INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received JobGraph submission 'my-job' (00000000000000000000000000000000).
2022-08-18T06:48:02.086475676Z INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY
2022-08-18T06:48:02.086916494Z INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. Diagnostics null.
2022-08-18T06:48:02.087327641Z INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2022-08-18T06:48:02.092730934Z DEBUG org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache [] - Freed 1 thread-local buffer(s) from thread: flink-rest-server-netty-worker-thread-1
2022-08-18T06:48:02.118520298Z INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Removing cache directory /tmp/flink-web-3e4b5ae9-07b2-4040-b0c5-d583dc7e24f7/flink-web-ui
2022-08-18T06:48:02.118954834Z INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Stopping DefaultLeaderElectionService.
2022-08-18T06:48:02.118994602Z INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Closing KubernetesLeaderElectionDriver{configMapName='my-job-restserver-leader'}.
2022-08-18T06:48:02.119073057Z INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shut down complete.

I have to delete and re-create the FlinkDeployment object to restore the cluster. What conditions cause the JobManager to report 'Application completed SUCCESSFULLY' right at startup?