Hello! In our Kubernetes application cluster (managed by flink-operator), several jobs restart at the same time with the same error.
What is the reason for these restarts, and how can they be prevented?

2022-11-25T07:50:47.253459360Z INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id 5e76a7e2-0a88-4cff-b371-0a36f2b4cebd.
2022-11-25T07:50:47.257093040Z INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Closing the slot manager.
2022-11-25T07:50:47.257145141Z INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Suspending the slot manager.
2022-11-25T07:50:47.258932353Z INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-11-25T07:50:47.258974224Z INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='job-name-00000000000000000000000000000000-jobmanager-leader'}.
2022-11-25T07:50:47.259457605Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for job-name/job-name-00000000000000000000000000000000-jobmanager-leader, watching id:c310c788-946a-4afd-8aeb-debd99a9045d
2022-11-25T07:50:47.351610077Z INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id dc208675-e994-4697-b94c-542fd52e2046. Stopping the DispatcherLeaderProcess.
2022-11-25T07:50:47.352119375Z INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2022-11-25T07:50:47.352310045Z INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher akka.tcp://flink@10.156.130.89:6123/user/rpc/dispatcher_0.
2022-11-25T07:50:47.352349213Z INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher akka.tcp://flink@10.156.130.89:6123/user/rpc/dispatcher_0.
2022-11-25T07:50:47.353150035Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'job-name' (00000000000000000000000000000000).
2022-11-25T07:50:47.363572384Z INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 00000000000000000000000000000000 reached terminal state SUSPENDED.
2022-11-25T07:50:47.364190310Z INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Released job graph 00000000000000000000000000000000 from KubernetesStateHandleStore{configMapName='job-name-dispatcher-leader'}.
2022-11-25T07:50:47.366146897Z INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job job-name (00000000000000000000000000000000) switched from state RUNNING to SUSPENDED.
2022-11-25T07:50:47.366172044Z org.apache.flink.util.FlinkException: Scheduler is being stopped.
2022-11-25T07:50:47.366178811Z at org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:600) ~[flink-dist_2.12-1.14.4.jar:1.14.4]
2022-11-25T07:50:47.366184972Z at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:972) ~[flink-dist_2.12-1.14.4.jar:1.14.4]
2022-11-25T07:50:47.366190500Z at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:935) ~[flink-dist_2.12-1.14.4.jar:1.14.4]
2022-11-25T07:50:47.366196302Z at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:407) ~[flink-dist_2.12-1.14.4.jar:1.14.4]
2022-11-25T07:50:47.366201692Z at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) ~[flink-dist_2.12-1.14.4.jar:1.14.4]
2022-11-25T07:50:47.366207681Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:580) ~[flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366213755Z at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366220099Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:579) ~[flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366225868Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191) ~[flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366231897Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366237418Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366243262Z at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366250022Z at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366256203Z at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366261762Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366266861Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366287887Z at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366292783Z at akka.actor.Actor.aroundReceive$(Actor.scala:535) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366297259Z at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366302193Z at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366307938Z at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366313472Z at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366318936Z at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366324575Z at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_de9aa37f-fc7d-4780-8d43-5715ee860795.jar:1.14.4]
2022-11-25T07:50:47.366329814Z at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
2022-11-25T07:50:47.366334303Z at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
2022-11-25T07:50:47.366338898Z at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
2022-11-25T07:50:47.366343602Z at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
2022-11-25T07:50:47.366348796Z at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
2022-11-25T07:50:47.555434515Z INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
2022-11-25T07:50:47.555577858Z INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job 00000000000000000000000000000000 has been suspended.
2022-11-25T07:50:47.556533450Z INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Closing SourceCoordinator for source Source: topic.
2022-11-25T07:50:47.558037037Z INFO org.apache.kafka.common.metrics.Metrics [] - Metrics scheduler closed
2022-11-25T07:50:47.558066841Z INFO org.apache.kafka.common.metrics.Metrics [] - Closing reporter org.apache.kafka.common.metrics.JmxReporter
2022-11-25T07:50:47.558098042Z INFO org.apache.kafka.common.metrics.Metrics [] - Metrics reporters closed
2022-11-25T07:50:47.559339549Z INFO org.apache.kafka.common.utils.AppInfoParser [] - App info kafka.consumer for flink-job-name-topic-enumerator-consumer unregistered
2022-11-25T07:50:47.560098212Z INFO org.apache.kafka.common.utils.AppInfoParser [] - App info kafka.admin.client for flink-job-name-topic-enumerator-admin-client unregistered
2022-11-25T07:50:47.561368999Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Suspending
2022-11-25T07:50:47.561424898Z INFO org.apache.flink.kubernetes.highavailability.KubernetesCheckpointIDCounter [] - Shutting down.
2022-11-25T07:50:47.561490412Z INFO org.apache.kafka.common.metrics.Metrics [] - Metrics scheduler closed
2022-11-25T07:50:47.561540717Z INFO org.apache.kafka.common.metrics.Metrics [] - Closing reporter org.apache.kafka.common.metrics.JmxReporter
2022-11-25T07:50:47.561556978Z INFO org.apache.kafka.common.metrics.Metrics [] - Metrics reporters closed
2022-11-25T07:50:47.561718430Z INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Source coordinator for source Source: topic closed.
2022-11-25T07:50:47.654239581Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection fe39d6fe5e1ac5a66e3afcdd2d900e3f: Stopping JobMaster for job 'job-name' (00000000000000000000000000000000).
2022-11-25T07:50:47.654412182Z INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-11-25T07:50:47.654523159Z INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='job-name-resourcemanager-leader'}.
2022-11-25T07:50:47.654622715Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for job-name/job-name-resourcemanager-leader, watching id:7c16cf70-0a82-4c2b-bcec-f824f75c39cc
2022-11-25T07:50:47.656576691Z INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Stopping DefaultLeaderElectionService.
2022-11-25T07:50:47.656753283Z INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Closing KubernetesLeaderElectionDriver{configMapName='job-name-00000000000000000000000000000000-jobmanager-leader'}.
2022-11-25T07:50:47.656896536Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for job-name/job-name-00000000000000000000000000000000-jobmanager-leader, watching id:f670586e-5f1e-4520-a50a-d7e81b9a655e
2022-11-25T07:50:47.659729550Z WARN org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application has been cancelled because the ApplicationDispatcherBootstrap is being stopped.
2022-11-25T07:50:47.661631156Z INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://10.156.130.89:8081 was granted leadership with leaderSessionID=ea5e4b77-dca8-4a61-a25a-05027b0391eb
2022-11-25T07:50:47.748098049Z INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped dispatcher akka.tcp://flink@10.156.130.89:6123/user/rpc/dispatcher_0.
2022-11-25T07:50:47.749450398Z INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
2022-11-25T07:50:47.749545245Z INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Start SessionDispatcherLeaderProcess.
2022-11-25T07:50:47.749609886Z INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Recover all persisted job graphs.
2022-11-25T07:50:47.805729406Z INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [00000000000000000000000000000000] from KubernetesStateHandleStore{configMapName='job-name-dispatcher-leader'}
2022-11-25T07:50:47.805757899Z INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 00000000000000000000000000000000.
2022-11-25T07:50:47.982753015Z INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Recovered JobGraph(jobId: 00000000000000000000000000000000).
2022-11-25T07:50:47.982792610Z INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Successfully recovered 1 persisted job graphs.
2022-11-25T07:50:47.984388298Z INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_3 .
2022-11-25T07:50:48.263187219Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector job-name-00000000000000000000000000000000-jobmanager-leader with lock identity 01e96e8e-11c9-4609-85fc-2add4f7f20a9.
2022-11-25T07:50:48.263488651Z INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='job-name-00000000000000000000000000000000-jobmanager-leader'}.
2022-11-25T07:50:48.263575979Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Starting to watch for job-name/job-name-00000000000000000000000000000000-jobmanager-leader, watching id:5a603da1-f5a4-423e-a326-826f3db3e5c9
2022-11-25T07:50:48.263903194Z INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
2022-11-25T07:50:48.276250367Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using predefined options: DEFAULT.
2022-11-25T07:50:48.276346433Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using default options factory: DefaultConfigurableOptionsFactory{configuredOptions={state.backend.rocksdb.thread.num=4}}.
2022-11-25T07:50:48.282972740Z INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 01e96e8e-11c9-4609-85fc-2add4f7f20a9 for job-name-00000000000000000000000000000000-jobmanager-leader.
2022-11-25T07:50:48.351177868Z WARN org.apache.flink.connector.kafka.sink.KafkaSinkBuilder [] - Property [transaction.timeout.ms] not specified. Setting it to PT1H
2022-11-25T07:50:48.357515666Z INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_4 .
2022-11-25T07:50:48.357840774Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Initializing job 'job-name' (00000000000000000000000000000000).
2022-11-25T07:50:48.449406328Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for job-name (00000000000000000000000000000000).
2022-11-25T07:50:48.449691043Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='job-name-00000000000000000000000000000000-jobmanager-leader'}.
2022-11-25T07:50:48.454662598Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Found 2 checkpoints in KubernetesStateHandleStore{configMapName='job-name-00000000000000000000000000000000-jobmanager-leader'}.
2022-11-25T07:50:48.454686302Z INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Job 00000000000000000000000000000000 was recovered successfully.
2022-11-25T07:50:48.454693684Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Trying to fetch 2 checkpoints from storage.
2022-11-25T07:50:48.454726858Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Trying to retrieve checkpoint 11617.
2022-11-25T07:50:48.561472367Z INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Trying to retrieve checkpoint 11618.
2022-11-25T07:50:48.660339241Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Running initialization on master for job job-name (00000000000000000000000000000000).
2022-11-25T07:50:48.660371230Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Successfully ran initialization on master in 0 ms.
2022-11-25T07:50:48.671673073Z INFO org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 0 ms
2022-11-25T07:50:48.750328281Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using job/cluster config to configure application-defined state backend: EmbeddedRocksDBStateBackend{, localRocksDbDirectories=[/opt/flink/rocksdb], enableIncrementalCheckpointing=TRUE, numberOfTransferThreads=4, writeBatchSize=2097152}
2022-11-25T07:50:48.750467136Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using predefined options: DEFAULT.
2022-11-25T07:50:48.750515632Z INFO org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend [] - Using application-defined options factory: DefaultConfigurableOptionsFactory{configuredOptions={state.backend.rocksdb.thread.num=4}}.
2022-11-25T07:50:48.750552578Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using application-defined state backend: EmbeddedRocksDBStateBackend{, localRocksDbDirectories=[/opt/flink/rocksdb], enableIncrementalCheckpointing=TRUE, numberOfTransferThreads=4, writeBatchSize=2097152}
2022-11-25T07:50:48.750563305Z INFO org.apache.flink.runtime.state.StateBackendLoader [] - State backend loader loads the state backend as EmbeddedRocksDBStateBackend
2022-11-25T07:50:48.751075656Z INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using job/cluster config to configure application-defined checkpoint storage: org.apache.flink.runtime.state.storage.FileSystemCheckpointStorage@3f46669b
2022-11-25T07:50:48.753978414Z INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 00000000000000000000000000000000 from Checkpoint 11618 @ 1669361519869 for 00000000000000000000000000000000 located at s3p://flink-checkpoints/k8s-checkpoint-job-name/00000000000000000000000000000000/chk-11618.