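As the title says. I ran into this problem before, and at that time my email asked about the cluster restarting continuously.
This time it is the same symptom, the cluster keeps restarting, but after analyzing it the cause appears to be the one in the title. The relevant log excerpt is below: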
2021-11-01 14:05:36,954 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181) - Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
2021-11-01 14:05:36,954 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125) - Successfully recovered 1 persisted job graphs.
2021-11-01 14:05:36,962 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:232) - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .
2021-11-01 14:05:44,810 INFO [94-flink-akka.actor.default-dispatcher-30] org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.start(DefaultLeaderElectionService.java:93) - Starting DefaultLeaderElectionService with ZooKeeperLeaderElectionDriver{leaderPath='/leader/dfced635fd8c224222a9cbaaf1c5054f/job_manager_lock'}.
2021-11-01 14:05:44,836 ERROR [94-flink-akka.actor.default-dispatcher-30] org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454) - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job dfced635fd8c224222a9cbaaf1c5054f failed.
	at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
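As the log shows, the JobGraph was recovered, a leader election was started (it looks like the JobMaster's leader election service), and then the JobMaster failed.
What I would like to know is: why does a failed JobMaster cause the standalone JM process itself to fail?
The JM process is shared by all jobs. Even if one previously submitted job cannot be recovered after startup, that should not be a reason for the whole process to go down.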
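
To make the expectation concrete, here is a minimal sketch of the two behaviours I am comparing. It is not Flink's Dispatcher code: `SharedDispatcherSketch`, `recoverJob`, and the `failProcessOnJobMasterFailure` flag are hypothetical names invented for illustration. With the flag set to true, one failed JobMaster takes the shared process down (what I observe); with it set to false, only that job is marked failed (what I would have expected).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Flink.
class SharedDispatcherSketch {
    enum JobState { RUNNING, FAILED }

    private final Map<String, JobState> jobs = new LinkedHashMap<>();
    private final boolean failProcessOnJobMasterFailure;
    private boolean processAlive = true;

    SharedDispatcherSketch(boolean failProcessOnJobMasterFailure) {
        this.failProcessOnJobMasterFailure = failProcessOnJobMasterFailure;
    }

    // Simulates recovering one persisted job after a restart.
    void recoverJob(String jobId, boolean jobMasterStartsCleanly) {
        if (!processAlive) {
            return; // the shared process is already gone, nothing else is recovered
        }
        if (jobMasterStartsCleanly) {
            jobs.put(jobId, JobState.RUNNING);
            return;
        }
        jobs.put(jobId, JobState.FAILED);
        if (failProcessOnJobMasterFailure) {
            // Observed behaviour: a single job's failure is escalated to a fatal error.
            processAlive = false;
        }
    }

    boolean isProcessAlive() {
        return processAlive;
    }

    // Returns null if the job was never recovered.
    JobState stateOf(String jobId) {
        return jobs.get(jobId);
    }

    public static void main(String[] args) {
        for (boolean fatal : new boolean[] {true, false}) {
            SharedDispatcherSketch d = new SharedDispatcherSketch(fatal);
            d.recoverJob("broken-job", false);
            d.recoverJob("healthy-job", true);
            System.out.println("fatal=" + fatal
                    + " processAlive=" + d.isProcessAlive()
                    + " healthy-job=" + d.stateOf("healthy-job"));
        }
    }
}
```

In the second mode the healthy job keeps running and the broken one could be inspected or cancelled separately, which is the isolation I am asking about.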