Besides, would you like to participate in our survey thread[1] on the user list about "How do you use high-availability services in Flink?"
It would help Flink improve its high-availability services.

Best,
tison.

[1] https://lists.apache.org/x/thread.html/c0cc07197e6ba30b45d7709cc9e17d8497e5e3f33de504d58dfcafad@%3Cuser.flink.apache.org%3E

Zili Chen <wander4...@gmail.com> wrote on Thu, Aug 22, 2019 at 10:16 AM:

> Hi Aleksandar,
>
> based on your log:
>
> taskmanager_1 | 2019-08-22 00:05:03,713 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting
> to ResourceManager
> akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000)
> .
> taskmanager_1 | 2019-08-22 00:05:04,137 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not
> resolve ResourceManager address
> akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms:
> Could not connect to rpc endpoint under address
> akka.tcp://flink@jobmanager:6123/user/jobmanager..
>
> it looks like you return a jobmanager address from the retrieval service of
> the resource manager. Please check the implementation carefully, or share it
> on the mailing list so that others can help investigate.
>
> Best,
> tison.
>
>
> Zhu Zhu <reed...@gmail.com> wrote on Thu, Aug 22, 2019 at 10:11 AM:
>
>> Hi Aleksandar,
>>
>> The resource manager address is retrieved from the HA services.
>> Would you check whether your customized HA services are returning the
>> right LeaderRetrievalService and whether the LeaderRetrievalService is
>> really retrieving the right leader's address?
>> Or is it possible that the resource manager address stored in HA is
>> replaced by the jobmanager address in some case?
>>
>> Thanks,
>> Zhu Zhu
>>
>> Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote on Thu, Aug 22, 2019
>> at 8:16 AM:
>>
>>> Hi all,
>>>
>>> I’m experimenting with using my own implementation of HA services
>>> instead of ZooKeeper that would persist JobManager information on a
>>> Kubernetes volume instead of in ZooKeeper.
>>>
>>> I’ve set the high-availability option in flink-conf.yaml to the FQN of
>>> my factory class, and started the docker ensemble as I usually do (i.e.
>>> with no special “cluster” arguments or scripts.)
>>>
>>> What’s happening now is that TaskManager is unable to connect to
>>> ResourceManager, because it seems it’s trying to use the /user/jobmanager
>>> path instead of /user/resourcemanager.
>>>
>>> Here’s what I found in the logs:
>>>
>>>
>>> jobmanager_1 | 2019-08-22 00:05:00,963 INFO akka.remote.Remoting
>>> - Remoting started; listening on
>>> addresses :[akka.tcp://flink@jobmanager:6123]
>>> jobmanager_1 | 2019-08-22 00:05:00,975 INFO
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor
>>> system started at akka.tcp://flink@jobmanager:6123
>>>
>>> jobmanager_1 | 2019-08-22 00:05:02,380 INFO
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting
>>> RPC endpoint for
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
>>> akka://flink/user/resourcemanager .
>>>
>>> jobmanager_1 | 2019-08-22 00:05:03,138 INFO
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting
>>> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
>>> at akka://flink/user/dispatcher .
>>>
>>> jobmanager_1 | 2019-08-22 00:05:03,211 INFO
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>> ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager
>>> was granted leadership with fencing token 00000000000000000000000000000000
>>>
>>> jobmanager_1 | 2019-08-22 00:05:03,292 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher
>>> akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership
>>> with fencing token 00000000-0000-0000-0000-000000000000
>>>
>>> taskmanager_1 | 2019-08-22 00:05:03,713 INFO
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting
>>> to ResourceManager
>>> akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000)
>>> .
>>> taskmanager_1 | 2019-08-22 00:05:04,137 INFO
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not
>>> resolve ResourceManager address
>>> akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms:
>>> Could not connect to rpc endpoint under address
>>> akka.tcp://flink@jobmanager:6123/user/jobmanager..
>>>
>>> Is this a known bug? I’d appreciate any help I can get.
>>>
>>> Thanks,
>>> Aleksandar Mastilovic
>>>
>>
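
To make the check suggested above concrete: in the logs, the ResourceManager endpoint is started at /user/resourcemanager, but the TaskManager is handed an address ending in /user/jobmanager. Below is a minimal, Flink-independent sketch in Java of keeping one leader address per component, so that a ResourceManager lookup can never fall through to another component's path. The class, enum, and method names are hypothetical and are not Flink's API; only the address strings are taken from the logs in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for the address bookkeeping of a custom HA service.
// It does not use or imitate any Flink class.
final class LeaderAddressSketch {

    // Components whose leader address has to be handed out separately.
    enum Component { RESOURCE_MANAGER, DISPATCHER }

    // One stored address per component (in the real setup this would live
    // on the Kubernetes volume mentioned in the original mail).
    private final Map<Component, String> leaderAddresses =
            new EnumMap<>(Component.class);

    LeaderAddressSketch(String actorSystemAddress) {
        // Endpoint names as they appear in the JobManager log.
        leaderAddresses.put(Component.RESOURCE_MANAGER,
                actorSystemAddress + "/user/resourcemanager");
        leaderAddresses.put(Component.DISPATCHER,
                actorSystemAddress + "/user/dispatcher");
    }

    // Address a TaskManager should receive when it asks for the
    // ResourceManager leader.
    String resourceManagerLeaderAddress() {
        return leaderAddresses.get(Component.RESOURCE_MANAGER);
    }

    // Address a client should receive when it asks for the Dispatcher leader.
    String dispatcherLeaderAddress() {
        return leaderAddresses.get(Component.DISPATCHER);
    }

    public static void main(String[] args) {
        LeaderAddressSketch ha =
                new LeaderAddressSketch("akka.tcp://flink@jobmanager:6123");
        System.out.println(ha.resourceManagerLeaderAddress());
        System.out.println(ha.dispatcherLeaderAddress());
    }
}
```

In Flink itself the corresponding lookup should be HighAvailabilityServices#getResourceManagerLeaderRetriever(); whether the custom implementation in this thread mixes up its retrievers there is only a guess based on the logs.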