Hi, Without any other solution, I made a shell script that copies the original content of FLINK_CONF_DIR in a temporary rep, modify flink-conf.yaml to set yarn.properties-file.location, and change FLINK_CONF_DIR to that temp rep before executing flink. I am now able to select the container I want, but I think it should be made simpler... I'll open a Jira.
Best regards, Arnaud De : LINZ, Arnaud Envoyé : jeudi 1 février 2018 16:23 À : user@flink.apache.org Objet : How to handle multiple yarn sessions and choose at runtime the one to submit a ha streaming job ? Hello, I am using Flink 1.3.2 and I'm struggling to achieve something that should be simple. For isolation reasons, I want to start multiple long living yarn session containers (with the same user) and choose at run-time, when I start a HA streaming app, which container will hold it. I start my yarn session with the command line option : -Dyarn.properties-file.location=mydir The session is created and a .yarn-properties-$USER file is generated. And I've tried the following to submit my job: CASE 1 flink-conf.yaml : yarn.properties-file.location: mydir flink run options : none * Uses zookeeper and works - but I cannot choose the container as the property file is global. CASE 2 flink-conf.yaml : nothing flink run options : -yid applicationId * Do not use zookeeper, tries to connect to yarn job manager but fails in "Job submission to the JobManager timed out" error CASE 3 flink-conf.yaml : nothing flink run options : -yid applicationId and -yD with all dynamic properties found in the "dynamicPropertiesString" of .yarn-properties-$USER file * Same as case 2 CASE 4 flink-conf.yaml : nothing flink run options : -yD yarn.properties-file.location=mydir * Tries to connect to local (non yarn) job manager (and fails) CASE 5 Even weirder: flink-conf.yaml : yarn.properties-file.location: mydir flink run options : -yD yarn.properties-file.location=mydir * Still tries to connect to local (non yarn) job manager! What am I doing wrong? Logs extracts : CASE 1: 2018:02:01 15:43:20 - Waiting until all TaskManagers have connected 2018:02:01 15:43:20 - Starting client actor system. 2018:02:01 15:43:20 - Starting ZooKeeperLeaderRetrievalService. 2018:02:01 15:43:20 - Trying to select the network interface and address to use by connecting to the leading JobManager. 2018:02:01 15:43:20 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 2018:02:01 15:43:21 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970. 2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService. 2018:02:01 15:43:21 - Slf4jLogger started 2018:02:01 15:43:21 - Starting remoting 2018:02:01 15:43:21 - Remoting started; listening on addresses :[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:36340] 2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService. 2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService. 2018:02:01 15:43:21 - TaskManager status (2/1) 2018:02:01 15:43:21 - All TaskManagers are connected 2018:02:01 15:43:21 - Submitting job with JobID: f69197b0b80a76319a87bde10c1e3f77. Waiting for job completion. 2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService. 2018:02:01 15:43:21 - Received SubmitJobAndWait(JobGraph(jobId: f69197b0b80a76319a87bde10c1e3f77)) but there is no connection to a JobManager yet. 2018:02:01 15:43:21 - Received job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77). 2018:02:01 15:43:21 - Disconnect from JobManager null. 2018:02:01 15:43:21 - Connect to JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245]. 2018:02:01 15:43:21 - Connected to JobManager at Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245] with leader session id 388af5b8-5555-4923-8ee4-8a4b9bfbb0b9. 2018:02:01 15:43:21 - Sending message to JobManager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager to submit job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77) and wait for progress 2018:02:01 15:43:21 - Upload jar files to job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager. 2018:02:01 15:43:21 - Blob client connecting to akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager 2018:02:01 15:43:22 - Submit job to the job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager. 2018:02:01 15:43:22 - Job f69197b0b80a76319a87bde10c1e3f77 was successfully submitted to the JobManager akka://flink/deadLetters. 2018:02:01 15:43:22 - 02/01/2018 15:43:22 Job execution switched to status RUNNING. CASE 2: 2018:02:01 15:48:43 - Waiting until all TaskManagers have connected 2018:02:01 15:48:43 - Starting client actor system. 2018:02:01 15:48:43 - Trying to select the network interface and address to use by connecting to the leading JobManager. 2018:02:01 15:48:43 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 2018:02:01 15:48:43 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970. 2018:02:01 15:48:43 - Slf4jLogger started 2018:02:01 15:48:43 - Starting remoting 2018:02:01 15:48:43 - Remoting started; listening on addresses :[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:34140] 2018:02:01 15:48:43 - TaskManager status (2/1) 2018:02:01 15:48:43 - All TaskManagers are connected 2018:02:01 15:48:43 - Submitting job with JobID: cd3e0e223c57d01d415fe7a6a308576c. Waiting for job completion. 2018:02:01 15:48:43 - Received SubmitJobAndWait(JobGraph(jobId: cd3e0e223c57d01d415fe7a6a308576c)) but there is no connection to a JobManager yet. 2018:02:01 15:48:43 - Received job SND-IMP-SIGNAST (cd3e0e223c57d01d415fe7a6a308576c). 2018:02:01 15:48:43 - Disconnect from JobManager null. 2018:02:01 15:48:43 - Connect to JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245]. 2018:02:01 15:48:43 - Connected to JobManager at Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245] with leader session id 00000000-0000-0000-0000-000000000000. 2018:02:01 15:48:43 - Sending message to JobManager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager to submit job SND-IMP-SIGNAST (cd3e0e223c57d01d415fe7a6a308576c) and wait for progress 2018:02:01 15:48:43 - Upload jar files to job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager. 2018:02:01 15:48:43 - Blob client connecting to akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager 2018:02:01 15:48:45 - Submit job to the job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager. 2018:02:01 15:49:45 - Terminate JobClientActor. 2018:02:01 15:49:45 - Disconnect from JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245]. Then Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission. at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119) at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251) at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89) at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) CASE 3,4 2018:02:01 15:35:14 - Starting client actor system. 2018:02:01 15:35:14 - Trying to select the network interface and address to use by connecting to the leading JobManager. 2018:02:01 15:35:14 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 2018:02:01 15:35:14 - Retrieved new target address localhost/127.0.0.1:6123. 2018:02:01 15:35:15 - Trying to connect to address localhost/127.0.0.1:6123 2018:02:01 15:35:15 - Failed to connect from address 'elara-edge-u2-n01/10.136.170.196': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/192.168.117.1': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.225': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/fe80:0:0:0:20c:29ff:fe8f:3fdd%ens192': Le réseau n'est pas accessible (connect failed) 2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.196': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/192.168.117.1': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.225': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/fe80:0:0:0:20c:29ff:fe8f:3fdd%ens192': Le réseau n'est pas accessible (connect failed) 2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.196': Connexion refusée (Connection refused) 2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused) ________________________________ L'intégrité de ce message n'étant pas assurée sur internet, la société expéditrice ne peut être tenue responsable de son contenu ni de ses pièces jointes. Toute utilisation ou diffusion non autorisée est interdite. Si vous n'êtes pas destinataire de ce message, merci de le détruire et d'avertir l'expéditeur. The integrity of this message cannot be guaranteed on the Internet. The company that sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized use or dissemination is prohibited. If you are not the intended recipient of this message, then please delete it and notify the sender.