[ https://issues.apache.org/jira/browse/FLINK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177709#comment-17177709 ]
Arnaud Linz commented on FLINK-8580:
------------------------------------

Actually, my cluster is a Hadoop (YARN) cluster, not a set of standalone clusters. I am just trying to launch each streaming application in a separate YARN container.

> No easy way (or issues when trying?) to handle multiple yarn sessions and choose at runtime the one to submit an HA streaming job
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8580
>                 URL: https://issues.apache.org/jira/browse/FLINK-8580
>             Project: Flink
>          Issue Type: Improvement
>          Components: Command Line Client, Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.3.2
>            Reporter: Arnaud Linz
>            Priority: Major

Hello,

I am using Flink 1.3.2 and I am struggling to achieve something that should be simple.

For isolation reasons, I want to start multiple long-living YARN session containers (with the same user) and choose at run time, when I start an HA streaming app, which container will hold it.

I start my YARN session with the command-line option:

-Dyarn.properties-file.location=mydir

The session is created and a .yarn-properties-$USER file is generated. I have tried the following to submit my job:

*CASE 1*
*flink-conf.yaml*: yarn.properties-file.location: mydir
*flink run options*: none
* Uses ZooKeeper and works, but I cannot choose the container, as the property file is global.

*CASE 2*
*flink-conf.yaml*: nothing
*flink run options*: -yid applicationId
* Does not use ZooKeeper; tries to connect to the YARN job manager but fails with a "Job submission to the JobManager timed out" error.

*CASE 3*
*flink-conf.yaml*: nothing
*flink run options*: -yid applicationId and -yD with all dynamic properties found in the "dynamicPropertiesString" of the .yarn-properties-$USER file
* Same as case 2.

*CASE 4*
*flink-conf.yaml*: nothing
*flink run options*: -yD yarn.properties-file.location=mydir
* Tries to connect to the local (non-YARN) job manager, and fails.

*CASE 5*
Even weirder:
*flink-conf.yaml*: yarn.properties-file.location: mydir
*flink run options*: -yD yarn.properties-file.location=mydir
* Still tries to connect to the local (non-YARN) job manager!

Without any other solution, I have written a shell script that copies the original content of FLINK_CONF_DIR into a temporary directory, modifies flink-conf.yaml there to set yarn.properties-file.location, and points FLINK_CONF_DIR at that temporary directory before executing flink to submit the job; a sketch of such a wrapper is shown below.
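A minimal sketch of that wrapper, assuming Bash; the session directory argument and all paths are hypothetical placeholders, not taken from the original report:

{code:bash}
#!/usr/bin/env bash
# Hypothetical wrapper: submit a job to a chosen YARN session by pointing a
# private copy of the Flink configuration at that session's properties file.
set -euo pipefail

SESSION_PROPS_DIR="$1"   # e.g. /var/flink/sessions/session1 (hypothetical path)
shift                    # all remaining arguments are handed to "flink run"

# Copy the original configuration into a temporary directory.
TMP_CONF_DIR="$(mktemp -d)"
trap 'rm -rf "$TMP_CONF_DIR"' EXIT
cp -r "$FLINK_CONF_DIR"/. "$TMP_CONF_DIR"

# Replace any existing yarn.properties-file.location with the chosen session's.
sed -i '/^yarn\.properties-file\.location:/d' "$TMP_CONF_DIR/flink-conf.yaml"
echo "yarn.properties-file.location: $SESSION_PROPS_DIR" >> "$TMP_CONF_DIR/flink-conf.yaml"

# Submit with the temporary configuration directory.
FLINK_CONF_DIR="$TMP_CONF_DIR" flink run "$@"
{code}

Each long-living session would then be started with its own directory, e.g. yarn-session.sh ... -Dyarn.properties-file.location=/var/flink/sessions/session1, so that every session writes its .yarn-properties-$USER file to a distinct location, mirroring the start-up option quoted above.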
With this, I am now able to select the container I want, but I think it should be made simpler…

Log extracts:

*CASE 1:*
{noformat}
2018:02:01 15:43:20 - Waiting until all TaskManagers have connected
2018:02:01 15:43:20 - Starting client actor system.
2018:02:01 15:43:20 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:20 - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018:02:01 15:43:20 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018:02:01 15:43:21 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Slf4jLogger started
2018:02:01 15:43:21 - Starting remoting
2018:02:01 15:43:21 - Remoting started; listening on addresses :[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:36340]
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - TaskManager status (2/1)
2018:02:01 15:43:21 - All TaskManagers are connected
2018:02:01 15:43:21 - Submitting job with JobID: f69197b0b80a76319a87bde10c1e3f77. Waiting for job completion.
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Received SubmitJobAndWait(JobGraph(jobId: f69197b0b80a76319a87bde10c1e3f77)) but there is no connection to a JobManager yet.
2018:02:01 15:43:21 - Received job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77).
2018:02:01 15:43:21 - Disconnect from JobManager null.
2018:02:01 15:43:21 - Connect to JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].
2018:02:01 15:43:21 - Connected to JobManager at Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245] with leader session id 388af5b8-5555-4923-8ee4-8a4b9bfbb0b9.
2018:02:01 15:43:21 - Sending message to JobManager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager to submit job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77) and wait for progress
2018:02:01 15:43:21 - Upload jar files to job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:21 - Blob client connecting to akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
2018:02:01 15:43:22 - Submit job to the job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:22 - Job f69197b0b80a76319a87bde10c1e3f77 was successfully submitted to the JobManager akka://flink/deadLetters.
2018:02:01 15:43:22 - 02/01/2018 15:43:22 Job execution switched to status RUNNING.
{noformat}

*CASE 2:*
{noformat}
2018:02:01 15:48:43 - Waiting until all TaskManagers have connected
2018:02:01 15:48:43 - Starting client actor system.
2018:02:01 15:48:43 - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018:02:01 15:48:43 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018:02:01 15:48:43 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970.
2018:02:01 15:48:43 - Slf4jLogger started
2018:02:01 15:48:43 - Starting remoting
2018:02:01 15:48:43 - Remoting started; listening on addresses :[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:34140]
2018:02:01 15:48:43 - TaskManager status (2/1)
2018:02:01 15:48:43 - All TaskManagers are connected
2018:02:01 15:48:43 - Submitting job with JobID: cd3e0e223c57d01d415fe7a6a308576c. Waiting for job completion.
2018:02:01 15:48:43 - Received SubmitJobAndWait(JobGraph(jobId: cd3e0e223c57d01d415fe7a6a308576c)) but there is no connection to a JobManager yet.
2018:02:01 15:48:43 - Received job SND-IMP-SIGNAST (cd3e0e223c57d01d415fe7a6a308576c).
2018:02:01 15:48:43 - Disconnect from JobManager null.
2018:02:01 15:48:43 - Connect to JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].
2018:02:01 15:48:43 - Connected to JobManager at Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245] with leader session id 00000000-0000-0000-0000-000000000000.
2018:02:01 15:48:43 - Sending message to JobManager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager to submit job SND-IMP-SIGNAST (cd3e0e223c57d01d415fe7a6a308576c) and wait for progress
2018:02:01 15:48:43 - Upload jar files to job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:48:43 - Blob client connecting to akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
2018:02:01 15:48:45 - Submit job to the job manager akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:49:45 - Terminate JobClientActor.
2018:02:01 15:49:45 - Disconnect from JobManager Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].
{noformat}

Then:
{noformat}
Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
	at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119)
	at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{noformat}
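The exception text suggests raising the client-side submission timeout. A hedged sketch of how that could be done (the 300 s value is an arbitrary example, assuming the key is not already set; note that in this report the client also ends up with the null leader session id 00000000-…, so a longer timeout alone may not be the fix):

{code:bash}
# Hypothetical tweak: raise the client submission timeout named in the
# exception, in the active Flink configuration, before retrying "flink run".
echo "akka.client.timeout: 300 s" >> "$FLINK_CONF_DIR/flink-conf.yaml"
{code}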
*CASE 3,4:*
{noformat}
2018:02:01 15:35:14 - Starting client actor system.
2018:02:01 15:35:14 - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018:02:01 15:35:14 - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018:02:01 15:35:14 - Retrieved new target address localhost/127.0.0.1:6123.
2018:02:01 15:35:15 - Trying to connect to address localhost/127.0.0.1:6123
2018:02:01 15:35:15 - Failed to connect from address 'elara-edge-u2-n01/10.136.170.196': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/192.168.117.1': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.225': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/fe80:0:0:0:20c:29ff:fe8f:3fdd%ens192': Le réseau n'est pas accessible (connect failed)
2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.196': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/192.168.117.1': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.225': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/fe80:0:0:0:20c:29ff:fe8f:3fdd%ens192': Le réseau n'est pas accessible (connect failed)
2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.196': Connexion refusée (Connection refused)
2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': Connexion refusée (Connection refused)
{noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)