Oops... false joy. It does in fact start another container, but that container exits immediately because the job is not submitted to it but to the streaming one.
Log details:

Command =
# JVM_ARGS = -DCluster.Parallelisme=150 -Drecovery.mode=standalone
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 --class com.bouygtel.kubera.main.segstage.MainGeoSegStage /home/voyager/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar -j /home/voyager/KBR/GOS/log -c /home/voyager/KBR/GOS/cfg/KBR_GOS_Config.cfg

Log =
Found YARN properties file /tmp/.yarn-properties-voyager
YARN properties set default parallelism to 24
Using JobManager address from YARN properties bt1shli3.bpa.bouyguestelecom.fr/172.21.125.28:36700
YARN cluster mode detected. Switching Log4j output to console
11:39:18,192 INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://h1r1dn02.bpa.bouyguestelecom.fr:8188/ws/v1/timeline/
11:39:18,349 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at h1r1nn01.bpa.bouyguestelecom.fr/172.21.125.3:8050
11:39:18,504 INFO  org.apache.flink.client.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.FlinkYarnClient to locate the jar
11:39:18,513 INFO  org.apache.flink.yarn.FlinkYarnClient - Using values:
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - TaskManager count = 48
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - JobManager memory = 1024
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - TaskManager memory = 5120
11:39:18,641 WARN  org.apache.flink.yarn.FlinkYarnClient - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '2048'. Please increase the memory size. YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
11:39:19,102 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/lib/flink-dist_2.11-0.10.0.jar to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-dist_2.11-0.10.0.jar
11:39:19,653 INFO  org.apache.flink.yarn.Utils - Copying from /usr/lib/flink/conf/flink-conf.yaml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-conf.yaml
11:39:19,667 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/conf/logback.xml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/logback.xml
11:39:19,679 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/conf/log4j.properties to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/log4j.properties
11:39:19,698 INFO  org.apache.flink.yarn.FlinkYarnClient - Submitting application master application_1449127732314_0046
11:39:19,723 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1449127732314_0046
11:39:19,723 INFO  org.apache.flink.yarn.FlinkYarnClient - Waiting for the cluster to be allocated
11:39:19,725 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:20,727 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:21,728 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:22,730 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:23,731 INFO  org.apache.flink.yarn.FlinkYarnClient - YARN application has been deployed successfully.
11:39:23,734 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start actor system.
11:39:24,192 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start application client.
YARN cluster started
JobManager web interface address http://h1r1nn01.bpa.bouyguestelecom.fr:8088/proxy/application_1449127732314_0046/
Waiting until all TaskManagers have connected
11:39:24,202 INFO  org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID null.
No status updates from the YARN cluster received so far. Waiting ...
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID null.
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
11:39:24,210 INFO  org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@172.21.125.16:59907/user/jobmanager.
11:39:24,377 INFO  org.apache.flink.yarn.ApplicationClient - Successfully registered at the JobManager Actor[akka.tcp://flink@172.21.125.16:59907/user/jobmanager#-801507205]
TaskManager status (0/48)    (repeated 16 times)
TaskManager status (12/48)   (repeated 4 times)
TaskManager status (46/48)   (repeated 4 times)
All TaskManagers are connected
Using the parallelism provided by the remote cluster (192). To use another parallelism, set it at the ./bin/flink client.
12/03/2015 11:39:55  Job execution switched to status RUNNING.
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to SCHEDULED
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to DEPLOYING

=> The job starts. Then it crashes:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1) (5/150)) @ (unassigned) - [SCHEDULED] > with groupID < 7b9e554a93d3ea946d13d239a99bb6ae > in sharing group < SlotSharingGroup [0c9285747d113d8dd85962602b674497, 9f30db9a30430385e1cd9d0f5010ed9e, 36b825566212059be3f888e3bbdf0d96, f95ba68c3916346efe497b937393eb49, e73522cce11e699022c285180fd1024d, 988b776310ef3d8a2a3875227008a30e, 7b9e554a93d3ea946d13d239a99bb6ae, 08af3a01b9cb49b76e6aedcd57d57788, 3f91660c6ab25f0f77d8e55d54397b01] >. Resources available to scheduler: Number of instances=6, total number of slots=24, available slots=0

So the scheduler claims I have only 24 slots on my 48-container cluster: 6 instances x 4 slots, i.e. the existing streaming session, instead of the 48 x 4 = 192 slots of the batch cluster that was just deployed!

-----Original Message-----
From: LINZ, Arnaud
Sent: Thursday, December 3, 2015 11:26
To: user@flink.apache.org
Subject: RE: HA Mode and standalone containers compatibility ?

Hi,

The batch job does not need to be HA. I stopped everything, cleaned the temp files, added -Drecovery.mode=standalone, and it seems to work now! Strange, but good enough for me for now.

Thanks,
Arnaud

-----Original Message-----
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Thursday, December 3, 2015 11:11
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hey Arnaud,

thanks for reporting this. I think Till's suggestion will help to debug this (checking whether a second YARN application has been started)...

You don't want to run the batch application in HA mode, correct? It sounds like the batch job is submitted with the same config keys. Could you start the batch job explicitly with -Drecovery.mode=standalone?

If you do want the batch job to be HA as well, you have to configure separate ZooKeeper root paths:

recovery.zookeeper.path.root: /flink-streaming-1   # for the streaming session
recovery.zookeeper.path.root: /flink-batch         # for the batch session

– Ufuk
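
[If both clusters are meant to be HA, here is a minimal sketch of how the two root paths from Ufuk's suggestion could be passed, reusing only the -D mechanisms already shown in this thread (dynamic properties on yarn-session.sh for the streaming session, the JVM_ARGS variable for the per-job batch submission). The session size (-n 6 -s 4 mirrors the 6 x 4 slots reported above), the path names, and the <main-class>/<jar> placeholders are illustrative, not a confirmed setup.]

# Long-lived streaming session with its own ZooKeeper root path
./bin/yarn-session.sh -n 6 -s 4 \
  -Drecovery.mode=zookeeper \
  -Drecovery.zookeeper.quorum=h1r1en01:2181 \
  -Drecovery.zookeeper.path.root=/flink-streaming-1 \
  -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/

# Per-job batch cluster: same HA settings, but a separate root path
JVM_ARGS="-Drecovery.mode=zookeeper -Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink-batch -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/" \
./bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 --class <main-class> <jar>
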
> On 03 Dec 2015, at 11:01, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> Yes, it does interfere; I do have additional task managers. My batch
> application shows up in my streaming cluster's Flink GUI instead of creating
> its own container with its own GUI, despite the -m yarn-cluster option.
>
> From: Till Rohrmann [mailto:trohrm...@apache.org]
> Sent: Thursday, December 3, 2015 10:36
> To: user@flink.apache.org
> Subject: Re: HA Mode and standalone containers compatibility ?
>
> Hi Arnaud,
>
> as long as you don't have HA activated for your batch jobs, HA shouldn't have
> an influence on the batch execution. If it interferes, then you should see
> additional task managers connected to the streaming cluster when you execute
> the batch job. Could you check that? Furthermore, could you check whether a
> second YARN application is actually started when you run the batch jobs?
>
> Cheers,
> Till
>
> On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
> Hello,
>
> I have both streaming applications and batch applications. Since the memory
> needs are not the same, I was using a long-living container for my streaming
> apps and new short-lived containers for hosting each batch execution.
>
> For that, I submit streaming jobs with "flink run" and batch jobs with
> "flink run -m yarn-cluster".
>
> This was working fine until I turned ZooKeeper HA mode on for my streaming
> applications.
>
> Even though I set it up not in the flink yaml configuration file but with -D
> options on the yarn_session.sh command line, my batch jobs now try to run in
> the streaming container and fail because of the lack of resources.
>
> My HA options are:
>
> -Dyarn.application-attempts=10 -Drecovery.mode=zookeeper
> -Drecovery.zookeeper.quorum=h1r1en01:2181
> -Drecovery.zookeeper.path.root=/flink -Dstate.backend=filesystem
> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints
> -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/
>
> Am I missing something?
>
> Best regards,
> Arnaud
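
[As a reference point, here is a consolidated sketch of the split setup described in this first message: one long-lived HA streaming session plus a short-lived per-job cluster for each batch run. It uses only the options quoted in this thread; the script name is the standard Flink one and the jar names are placeholders, so treat this as an assumed layout rather than the exact commands used.]

# Long-lived streaming session, started with the HA options above
./bin/yarn-session.sh \
  -Dyarn.application-attempts=10 \
  -Drecovery.mode=zookeeper \
  -Drecovery.zookeeper.quorum=h1r1en01:2181 \
  -Drecovery.zookeeper.path.root=/flink \
  -Dstate.backend=filesystem \
  -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints \
  -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/

# Streaming jobs attach to that session
./bin/flink run <streaming-job.jar>

# Batch jobs are supposed to get their own short-lived YARN cluster
./bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 <batch-job.jar>
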