Oops... false joy. It does in fact start another container, but that container exits immediately because the job is not submitted to it but to the streaming one.
Log details:

Command =
# JVM_ARGS = -DCluster.Parallelisme=150 -Drecovery.mode=standalone
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 --class com.bouygtel.kubera.main.segstage.MainGeoSegStage /home/voyager/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar -j /home/voyager/KBR/GOS/log -c /home/voyager/KBR/GOS/cfg/KBR_GOS_Config.cfg

Log =
Found YARN properties file /tmp/.yarn-properties-voyager
YARN properties set default parallelism to 24
Using JobManager address from YARN properties bt1shli3.bpa.bouyguestelecom.fr/172.21.125.28:36700
YARN cluster mode detected. Switching Log4j output to console
11:39:18,192 INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://h1r1dn02.bpa.bouyguestelecom.fr:8188/ws/v1/timeline/
11:39:18,349 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at h1r1nn01.bpa.bouyguestelecom.fr/172.21.125.3:8050
11:39:18,504 INFO  org.apache.flink.client.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.FlinkYarnClient to locate the jar
11:39:18,513 INFO  org.apache.flink.yarn.FlinkYarnClient - Using values:
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - TaskManager count = 48
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - JobManager memory = 1024
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient - TaskManager memory = 5120
11:39:18,641 WARN  org.apache.flink.yarn.FlinkYarnClient - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '2048'. Please increase the memory size. YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
11:39:19,102 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/lib/flink-dist_2.11-0.10.0.jar to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-dist_2.11-0.10.0.jar
11:39:19,653 INFO  org.apache.flink.yarn.Utils - Copying from /usr/lib/flink/conf/flink-conf.yaml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-conf.yaml
11:39:19,667 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/conf/logback.xml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/logback.xml
11:39:19,679 INFO  org.apache.flink.yarn.Utils - Copying from file:/usr/lib/flink/conf/log4j.properties to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/log4j.properties
11:39:19,698 INFO  org.apache.flink.yarn.FlinkYarnClient - Submitting application master application_1449127732314_0046
11:39:19,723 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1449127732314_0046
11:39:19,723 INFO  org.apache.flink.yarn.FlinkYarnClient - Waiting for the cluster to be allocated
11:39:19,725 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:20,727 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:21,728 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:22,730 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
11:39:23,731 INFO  org.apache.flink.yarn.FlinkYarnClient - YARN application has been deployed successfully.
11:39:23,734 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start actor system.
11:39:24,192 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start application client.
YARN cluster started
JobManager web interface address http://h1r1nn01.bpa.bouyguestelecom.fr:8088/proxy/application_1449127732314_0046/
Waiting until all TaskManagers have connected
11:39:24,202 INFO  org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID null.
No status updates from the YARN cluster received so far. Waiting ...
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID null.
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
11:39:24,210 INFO  org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@172.21.125.16:59907/user/jobmanager.
11:39:24,377 INFO  org.apache.flink.yarn.ApplicationClient - Successfully registered at the JobManager Actor[akka.tcp://flink@172.21.125.16:59907/user/jobmanager#-801507205]
TaskManager status (0/48)    (repeated 16 times)
TaskManager status (12/48)   (repeated 4 times)
TaskManager status (46/48)   (repeated 4 times)
All TaskManagers are connected
Using the parallelism provided by the remote cluster (192). To use another parallelism, set it at the ./bin/flink client.
12/03/2015 11:39:55  Job execution switched to status RUNNING.
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to SCHEDULED
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to DEPLOYING

=> The job starts. Then it crashes:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1) (5/150)) @ (unassigned) - [SCHEDULED] > with groupID < 7b9e554a93d3ea946d13d239a99bb6ae > in sharing group < SlotSharingGroup [0c9285747d113d8dd85962602b674497, 9f30db9a30430385e1cd9d0f5010ed9e, 36b825566212059be3f888e3bbdf0d96, f95ba68c3916346efe497b937393eb49, e73522cce11e699022c285180fd1024d, 988b776310ef3d8a2a3875227008a30e, 7b9e554a93d3ea946d13d239a99bb6ae, 08af3a01b9cb49b76e6aedcd57d57788, 3f91660c6ab25f0f77d8e55d54397b01] >. Resources available to scheduler: Number of instances=6, total number of slots=24, available slots=0

So the scheduler claims I have only 24 slots on my 48-container cluster: 6 instances x 4 slots, i.e. the existing streaming session, instead of the 48 x 4 = 192 slots of the batch cluster that was just deployed!

-----Original Message-----
From: LINZ, Arnaud
Sent: Thursday, December 3, 2015 11:26
To: user@flink.apache.org
Subject: RE: HA Mode and standalone containers compatibility ?

Hi,

The batch job does not need to be HA. I stopped everything, cleaned the temp files, added -Drecovery.mode=standalone, and it seems to work now! Strange, but good enough for me for now.

Thanks,
Arnaud

-----Original Message-----
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Thursday, December 3, 2015 11:11
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hey Arnaud,

thanks for reporting this. I think Till's suggestion will help to debug this (checking whether a second YARN application has been started)...

You don't want to run the batch application in HA mode, correct? It sounds like the batch job is submitted with the same config keys. Could you start the batch job explicitly with -Drecovery.mode=standalone?

If you do want the batch job to be HA as well, you have to configure separate ZooKeeper root paths:

recovery.zookeeper.path.root: /flink-streaming-1   # for the streaming session
recovery.zookeeper.path.root: /flink-batch         # for the batch session

– Ufuk
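
[If both clusters are meant to be HA, here is a minimal sketch of how the two root paths from Ufuk's suggestion could be passed, reusing only the -D mechanisms already shown in this thread (dynamic properties on yarn-session.sh for the streaming session, the JVM_ARGS variable for the per-job batch submission). The session size (-n 6 -s 4 mirrors the 6 x 4 slots reported above), the path names, and the <main-class>/<jar> placeholders are illustrative, not a confirmed setup.]

# Long-lived streaming session with its own ZooKeeper root path
./bin/yarn-session.sh -n 6 -s 4 \
  -Drecovery.mode=zookeeper \
  -Drecovery.zookeeper.quorum=h1r1en01:2181 \
  -Drecovery.zookeeper.path.root=/flink-streaming-1 \
  -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/

# Per-job batch cluster: same HA settings, but a separate root path
JVM_ARGS="-Drecovery.mode=zookeeper -Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink-batch -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/" \
./bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 --class <main-class> <jar>
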
> On 03 Dec 2015, at 11:01, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> Yes, it does interfere; I do have additional task managers. My batch
> application shows up in my streaming cluster's Flink GUI instead of creating
> its own container with its own GUI, despite the -m yarn-cluster option.
>
> From: Till Rohrmann [mailto:trohrm...@apache.org]
> Sent: Thursday, December 3, 2015 10:36
> To: user@flink.apache.org
> Subject: Re: HA Mode and standalone containers compatibility ?
>
> Hi Arnaud,
>
> as long as you don't have HA activated for your batch jobs, HA shouldn't have
> an influence on the batch execution. If it interferes, then you should see
> additional task managers connected to the streaming cluster when you execute
> the batch job. Could you check that? Furthermore, could you check whether a
> second YARN application is actually started when you run the batch jobs?
>
> Cheers,
> Till
>
> On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
> Hello,
>
> I have both streaming applications and batch applications. Since the memory
> needs are not the same, I was using a long-living container for my streaming
> apps and new short-lived containers for hosting each batch execution.
>
> For that, I submit streaming jobs with "flink run" and batch jobs with
> "flink run -m yarn-cluster".
>
> This was working fine until I turned ZooKeeper HA mode on for my streaming
> applications.
>
> Even though I set it up not in the flink yaml configuration file but with -D
> options on the yarn_session.sh command line, my batch jobs now try to run in
> the streaming container and fail because of the lack of resources.
>
> My HA options are:
>
> -Dyarn.application-attempts=10 -Drecovery.mode=zookeeper
> -Drecovery.zookeeper.quorum=h1r1en01:2181
> -Drecovery.zookeeper.path.root=/flink -Dstate.backend=filesystem
> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints
> -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/
>
> Am I missing something?
>
> Best regards,
> Arnaud
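
[As a reference point, here is a consolidated sketch of the split setup described in this first message: one long-lived HA streaming session plus a short-lived per-job cluster for each batch run. It uses only the options quoted in this thread; the script name is the standard Flink one and the jar names are placeholders, so treat this as an assumed layout rather than the exact commands used.]

# Long-lived streaming session, started with the HA options above
./bin/yarn-session.sh \
  -Dyarn.application-attempts=10 \
  -Drecovery.mode=zookeeper \
  -Drecovery.zookeeper.quorum=h1r1en01:2181 \
  -Drecovery.zookeeper.path.root=/flink \
  -Dstate.backend=filesystem \
  -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints \
  -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/

# Streaming jobs attach to that session
./bin/flink run <streaming-job.jar>

# Batch jobs are supposed to get their own short-lived YARN cluster
./bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 <batch-job.jar>
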