Hello, I am using https://github.com/dibbhatt/kafka-spark-consumer and I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can see only one node receiving the Kafka flow. (I use Spark 1.3.1.)
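Roughly, the setup looks like the minimal sketch below. The property keys, the import paths, and the exact ReceiverLauncher.launch(...) signature are assumptions taken from the project README, so verify them against the repo:

import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Import paths assumed from the kafka-spark-consumer README -- verify:
import consumer.kafka.MessageAndMetadata;
import consumer.kafka.ReceiverLauncher;

public class ReceiverSetup {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kafka-receiver-sketch");
    // 3-second batch interval, matching the discussion below
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(3000));

    // Placeholder connection properties; the exact keys are listed in the README.
    Properties props = new Properties();
    props.put("zookeeper.hosts", "zkhost1,zkhost2");
    props.put("zookeeper.port", "2181");
    props.put("kafka.topic", "mytopic");

    int numberOfReceivers = 4;  // one receiver per Kafka partition

    // Assumed launch(...) signature per the README: returns a single DStream
    // unioned over the 4 receiver tasks.
    JavaDStream<MessageAndMetadata> stream =
        ReceiverLauncher.launch(jsc, props, numberOfReceivers,
            StorageLevel.MEMORY_ONLY());
    stream.print();

    jsc.start();
    jsc.awaitTermination();
  }
}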
Tks
Nicolas

----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
To: nib...@free.fr
Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
Sent: Friday, October 2, 2015 18:21:35
Subject: Re: Spark Streaming over YARN

If your Kafka topic has 4 partitions and you specify 4 receivers, messages from each partition are received by a dedicated receiver, so your receiving parallelism is defined by the number of partitions of your topic. Every receiver task will be scheduled evenly among the nodes in your cluster. There was a JIRA fixed in Spark 1.5 which does an even distribution of receivers.

RDD parallelism (i.e. parallelism while processing your RDD) is controlled by your block interval and batch interval. If your block interval is 200 ms, there will be 5 blocks per second. If your batch interval is 3 seconds, there will be 15 blocks per batch. Every batch is one RDD, so your RDD will have 15 partitions, which is honored when processing the RDD.

Regards,
Dibyendu

On Fri, Oct 2, 2015 at 9:40 PM, <nib...@free.fr> wrote:

Ok, so if I set for example 4 receivers (the number of nodes), how will the RDD be distributed over the nodes/cores? In my case I have 4 nodes (with 2 cores each).

Tks
Nicolas

----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
To: nib...@free.fr
Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
Sent: Friday, October 2, 2015 18:01:59
Subject: Re: Spark Streaming over YARN

Hi,

If you need to use the receiver-based approach, you can try this one: https://github.com/dibbhatt/kafka-spark-consumer

It is also part of Spark Packages: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You just need to specify the number of receivers you want for the desired receiving parallelism, and the rest is taken care of by the ReceiverLauncher. This low-level receiver gives better parallelism both on receiving and on processing the RDD. The default receiver-based API (KafkaUtils.createStream) uses the Kafka high-level consumer API, which has serious issues for production use.

Regards,
Dibyendu

On Fri, Oct 2, 2015 at 9:22 PM, <nib...@free.fr> wrote:

From my understanding, as soon as I use YARN I don't need to manage parallelism myself (at least for the RDD treatment). I don't want to use the direct stream because I would have to manage the offset positioning myself (in order to be able to restart from the last offset processed after a Spark job failure).

----- Original Message -----
From: "Cody Koeninger" <c...@koeninger.org>
To: "Nicolas Biau" <nib...@free.fr>
Cc: "user" <user@spark.apache.org>
Sent: Friday, October 2, 2015 17:43:41
Subject: Re: Spark Streaming over YARN

If you're using the receiver-based implementation and want more parallelism, you have to create multiple streams and union them together. Or use the direct stream.
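A minimal sketch of that multiple-streams approach, using the standard receiver-based KafkaUtils.createStream API from the spark-streaming-kafka artifact; the Zookeeper quorum "zkhost:2181", group id "mygroup", and topic "mytopic" are placeholders:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class UnionedStreams {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kafka-union");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(3000));

    Map<String, Integer> topicMap = new HashMap<>();
    topicMap.put("mytopic", 1);  // 1 consumer thread inside each receiver

    // One receiver-based stream per Kafka partition (4 here); each stream is
    // a separate receiver task that can be placed on a different executor.
    List<JavaPairDStream<String, String>> streams = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      streams.add(KafkaUtils.createStream(jsc, "zkhost:2181", "mygroup", topicMap));
    }

    // Union the 4 streams into one DStream for downstream processing.
    JavaPairDStream<String, String> unioned =
        jsc.union(streams.get(0), streams.subList(1, streams.size()));
    unioned.print();

    jsc.start();
    jsc.awaitTermination();
  }
}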
On Fri, Oct 2, 2015 at 10:40 AM, <nib...@free.fr> wrote:

Hello,
I have a job receiving data from Kafka (4 partitions) and persisting the data into MongoDB. It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores each), only one node receives all the Kafka partitions and only one node processes my RDD treatment (the foreach function). How can I force YARN to use all the nodes and cores to process the data (both the receivers and the RDD treatment)?

Tks a lot
Nicolas
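A closing note on the YARN side of the question: the executor count and cores are fixed at submit time, not balanced automatically by YARN. Below is a minimal configuration sketch assuming the 4-node, 2-core cluster described above; the same values can be passed on the command line as spark-submit --num-executors 4 --executor-cores 2 --executor-memory 2g:

import org.apache.spark.SparkConf;

public class YarnSizing {
  // Ask YARN for one 2-core executor per node, so the 4 receivers can be
  // spread across all 4 nodes: each executor keeps 1 core for its receiver
  // and 1 core for processing.
  public static SparkConf conf() {
    return new SparkConf()
        .setAppName("kafka-streaming")
        .set("spark.executor.instances", "4")  // same as --num-executors 4
        .set("spark.executor.cores", "2")      // same as --executor-cores 2
        .set("spark.executor.memory", "2g")    // placeholder sizing
        // 200 ms blocks x 3 s batch = 15 blocks, i.e. 15 RDD partitions per
        // batch, per the explanation earlier in the thread (value in ms):
        .set("spark.streaming.blockInterval", "200");
  }
}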