If your Kafka topic has 4 partitions and you specify 4 receivers, the messages from each partition are received by a dedicated receiver, so your receiving parallelism is defined by the number of partitions of your topic. Every receiver task will be scheduled evenly among the nodes in your cluster; there was a JIRA fixed in Spark 1.5 which ensures even distribution of receivers.
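For illustration, here is a minimal sketch of that pattern using the built-in receiver-based API (KafkaUtils.createStream), creating one stream per partition and unioning them, as Cody suggests further down the thread. The ZooKeeper quorum, consumer group, and topic name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverParallelismSketch")
val ssc = new StreamingContext(conf, Seconds(3))

// One receiver per Kafka partition (4 here); each receiver is a long-running
// task that can be scheduled on a different executor.
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group",
    Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}

// Union the per-receiver streams into a single DStream before processing.
val unified = ssc.union(streams)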
Now, RDD parallelism (i.e., parallelism while processing your RDD) is controlled by your block interval and batch interval. If your block interval is 200 ms, there will be 5 blocks per second. If your batch interval is 3 seconds, there will be 15 blocks per batch. Every batch is one RDD, so your RDD will have 15 partitions, which will be honored while processing the RDD (a short sketch illustrating this arithmetic appears at the end of this mail).

Regards,
Dibyendu

On Fri, Oct 2, 2015 at 9:40 PM, <nib...@free.fr> wrote:

> Ok so if I set for example 4 receivers (the number of nodes), how will the
> RDD be distributed over the nodes/cores?
> For example, in my case I have 4 nodes (with 2 cores each).
>
> Tks
> Nicolas
>
> ----- Original Message -----
> From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
> To: nib...@free.fr
> Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015 18:01:59
> Subject: Re: Spark Streaming over YARN
>
> Hi,
>
> If you need to use the receiver-based approach, you can try this one:
> https://github.com/dibbhatt/kafka-spark-consumer
>
> This is also part of Spark packages:
> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>
> You just need to specify the number of receivers you want for the desired
> parallelism while receiving, and the rest will be taken care of by the
> ReceiverLauncher.
>
> This low-level receiver will give better parallelism both on receiving and
> on processing the RDD.
>
> The default receiver-based API (KafkaUtils.createStream) uses the Kafka
> high-level API, which has serious issues when used in production.
>
> Regards,
>
> Dibyendu
>
> On Fri, Oct 2, 2015 at 9:22 PM, <nib...@free.fr> wrote:
>
> From my understanding, as soon as I use YARN I don't need to manage
> parallelism (at least for the RDD treatment).
> I don't want to use the direct stream as I would have to manage the offset
> positioning myself (in order to be able to restart from the last offset
> processed after a Spark job failure).
>
> ----- Original Message -----
> From: "Cody Koeninger" <c...@koeninger.org>
> To: "Nicolas Biau" <nib...@free.fr>
> Cc: "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015 17:43:41
> Subject: Re: Spark Streaming over YARN
>
> If you're using the receiver-based implementation and want more
> parallelism, you have to create multiple streams and union them together.
>
> Or use the direct stream.
>
> On Fri, Oct 2, 2015 at 10:40 AM, <nib...@free.fr> wrote:
>
> Hello,
> I have a job receiving data from Kafka (4 partitions) and persisting the
> data into MongoDB.
> It works fine, but when I deploy it on a YARN cluster (4 nodes with 2
> cores each), only one node receives all the Kafka partitions and only one
> node processes my RDD treatment (the foreach function).
> How can I force YARN to use all the nodes and cores to process the data
> (receiver & RDD treatment)?
>
> Tks a lot
> Nicolas
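As promised above, a minimal sketch of the block/batch interval arithmetic; the application name is a placeholder, and 200 ms is also Spark's default block interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalSketch")
  // One block is generated every 200 ms per receiver.
  .set("spark.streaming.blockInterval", "200ms")

// One batch every 3 seconds.
val ssc = new StreamingContext(conf, Seconds(3))

// Blocks per batch = 3000 ms / 200 ms = 15, so the batch RDD from a single
// receiver has 15 partitions, and processing it runs 15 tasks.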