If your Kafka topic has 4 partitions and you specify 4 receivers, the messages from each partition are received by a dedicated receiver, so your receiving parallelism is defined by the number of partitions of your topic. Every receiver task will be scheduled evenly among the nodes in your cluster; there was a JIRA fixed in Spark 1.5 which ensures even distribution of receivers.
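For illustration, here is a minimal sketch of that pattern using the built-in receiver-based API (KafkaUtils.createStream), creating one stream per partition and unioning them, as Cody suggests further down the thread. The ZooKeeper quorum, consumer group, and topic name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverParallelismSketch")
val ssc = new StreamingContext(conf, Seconds(3))

// One receiver per Kafka partition (4 here); each receiver is a long-running
// task that can be scheduled on a different executor.
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group",
    Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}

// Union the per-receiver streams into a single DStream before processing.
val unified = ssc.union(streams)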
Now, RDD parallelism (i.e., parallelism while processing your RDD) is controlled by your block interval and batch interval. If your block interval is 200 ms, there will be 5 blocks per second. If your batch interval is 3 seconds, there will be 15 blocks per batch. Every batch is one RDD, so your RDD will have 15 partitions, which will be honored while processing the RDD (a short sketch illustrating this arithmetic appears at the end of this mail).

Regards,
Dibyendu

On Fri, Oct 2, 2015 at 9:40 PM, <nib...@free.fr> wrote:

> Ok so if I set for example 4 receivers (the number of nodes), how will the
> RDD be distributed over the nodes/cores?
> For example, in my case I have 4 nodes (with 2 cores each).
>
> Tks
> Nicolas
>
> ----- Original Message -----
> From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
> To: nib...@free.fr
> Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015 18:01:59
> Subject: Re: Spark Streaming over YARN
>
> Hi,
>
> If you need to use the receiver-based approach, you can try this one:
> https://github.com/dibbhatt/kafka-spark-consumer
>
> This is also part of Spark packages:
> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>
> You just need to specify the number of receivers you want for the desired
> parallelism while receiving, and the rest will be taken care of by the
> ReceiverLauncher.
>
> This low-level receiver will give better parallelism both on receiving and
> on processing the RDD.
>
> The default receiver-based API (KafkaUtils.createStream) uses the Kafka
> high-level API, which has serious issues when used in production.
>
> Regards,
>
> Dibyendu
>
> On Fri, Oct 2, 2015 at 9:22 PM, <nib...@free.fr> wrote:
>
> From my understanding, as soon as I use YARN I don't need to manage
> parallelism (at least for the RDD treatment).
> I don't want to use the direct stream as I would have to manage the offset
> positioning myself (in order to be able to restart from the last offset
> processed after a Spark job failure).
>
> ----- Original Message -----
> From: "Cody Koeninger" <c...@koeninger.org>
> To: "Nicolas Biau" <nib...@free.fr>
> Cc: "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015 17:43:41
> Subject: Re: Spark Streaming over YARN
>
> If you're using the receiver-based implementation and want more
> parallelism, you have to create multiple streams and union them together.
>
> Or use the direct stream.
>
> On Fri, Oct 2, 2015 at 10:40 AM, <nib...@free.fr> wrote:
>
> Hello,
> I have a job receiving data from Kafka (4 partitions) and persisting the
> data into MongoDB.
> It works fine, but when I deploy it on a YARN cluster (4 nodes with 2
> cores each), only one node receives all the Kafka partitions and only one
> node processes my RDD treatment (the foreach function).
> How can I force YARN to use all the nodes and cores to process the data
> (receiver & RDD treatment)?
>
> Tks a lot
> Nicolas
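As promised above, a minimal sketch of the block/batch interval arithmetic; the application name is a placeholder, and 200 ms is also Spark's default block interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalSketch")
  // One block is generated every 200 ms per receiver.
  .set("spark.streaming.blockInterval", "200ms")

// One batch every 3 seconds.
val ssc = new StreamingContext(conf, Seconds(3))

// Blocks per batch = 3000 ms / 200 ms = 15, so the batch RDD from a single
// receiver has 15 partitions, and processing it runs 15 tasks.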