Regarding the second question:

I am comparing two situations when processing a KafkaRDD.

Case 1 - When I use foreachPartition to process the Kafka stream, I do not
see any batch timing marker like Time: 1429054870000 ms printed on the
driver console at the start of each stream batch. But each RDD was
processed, and on the web UI I could see jobs being created at each
1-second batch interval.

Case 2 - When I call mapPartitions on the Kafka stream RDD and then call an
action (say print()) at the end of each stream interval, the driver console
does show jobs being created with the batch interval marker
Time: 1429054870000 ms.

Why does no such information appear on the driver console in Case 1?
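A plain-Java sketch (no Spark, illustrative names only) of what seems to be going on: the "Time:" banner is written by the print() output operation itself, not by job scheduling, so a foreachRDD/foreachPartition job still runs every batch but never emits the header.

```java
import java.util.Arrays;
import java.util.List;

public class BannerSketch {
    // Simulates one batch. With usePrintStyleOutput=true we mimic what
    // DStream.print() writes to the driver console; with false we mimic
    // foreachPartition-style processing, which writes nothing there.
    static String processBatch(long batchTimeMs, List<String> records,
                               boolean usePrintStyleOutput) {
        StringBuilder console = new StringBuilder();
        if (usePrintStyleOutput) {
            // print() emits this header before the sampled records
            console.append("-------------------------------------------\n");
            console.append("Time: ").append(batchTimeMs).append(" ms\n");
            console.append("-------------------------------------------\n");
            for (String r : records) {
                console.append(r).append('\n');
            }
        } else {
            for (String r : records) {
                // record is processed on an executor; nothing reaches
                // the driver console
            }
        }
        return console.toString();
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b");
        System.out.print(processBatch(1429054870000L, batch, true));
        System.out.print(processBatch(1429054871000L, batch, false));
    }
}
```

In both cases a job runs per batch (which is why the web UI shows them); only the print()-style path produces driver-console output.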

Thanks
Shushant




On Mon, Jul 13, 2015 at 7:22 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Regarding your first question, having more partitions than you do
> executors usually means you'll have better utilization, because the
> workload will be distributed more evenly.  There's some degree of per-task
> overhead, but as long as you don't have a huge imbalance between number of
> tasks and number of executors that shouldn't be a large problem.
>
> I don't really understand your second question.
>
> On Sat, Jul 11, 2015 at 5:00 AM, Shushant Arora <shushantaror...@gmail.com
> > wrote:
>
>> 1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka
>> partitions in the topic. Say I have 300 partitions in the topic and 10
>> executors, each with 3 cores. Does that mean only 10*3 = 30 partitions are
>> processed at a time? Since executors launch one task per RDD partition, I
>> need 300 tasks in total, but with only 30 cores (10 executors, each with 3
>> cores) those tasks will execute 30 at a time until all 300 finish.
>>
>> So would reducing the number of Kafka partitions to, say, 100 speed up the
>> processing?
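The arithmetic in question 1 can be sketched as a tiny helper (hypothetical code, not a Spark API): with 30 concurrent task slots, tasks run in "waves", and fewer partitions mean fewer waves but more data per task.

```java
public class TaskWaves {
    // Number of scheduling waves needed to run one task per partition
    // across the cluster's concurrent task slots.
    static int waves(int numPartitions, int executors, int coresPerExecutor) {
        int slots = executors * coresPerExecutor;
        // ceiling division: the last wave may be only partially full
        return (numPartitions + slots - 1) / slots;
    }

    public static void main(String[] args) {
        // 300 partitions, 10 executors x 3 cores = 30 slots
        System.out.println(waves(300, 10, 3)); // 10 waves of 30 tasks
        // 100 partitions -> 4 waves, but each task handles ~3x the data,
        // so total time is not necessarily lower
        System.out.println(waves(100, 10, 3));
    }
}
```

This is why, as Cody notes, fewer partitions mainly saves per-task overhead rather than guaranteeing a speedup.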
>>
>> 2. In a Spark Streaming job, when I process the Kafka stream using
>> foreachRDD:
>>
>> directKafkaStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>     public Void call(JavaRDD<String> rdd) {
>>         rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
>>             public void call(Iterator<String> partition) {
>>                 // ...process this partition's records
>>             }
>>         });
>>         return null;
>>     }
>> });
>>
>> Since foreachRDD is an output operation it spawns a Spark job, but these
>> jobs' batch timings do not appear on the driver console the way they do
>> with map and print, i.e.:
>>
>> -------------------------------------------
>> Time: 1429054870000 ms
>> -------------------------------------------
>> ------------------------------------------
>> Time: 1429054871000 ms
>> -------------------------------------------
>>
>> ..................
>>
>> Why is it so?
>>
>>
>> Thanks
>> Shushant
>>
>>
>>
>>
>>
>>
>
