On Tue, May 19, 2015 at 8:10 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API
> and non-receiver-based streaming uses the low-level API.
>
> 1. In high-level receiver-based streaming, does it register consumers at each
> job start (whenever a new job is launched by the streaming application, say
> every second)?

-> Receiver-based streaming will always have the receiver running in parallel
while your job is running. So by default, every 200ms
(spark.streaming.blockInterval) the receiver will generate a block of data read
from Kafka.

> 2. Will the number of executors in high-level receiver-based jobs always equal
> the number of partitions in the topic?

-> Not sure where you came up with this. For the non-receiver-based one, I think
the number of partitions in Spark will be equal to the number of Kafka
partitions for the given topic.

> 3. In the high-level receiver-based approach, will data from a single topic be
> consumed by executors in parallel, or does only one receiver consume it in
> multiple threads and assign it to executors?

-> They will consume the data in parallel. For the receiver-based approach, you
can actually specify the number of receivers that you want to spawn for
consuming the messages.

> On Tue, May 19, 2015 at 2:38 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> spark.streaming.concurrentJobs takes an integer value, not a boolean. If you
>> set it to 2, then 2 jobs will run in parallel. The default value is 1, and the
>> next job will start once the current one completes.
>>
>>> Actually, in the current implementation of Spark Streaming and under the
>>> default configuration, only one job is active (i.e. under execution) at any
>>> point of time. So if one batch's processing takes longer than 10 seconds,
>>> then the next batch's jobs will stay queued.
>>> This can be changed with an experimental Spark property,
>>> "spark.streaming.concurrentJobs", which is by default set to 1. It's not
>>> currently documented (maybe I should add it).
>>> The reason it is set to 1 is that concurrent jobs can potentially lead to
>>> weird sharing of resources, which can make it hard to debug whether there
>>> are sufficient resources in the system to process the ingested data fast
>>> enough. With only 1 job running at a time, it is easy to see that if batch
>>> processing time < batch interval, then the system will be stable. Granted
>>> that this may not be the most efficient use of resources under certain
>>> conditions. We definitely hope to improve this in the future.
>>
>> Copied from TD's answer on SO
>> <http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>.
>>
>> Non-receiver-based streaming would be, for example, the fileStream and
>> directStream ones. You can read a bit more about it here:
>> https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
>>
>> Thanks
>> Best Regards
>>
>> On Tue, May 19, 2015 at 2:13 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>
>>> Thanks Akhil.
>>> When I don't set spark.streaming.concurrentJobs to true, will all the
>>> pending jobs start one by one after a job completes, or does it not create
>>> jobs that could not be started at their desired interval?
>>>
>>> And what's the difference and usage of receiver vs. non-receiver based
>>> streaming? Is there any documentation for that?
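As a concrete illustration of the receiver-based vs. direct distinction discussed above, here is a minimal Scala sketch against the Spark 1.3.x Kafka integration linked in the thread. The ZooKeeper quorum, broker list, topic name, consumer group, and receiver count are placeholder values, not anything from the original discussion:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-streaming-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receiver-based (high-level consumer API): each createStream call spawns
    // one receiver; several receivers can be unioned to consume in parallel.
    val numReceivers = 3 // placeholder receiver count
    val receiverStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group",
        Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    val receiverBased = ssc.union(receiverStreams).map(_._2)

    // Direct (non-receiver, low-level/simple consumer API): no receivers; each
    // RDD partition maps 1:1 to a Kafka partition of the topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
    val direct = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, Set("my-topic")).map(_._2)

    // Shown side by side only for comparison; a real job would pick one.
    receiverBased.count().print()
    direct.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With the receiver-based variant, the receivers keep pulling data even while a slow batch is running, which is what leads to the dropped-block problem described later in the thread; the direct variant reads from Kafka only when each batch's job actually runs.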
>>>
>>> On Tue, May 19, 2015 at 1:35 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> It will be a single job running at a time by default (you can also
>>>> configure spark.streaming.concurrentJobs to run jobs in parallel, which is
>>>> not recommended in production).
>>>>
>>>> Now, with your batch duration being 1 sec and processing time being 2
>>>> minutes, if you are using receiver-based streaming then those receivers
>>>> will keep on receiving data while the job is running (which will accumulate
>>>> in memory if you set the StorageLevel to MEMORY_ONLY, and you may end up
>>>> with block-not-found exceptions as Spark drops some blocks that are yet to
>>>> be processed in order to accommodate new blocks). If you are using a
>>>> non-receiver-based approach, you will not have this problem of dropped
>>>> blocks.
>>>>
>>>> Ideally, if your data is small and you have enough memory to hold it,
>>>> then it will run smoothly without any issues.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>
>>>>> What happens if, in a streaming application, one job is not yet finished
>>>>> when the stream interval is reached? Does it start the next job, or does
>>>>> it wait for the first to finish while the remaining jobs keep accumulating
>>>>> in a queue?
>>>>>
>>>>> Say I have a streaming application with a stream interval of 1 sec, but
>>>>> my job takes 2 min to process 1 sec of stream; what will happen? At any
>>>>> time will there be only one job running, or multiple?
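Pulling the settings mentioned in the thread together, a minimal Scala sketch of where they would go; the app name, host, and port are placeholders, and the values are illustrative rather than recommendations (as noted above, concurrentJobs > 1 is experimental and not recommended for production):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      // Experimental and undocumented: how many streaming jobs may run at once.
      // Default is 1, so each batch's jobs wait for the previous batch's to finish.
      .set("spark.streaming.concurrentJobs", "2")
      // Interval (ms) at which receivers chop incoming data into blocks
      // (default 200); only relevant for receiver-based streams.
      .set("spark.streaming.blockInterval", "200")

    // 1-second batch interval, as in the scenario discussed above; if a batch
    // takes longer than this to process, later batches queue up (or overlap
    // once concurrentJobs > 1).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source and output so the context has something to run;
    // "stream-host" is hypothetical.
    ssc.socketTextStream("stream-host", 9999).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}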