spark.streaming.concurrentJobs takes an integer value, not a boolean. If you
set it to 2, then 2 jobs will run in parallel. The default value is 1,
meaning the next job will start only once the current one completes.
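
For example, here is a minimal sketch of setting it (the app name and the
10-second batch interval are illustrative placeholders, not values from this
thread):

    // Minimal sketch: allow two streaming jobs to run concurrently.
    // The app name and batch interval below are illustrative placeholders.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ConcurrentJobsExample")        // hypothetical app name
      .set("spark.streaming.concurrentJobs", "2") // default is "1"
    val ssc = new StreamingContext(conf, Seconds(10))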


> Actually, in the current implementation of Spark Streaming and under
> default configuration, only one job is active (i.e. under execution) at any
> point of time. So if one batch's processing takes longer than 10 seconds,
> then the next batch's jobs will stay queued.
> This can be changed with an experimental Spark property
> "spark.streaming.concurrentJobs" which is by default set to 1. It's not
> currently documented (maybe I should add it).
> The reason it is set to 1 is that concurrent jobs can potentially lead to
> weird sharing of resources, which can make it hard to debug whether
> there are sufficient resources in the system to process the ingested data
> fast enough. With only 1 job running at a time, it is easy to see that if
> batch processing time < batch interval, then the system will be stable.
> Granted, this may not be the most efficient use of resources under
> certain conditions. We definitely hope to improve this in the future.


Copied from TD's answer on SO:
<http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>
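
To make the stability condition concrete with the numbers from this thread:
with a 1-second batch interval and ~2 minutes of processing per batch,
roughly 120 new batches arrive while a single batch is being processed, so
the queue grows by about 119 batches per completed job and the application
can never catch up.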

Non-receiver based streaming includes, for example, the fileStream and the
Kafka direct stream (createDirectStream) approaches. You can read a bit more
about it here:
https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
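
As a rough sketch of the direct (receiver-less) Kafka approach from that
page, reusing the ssc from the sketch above (the broker address and topic
name are hypothetical placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct (non-receiver) Kafka stream, per the 1.3.1 integration guide.
    // "broker1:9092" and "mytopic" are hypothetical placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("mytopic")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)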

Thanks
Best Regards

On Tue, May 19, 2015 at 2:13 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks Akhil.
> When I don't set spark.streaming.concurrentJobs to true, will all the
> pending jobs start one by one after the first job completes, or does it not
> create jobs which could not be started at their desired interval?
>
> And what's the difference and usage of receiver vs non-receiver based
> streaming? Is there any documentation for that?
>
> On Tue, May 19, 2015 at 1:35 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> By default a single job runs at a time (you can also configure
>> spark.streaming.concurrentJobs to run jobs in parallel, although that is
>> not recommended in production).
>>
>> Now, with your batch duration being 1 sec and your processing time being
>> 2 minutes: if you are using receiver-based streaming, those receivers will
>> keep on receiving data while the job is running. That data accumulates in
>> memory, and if you set the StorageLevel to MEMORY_ONLY you can end up with
>> block-not-found exceptions, as Spark drops blocks which are yet to be
>> processed in order to accommodate new ones. If you are using a
>> non-receiver based approach, you will not have this problem of dropped
>> blocks.
>>
>> Ideally, if your data is small and you have enough memory to hold it,
>> then it will run smoothly without any issues.
>>
>> Thanks
>> Best Regards
>>
>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> What happens in a streaming application if one job is not yet finished
>>> when the stream interval is reached? Does it start the next job, or wait
>>> for the first to finish while the rest of the jobs keep accumulating in a
>>> queue?
>>>
>>>
>>> Say I have a streaming application with a stream interval of 1 sec, but
>>> my job takes 2 min to process a 1-sec stream. What will happen? At any
>>> time will there be only one job running, or multiple?
>>>
>>>
>>
>
