Thanks Akhil and Dibyendu. At a high level, in receiver-based streaming, do the executors run on the receiver nodes themselves to get data locality? Or is the data always transferred to executor nodes, with the executor nodes differing in each run of a job while the receiver nodes remain the same (same machines) throughout the life of the streaming application, unless a node failure happens?
On Tue, May 19, 2015 at 9:29 PM, Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com> wrote:

> Just to add, there is a Receiver-based Kafka consumer which uses the Kafka Low Level Consumer API.
>
> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>
> Regards,
> Dibyendu
>
> On Tue, May 19, 2015 at 9:00 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> On Tue, May 19, 2015 at 8:10 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>
>>> So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API and non-receiver-based streaming uses the low-level API.
>>>
>>> 1. In high-level receiver-based streaming, does it register consumers at each job start (whenever a new job is launched by the streaming application, say every second)?
>>
>> -> Receiver-based streaming will always have the receiver running in parallel while your job is running, so by default every 200ms (spark.streaming.blockInterval) the receiver will generate a block of data which is read from Kafka.
>>
>>> 2. Will the number of executors in high-level receiver-based jobs always equal the number of partitions in the topic?
>>
>> -> Not sure where you came up with this. For the non-receiver-based one, I think the number of partitions in Spark will be equal to the number of Kafka partitions for the given topic.
>>
>>> 3. Will data from a single topic be consumed by executors in parallel, or does only one receiver consume in multiple threads and assign the data to executors in the high-level receiver-based approach?
>>
>> -> They will consume the data in parallel. For the receiver-based approach, you can actually specify the number of receivers that you want to spawn for consuming the messages.
>>
>>> On Tue, May 19, 2015 at 2:38 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> spark.streaming.concurrentJobs takes an integer value, not a boolean. If you set it to 2 then 2 jobs will run in parallel. The default value is 1, and the next job will start once the current one completes.
>>>>
>>>>> Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point of time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.
>>>>> This can be changed with an experimental Spark property "spark.streaming.concurrentJobs" which is by default set to 1. It's not currently documented (maybe I should add it).
>>>>> The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted that this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
>>>>
>>>> Copied from TD's answer written on SO <http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>.
>>>>
>>>> Non-receiver-based streaming is, for example, the fileStream and directStream ones.
>>>> You can read a bit more information here:
>>>> https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, May 19, 2015 at 2:13 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>
>>>>> Thanks Akhil.
>>>>> When I don't set spark.streaming.concurrentJobs to true, will all the pending jobs start one by one after a job completes, or does it not create the jobs which could not be started at their desired interval?
>>>>>
>>>>> And what is the difference and usage of receiver vs non-receiver-based streaming? Is there any documentation for that?
>>>>>
>>>>> On Tue, May 19, 2015 at 1:35 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>>>
>>>>>> It will be a single job running at a time by default (you can also configure spark.streaming.concurrentJobs to run jobs in parallel, which is not recommended to put in production).
>>>>>>
>>>>>> Now, with your batch duration being 1 sec and processing time being 2 minutes, if you are using receiver-based streaming then those receivers will keep on receiving data while the job is running (which will accumulate in memory if you set the StorageLevel to MEMORY_ONLY and end up in block-not-found exceptions, as Spark drops some blocks which are yet to be processed in order to accommodate new blocks). If you are using a non-receiver-based approach, you will not have this problem of dropping blocks.
>>>>>>
>>>>>> Ideally, if your data is small and you have enough memory to hold your data, then it will run smoothly without any issues.
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>>>
>>>>>>> What happens in a streaming application if one job is not yet finished when the stream interval is reached? Does it start the next job, or does it wait for the first to finish while the rest of the jobs keep accumulating in a queue?
>>>>>>>
>>>>>>> Say I have a streaming application with a stream interval of 1 sec, but my job takes 2 min to process the 1 sec stream; what will happen? At any time will there be only one job running, or multiple?
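For reference, here is a minimal Scala sketch pulling together the pieces discussed in the thread above, assuming the Spark 1.3-era spark-streaming-kafka API: the spark.streaming.blockInterval and experimental spark.streaming.concurrentJobs properties, a receiver-based stream with an explicitly chosen number of receivers, and the direct (non-receiver) stream. The topic name, ZooKeeper/broker addresses, receiver count and other values are placeholders for illustration, not anything taken from this thread.

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-streaming-sketch")
      // Receivers cut a block from the buffered data every blockInterval
      // (milliseconds, 200 by default); each block becomes one partition
      // of the batch's RDD.
      .set("spark.streaming.blockInterval", "200")
      // Experimental, undocumented: how many streaming jobs may run at once.
      // With the default of 1, the next batch's jobs wait for the current
      // one, so batches queue up whenever processing time exceeds the
      // batch interval.
      .set("spark.streaming.concurrentJobs", "1")

    // 1-second batch interval, as in the scenario discussed in the thread.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receiver-based (high-level consumer) approach: receivers run for the
    // life of the application; spawn several explicitly and union them to
    // consume the topic in parallel. MEMORY_AND_DISK_SER spills to disk
    // instead of dropping unprocessed blocks when memory fills up
    // (unlike MEMORY_ONLY).
    val numReceivers = 3 // placeholder
    val receiverStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group",
        Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    ssc.union(receiverStreams).map(_._2).count().print()

    // Direct (non-receiver) approach: no long-running receiver; each batch
    // reads offset ranges straight from the brokers, with one Spark
    // partition per Kafka partition of the topic.
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker-host:9092"), Set("my-topic"))
    directStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With the receiver-based approach the parallelism of ingestion comes from how many receivers you spawn and union together, whereas with the direct approach it comes from the Kafka partitions themselves, as noted in the replies above.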