Thanks, Akhil. So what do folks typically do to grow or shrink capacity? Do you plug in some cluster auto-scaling solution to make this elastic?
Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario where your system's input can be unpredictable, based on users' activity?

> On May 17, 2015, at 11:04 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> With receiver-based streaming, you can actually specify
> spark.streaming.blockInterval, which is the interval at which the receiver
> will fetch data from the source. The default value is 200ms, and hence if
> your batch duration is 1 second, it will produce 5 blocks of data. And yes,
> with Spark Streaming, when your processing time goes beyond your batch
> duration and you have higher data consumption, you will overwhelm the
> receiver's memory and it will throw block-not-found exceptions.
>
> Thanks
> Best Regards
>
>> On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>> I keep hearing the argument that the way Discretized Streams work with
>> Spark Streaming is a lot more of a batch processing algorithm than true
>> streaming. For streaming, one would expect a new item, e.g. in a Kafka
>> topic, to be available to the streaming consumer immediately.
>>
>> With discretized streams, streaming is done in batch intervals, i.e. the
>> consumer has to wait for the interval to elapse to get at the new items.
>> If one wants to reduce latency, it seems the only way to do this would be
>> by reducing the batch interval window. However, that may lead to a great
>> deal of churn, with many requests going into Kafka from the consumers,
>> potentially with no results whatsoever as there's nothing new in the
>> topic at the moment.
>>
>> Is there a counter-argument to this reasoning? What are some of the
>> general approaches to reduce latency folks might recommend? Or, perhaps
>> there are ways of dealing with this at the streaming API level?
>>
>> If latency is of great concern, is it better to look into streaming from
>> something like Flume, where data is pushed to consumers rather than pulled
>> by them? Are there techniques, in that case, to ensure the consumers don't
>> get overwhelmed with new data?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-reducing-latency-tp22922.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
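[Editor's note: Akhil's block arithmetic above is easy to sanity-check. The sketch below is not Spark API, just the back-of-envelope math the thread describes; the function name is illustrative.]

```python
def blocks_per_batch(batch_duration_ms: int, block_interval_ms: int = 200) -> int:
    """Receiver-based streaming cuts incoming data into one block per
    spark.streaming.blockInterval; each block becomes one partition
    (and hence one task) of the batch's RDD."""
    if block_interval_ms <= 0 or batch_duration_ms % block_interval_ms != 0:
        raise ValueError("batch duration should be a multiple of the block interval")
    return batch_duration_ms // block_interval_ms

# 1 s batch with the default 200 ms blockInterval -> 5 blocks, as Akhil notes
print(blocks_per_batch(1000))      # 5
# Shrinking blockInterval raises parallelism but also per-batch task overhead
print(blocks_per_batch(1000, 50))  # 20
```

On the "don't overwhelm the receivers" question: the knobs that exist in receiver-based Spark Streaming are rate controls rather than auto-scaling, namely spark.streaming.receiver.maxRate (caps records per second per receiver) and, from Spark 1.5 onward, spark.streaming.backpressure.enabled, which adjusts the ingest rate from observed batch processing times.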