Thanks, Akhil. So what do folks typically do to grow or shrink capacity? Do you plug in some cluster auto-scaling solution to make this elastic?
Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario where your system's input can be unpredictable, based on users' activity?

> On May 17, 2015, at 11:04 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> With receiver-based streaming, you can actually specify
> spark.streaming.blockInterval, which is the interval at which the receiver
> will fetch data from the source. The default value is 200ms, and hence if
> your batch duration is 1 second, it will produce 5 blocks of data. And yes,
> with Spark Streaming, when your processing time goes beyond your batch
> duration and you have higher data consumption, you will overwhelm the
> receiver's memory and it will throw block-not-found exceptions.
>
> Thanks
> Best Regards
>
>> On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>> I keep hearing the argument that the way Discretized Streams work with
>> Spark Streaming is a lot more of a batch processing algorithm than true
>> streaming. For streaming, one would expect a new item, e.g. in a Kafka
>> topic, to be available to the streaming consumer immediately.
>>
>> With discretized streams, streaming is done in batch intervals, i.e. the
>> consumer has to wait for the interval to elapse to get at the new items.
>> If one wants to reduce latency, it seems the only way to do this would be
>> by reducing the batch interval window. However, that may lead to a great
>> deal of churn, with many requests going into Kafka from the consumers,
>> potentially with no results whatsoever as there's nothing new in the
>> topic at the moment.
>>
>> Is there a counter-argument to this reasoning? What are some of the
>> general approaches to reduce latency folks might recommend? Or, perhaps
>> there are ways of dealing with this at the streaming API level?
>>
>> If latency is of great concern, is it better to look into streaming from
>> something like Flume, where data is pushed to consumers rather than pulled
>> by them? Are there techniques, in that case, to ensure the consumers don't
>> get overwhelmed with new data?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-reducing-latency-tp22922.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
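[Editor's note: Akhil's block arithmetic above is easy to sanity-check. The sketch below is not Spark API, just the back-of-envelope math the thread describes; the function name is illustrative.]

```python
def blocks_per_batch(batch_duration_ms: int, block_interval_ms: int = 200) -> int:
    """Receiver-based streaming cuts incoming data into one block per
    spark.streaming.blockInterval; each block becomes one partition
    (and hence one task) of the batch's RDD."""
    if block_interval_ms <= 0 or batch_duration_ms % block_interval_ms != 0:
        raise ValueError("batch duration should be a multiple of the block interval")
    return batch_duration_ms // block_interval_ms

# 1 s batch with the default 200 ms blockInterval -> 5 blocks, as Akhil notes
print(blocks_per_batch(1000))      # 5
# Shrinking blockInterval raises parallelism but also per-batch task overhead
print(blocks_per_batch(1000, 50))  # 20
```

On the "don't overwhelm the receivers" question: the knobs that exist in receiver-based Spark Streaming are rate controls rather than auto-scaling, namely spark.streaming.receiver.maxRate (caps records per second per receiver) and, from Spark 1.5 onward, spark.streaming.backpressure.enabled, which adjusts the ingest rate from observed batch processing times.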