We fix the rate at which the receivers consume at any given point in time. We also have a back-pressure mechanism attached to the receivers, so the application won't simply crash in the "unceremonious way" Evo described. Mesos has some sort of auto-scaling (I read about it somewhere); maybe you can look into that as well.
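For illustration, a minimal Scala sketch of the kind of throttled receiver we mean; fetchNext() is a hypothetical stand-in for whatever source you poll, and the one-second window and rate cap are placeholders:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Sketch: a receiver capped at a fixed number of records per second.
  class ThrottledReceiver(maxRecordsPerSec: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      new Thread("throttled-receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = {} // the receiving thread exits via isStopped()

    private def receive(): Unit = {
      while (!isStopped()) {
        val windowStart = System.currentTimeMillis()
        var count = 0
        // consume at most maxRecordsPerSec records in this one-second window
        while (count < maxRecordsPerSec && !isStopped()) {
          store(fetchNext()) // hand the record to Spark
          count += 1
        }
        // sleep out the rest of the second rather than outrun the job
        val elapsed = System.currentTimeMillis() - windowStart
        if (elapsed < 1000) Thread.sleep(1000 - elapsed)
      }
    }

    // hypothetical: blocks until the next record is available from the source
    private def fetchNext(): String = ???
  }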
Thanks
Best Regards

On Mon, May 18, 2015 at 5:20 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> And if you want to genuinely “reduce the latency” (still within the
> boundaries of the micro-batch), THEN you need to design and finely tune the
> Parallel Programming / Execution Model of your application. The
> objective/metric here is:
>
> a) Consume all data within your selected micro-batch window WITHOUT any
> artificial message rate limits.
>
> b) The above will result in a certain size of DStream RDD per micro-batch.
>
> c) The objective now is to process that RDD WITHIN the time of the
> micro-batch (and also account for temporary message rate spikes etc. which
> may further increase the size of the RDD) – this will avoid any clogging up
> of the app and will process your messages at the lowest latency possible in
> a micro-batch architecture.
>
> d) You achieve the objective stated in c) by designing, varying and
> experimenting with various aspects of the Spark Streaming Parallel
> Programming and Execution Model – e.g. number of receivers, number of
> threads per receiver, number of executors, number of cores, RAM allocated
> to executors, number of RDD partitions (which correspond to the number of
> parallel threads operating on the RDD), etc.
>
> Re the “unceremonious removal of DStream RDDs” from RAM by Spark Streaming
> when the available RAM is exhausted due to a high message rate – which
> crashes your (hence clogged-up) application – the name of the condition is:
>
> Loss was due to java.lang.Exception
> java.lang.Exception: Could not compute split, block
> input-4-1410542878200 not found
>
> From: Evo Eftimov [mailto:evo.efti...@isecc.com]
> Sent: Monday, May 18, 2015 12:13 PM
> To: 'Dmitry Goldenberg'; 'Akhil Das'
> Cc: 'user@spark.apache.org'
> Subject: RE: Spark Streaming and reducing latency
>
> You can use
>
> spark.streaming.receiver.maxRate (default: not set)
>
> Maximum rate (number of records per second) at which each receiver will
> receive data. Effectively, each stream will consume at most this number of
> records per second. Setting this configuration to 0 or a negative number
> will put no limit on the rate. See the deployment guide
> <https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications>
> in the Spark Streaming programming guide for more details.
>
> Another way is to implement a feedback loop in your receivers, monitoring
> the performance metrics of your application/job and, based on that,
> automatically adjusting the receiving rate – BUT all these have nothing to
> do with “reducing the latency”; they simply prevent your application/job
> from clogging up. The nastier effect of that is when Spark Streaming starts
> removing in-memory RDDs from RAM before they are processed by the job –
> that works fine in Spark batch (i.e. removing RDDs from RAM based on LRU),
> but in Spark Streaming, when done in this “unceremonious way”, it simply
> crashes the application.
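(For concreteness, a minimal Scala sketch of the knobs described above; every value below is a placeholder to vary and measure, not a recommendation, and spark.executor.instances assumes a YARN-style deployment:)

  import org.apache.spark.SparkConf

  // Sketch: the receiver rate cap plus the main parallelism knobs.
  val conf = new SparkConf()
    .setAppName("streaming-latency-tuning")
    // cap each receiver at 10,000 records/sec (0 or negative = no limit)
    .set("spark.streaming.receiver.maxRate", "10000")
    // executor-side parallelism: instance count, cores and RAM per executor
    .set("spark.executor.instances", "4")
    .set("spark.executor.cores", "4")
    .set("spark.executor.memory", "8g")

  // Inside the job, the number of RDD partitions sets how many parallel
  // threads operate on each micro-batch RDD, e.g.:
  //   stream.repartition(16).foreachRDD { rdd => ... }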
> From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
> Sent: Monday, May 18, 2015 11:46 AM
> To: Akhil Das
> Cc: user@spark.apache.org
> Subject: Re: Spark Streaming and reducing latency
>
> Thanks, Akhil. So what do folks typically do to increase/contract the
> capacity? Do you plug in some cluster auto-scaling solution to make this
> elastic?
>
> Does Spark have any hooks for instrumenting auto-scaling?
>
> In other words, how do you avoid overwhelming the receivers in a scenario
> where your system's input can be unpredictable, based on users' activity?
>
> On May 17, 2015, at 11:04 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> With receiver-based streaming, you can actually specify
> spark.streaming.blockInterval, which is the interval at which the receiver
> groups the data it fetches from the source into blocks. The default value
> is 200ms, so if your batch duration is 1 second, each batch will be made up
> of 5 blocks. And yes, with Spark Streaming, when your processing time goes
> beyond your batch duration while data consumption stays high, you will
> overwhelm the receiver's memory and it will throw "block not found"
> exceptions.
>
> Thanks
> Best Regards
>
> On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg...@gmail.com>
> wrote:
>
> I keep hearing the argument that the way Discretized Streams work in Spark
> Streaming is a lot more of a batch processing algorithm than true
> streaming. For streaming, one would expect a new item, e.g. in a Kafka
> topic, to be available to the streaming consumer immediately.
>
> With discretized streams, streaming is done in batch intervals, i.e. the
> consumer has to wait out the interval before it can get at the new items.
> If one wants to reduce latency, it seems the only way to do this would be
> by reducing the batch interval window. However, that may lead to a great
> deal of churn, with many requests going into Kafka from the consumers,
> potentially with no results whatsoever as there's nothing new in the topic
> at the moment.
>
> Is there a counter-argument to this reasoning? What are some of the general
> approaches to reduce latency that folks might recommend? Or perhaps there
> are ways of dealing with this at the streaming API level?
>
> If latency is of great concern, is it better to look into streaming from
> something like Flume, where data is pushed to consumers rather than pulled
> by them? Are there techniques, in that case, to ensure the consumers don't
> get overwhelmed with new data?
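(A minimal sketch of the block-interval arithmetic Akhil describes above, assuming a 1-second batch duration and the default 200ms block interval; the app name is a placeholder:)

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("block-interval-demo")
    // interval at which received data is grouped into blocks (the default)
    .set("spark.streaming.blockInterval", "200ms")

  // a 1-second batch duration => 1000 ms / 200 ms = 5 blocks per micro-batch,
  // and each block becomes one task when the batch is processed
  val ssc = new StreamingContext(conf, Seconds(1))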