Depends on what you're reusing multiple times (if anything). Read http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
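A minimal sketch of what "just call cache" looks like with the direct stream, assuming Spark 1.3+ and the Scala API (broker, topic, and the trivial processing below are placeholders): createDirectStream takes no storage-level argument, but the resulting DStream can be persisted with MEMORY_AND_DISK so cached blocks spill to disk instead of being dropped.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PersistedDirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persisted-direct-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder broker list and topic set.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // createDirectStream has no storageLevel parameter; persistence is opt-in per DStream.
    // MEMORY_AND_DISK lets cached blocks spill to disk rather than be evicted and lost.
    stream.persist(StorageLevel.MEMORY_AND_DISK)

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}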
On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

At which point would I call cache()? I just want the runtime to spill to disk when necessary, without me having to know when the "necessary" is.

On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger <c...@koeninger.org> wrote:

The direct stream isn't a receiver; it isn't required to cache data anywhere unless you want it to.

If you want it, just call cache.

On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

"Set the storage policy for the DStream RDDs to MEMORY_AND_DISK": it appears the storage level can be specified in the createStream methods, but not in createDirectStream...

On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

You can also try Dynamic Resource Allocation:

https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation

Also, regarding the feedback loop for automatic message consumption rate adjustment, there is a "dumb" solution option: simply set the storage policy for the DStream RDDs to MEMORY_AND_DISK. When memory gets exhausted, Spark Streaming will resort to keeping new RDDs on disk, which will prevent it from crashing and hence losing them. Then some memory will get freed and it will switch back to RAM, and so on and so forth.

-------- Original message --------
From: Evo Eftimov
Date: 2015/05/28 13:22 (GMT+00:00)
To: Dmitry Goldenberg
Cc: Gerard Maas, spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

You can always spin up new boxes in the background and bring them into the cluster fold when fully operational, and time that with the job relaunch and parameter change.

Kafka offsets are managed automatically for you by the Kafka clients, which keep them in ZooKeeper; don't worry about that as long as you shut down your job gracefully. Besides, managing the offsets explicitly is not a big deal if necessary.
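To make the point about explicit offset management concrete, a rough sketch, assuming the Spark 1.3+ direct-stream Scala API; the ssc, kafkaParams and stream values are the illustrative ones from the sketch further up, and where the offsets get stored durably is left open.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// foreachRDD runs on the driver, so this var is updated there, once per batch.
var savedOffsets = Array.empty[OffsetRange]

stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream carries the Kafka offset ranges it covers.
  savedOffsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, then write savedOffsets somewhere durable ...
}

// On relaunch, seed a new stream from the stored offsets instead of "largest"/"smallest".
val fromOffsets: Map[TopicAndPartition, Long] =
  savedOffsets.map(o => TopicAndPartition(o.topic, o.partition) -> o.untilOffset).toMap

val resumed = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))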
-------- Original message --------
From: Dmitry Goldenberg
Date: 2015/05/28 13:16 (GMT+00:00)
To: Evo Eftimov
Cc: Gerard Maas, spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing machines, and relaunching the 'suspended' (shut down) jobs.

I suspect that relaunching the jobs may be tricky, since that means keeping track of the starting offsets in the Kafka topic(s) the jobs were working from.

Ideally, we'd want to avoid a relaunch. The 'suspension' and relaunching of jobs, coupled with the wait for new machines to come online, may turn out to be quite time-consuming, which will make for lengthy request times, and our requests are not asynchronous. Ideally, the currently running jobs would continue to run on the machines currently available in the cluster.

In the scale-down case, the job manager would want to signal to Spark's job scheduler not to send work to the node being taken out, find out when the last job has finished running on that node, and then take the node out.

This is somewhat like changing the number of cylinders in a car engine while the car is running...

Sounds like a great candidate for a set of enhancements in Spark...

On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

@DG: The key metrics should be:

- Scheduling delay: its ideal state is to remain constant over time, and ideally to be less than the duration of the micro-batch window
- Average job processing time: it should remain less than the micro-batch window
- Number of lost jobs: even a single lost job means you have lost all messages of the DStream RDD processed by that job, due to the previously described Spark Streaming memory leak condition and subsequent crash, described in previous postings submitted by me

You can even go one step further and periodically issue a "get/check free memory" call to see whether it is decreasing relentlessly at a constant rate; if it touches a predetermined RAM threshold, that should be your third metric.

Re the "back pressure" mechanism: this is a feedback-loop mechanism, and you can implement one on your own without waiting for JIRAs and new features, whenever they might be implemented by the Spark dev team. Moreover, you can avoid using slow mechanisms such as ZooKeeper, and even incorporate some machine learning in your feedback loop to make it handle the message consumption rate more intelligently and benefit from ongoing online learning. BUT this is STILL about voluntarily sacrificing your performance in the name of keeping your system stable; it is not about scaling your system/solution.

In terms of how to scale the Spark framework dynamically: even though this is not supported at the moment out of the box, I guess you can have a systems management framework dynamically spin up a few more boxes (Spark worker nodes), stop your currently running Spark Streaming job, and relaunch it with new params, e.g. more receivers, a larger number of partitions (hence tasks), more RAM per executor, etc. Obviously this will cause some temporary delay, in fact an interruption, in your processing, but if the business use case can tolerate that, then go for it.
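A small sketch of watching the first two metrics above from inside the driver with a StreamingListener, assuming the Scala API; the threshold check and the println are placeholders for whatever the feedback loop or job manager actually does.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayWatcher(batchWindowMs: Long) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val schedulingDelay = info.schedulingDelay.getOrElse(0L)
    val processingTime  = info.processingDelay.getOrElse(0L)
    // If either value keeps exceeding the micro-batch window, the job is falling behind:
    // a feedback loop or external job manager should throttle intake or add capacity.
    if (schedulingDelay > batchWindowMs || processingTime > batchWindowMs) {
      println(s"Falling behind: schedulingDelay=${schedulingDelay}ms processingTime=${processingTime}ms")
    }
  }
}

// Register on the StreamingContext, e.g. for a 10-second batch window:
// ssc.addStreamingListener(new DelayWatcher(batchWindowMs = 10000L))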
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Thursday, May 28, 2015 12:36 PM
To: dgoldenberg
Cc: spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

Hi,

tl;dr: At the moment (with a BIG disclaimer *) elastic scaling of Spark Streaming processes is not supported.

Longer version.

I assume that you are talking about Spark Streaming, as the discussion is about handling Kafka streaming data. Then you have two things to consider: the streaming receivers and the Spark processing cluster.

Currently, the receiving topology is static. One receiver is allocated with each DStream instantiated, and it will use 1 core in the cluster. Once the StreamingContext is started, this topology cannot be changed, therefore the number of Kafka receivers is fixed for the lifetime of your DStream.

What we do is calculate the cluster capacity and use that as a fixed upper bound (with a margin) for the receiver throughput.

There's work in progress to add a reactive model to the receiver, where backpressure can be applied to handle overload conditions. See https://issues.apache.org/jira/browse/SPARK-7398

Once the data is received, it will be processed in a 'classical' Spark pipeline, so previous posts on Spark resource scheduling might apply.

Regarding metrics, the standard metrics subsystem of Spark will report streaming job performance. Check the driver's metrics endpoint to peruse the available metrics:

<driver>:<ui-port>/metrics/json

-kr, Gerard.

(*) Spark is a project that moves so fast that statements might be invalidated by new work every minute.

On Thu, May 28, 2015 at 1:21 AM, dgoldenberg <dgoldenberg...@gmail.com> wrote:

Hi,

I'm trying to understand if there are design patterns for autoscaling Spark (adding/removing slave machines to/from the cluster) based on the throughput.

Assuming we can throttle the Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to generate metrics on the number of new messages and the rate at which they are piling up? This is perhaps more of a Kafka question; I see a pretty sparse javadoc with the Metric interface and not much else...

What are some of the ways to expand/contract the Spark cluster? Someone has mentioned Mesos...

I see some info on Spark metrics in the Spark monitoring guide <https://spark.apache.org/docs/latest/monitoring.html>. Do we perhaps want to implement a custom sink that would help us autoscale up or down based on the throughput?
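For anyone wiring the metrics endpoint above into an external autoscaler, a minimal sketch in Scala; the host is a placeholder, 4040 is the default UI port of the first application on a machine, and the gauge-name comment is an assumption to verify against your deployment.

import scala.io.Source

object DriverMetricsProbe {
  def main(args: Array[String]): Unit = {
    // Placeholder host; adjust to the driver's UI host and port.
    val metricsUrl = "http://driver-host:4040/metrics/json"
    val json = Source.fromURL(metricsUrl).mkString
    // Streaming gauges (scheduling delay, processing delay, waiting batches, ...) should
    // show up under names containing "StreamingMetrics.streaming."; feed this blob to a
    // JSON parser or monitoring agent to drive the scale-up/scale-down decision.
    println(json)
  }
}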