@DG: The key metrics should be:

- Scheduling delay: should remain roughly constant over time and ideally stay below the micro-batch window.
- Average job processing time: should also remain below the micro-batch window.
- Number of lost jobs: even a single lost job means you have lost all messages of the DStream RDD processed by that job, due to the previously described Spark Streaming memory-leak condition and subsequent crash (described in my earlier postings).

You can go one step further and periodically "get/check free memory" to see whether it is decreasing relentlessly at a constant rate; if it touches a predetermined RAM threshold, treat that as an additional metric to watch. (A rough StreamingListener sketch for tracking the batch metrics follows Gerard's quoted message below.)

Re the "back pressure" mechanism: this is a feedback-loop mechanism, and you can implement one of your own without waiting for JIRAs and new features, whenever the Spark dev team might deliver them. You can also avoid slow mechanisms such as ZooKeeper, and even incorporate some machine learning in your feedback loop so that it handles the message consumption rate more intelligently and benefits from ongoing online learning. BUT this is still about voluntarily sacrificing performance in the name of keeping your system stable; it is not about scaling your system/solution.

In terms of how to scale the Spark framework dynamically: even though this is not supported out of the box at the moment, you can have a systems-management framework dynamically spin up a few more boxes (Spark worker nodes), stop your currently running Spark Streaming job, and relaunch it with new parameters, e.g. more receivers, a larger number of partitions (hence tasks), more RAM per executor, etc. (see the multi-receiver sketch below). Obviously this will cause a temporary delay, in fact an interruption, in your processing, but if the business use case can tolerate that, go for it.

From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Thursday, May 28, 2015 12:36 PM
To: dgoldenberg
Cc: spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

Hi,

tl;dr At the moment (with a BIG disclaimer *) elastic scaling of Spark Streaming processes is not supported.

Longer version.

I assume that you are talking about Spark Streaming, as the discussion is about handling Kafka streaming data. Then you have two things to consider: the streaming receivers and the Spark processing cluster.

Currently, the receiving topology is static. One receiver is allocated with each DStream instantiated, and it will use 1 core in the cluster. Once the StreamingContext is started, this topology cannot be changed, therefore the number of Kafka receivers is fixed for the lifetime of your DStream. What we do is calculate the cluster capacity and use that as a fixed upper bound (with a margin) for the receiver throughput.

There's work in progress to add a reactive model to the receiver, where backpressure can be applied to handle overload conditions. See https://issues.apache.org/jira/browse/SPARK-7398

Once the data is received, it will be processed in a 'classical' Spark pipeline, so previous posts on Spark resource scheduling might apply.

Regarding metrics, the standard metrics subsystem of Spark will report streaming job performance. Check the driver's metrics endpoint to peruse the available metrics:

<driver>:<ui-port>/metrics/json

-kr, Gerard.

(*) Spark is a project that moves so fast that statements might be invalidated by new work every minute.
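Regarding the batch metrics mentioned at the top: a minimal sketch of tracking them with Spark Streaming's public StreamingListener API (Scala). The batch-window value, alert thresholds and logging are placeholders, not anything prescribed by Spark:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Tracks scheduling delay and processing time per completed batch and warns
// when either exceeds the micro-batch window. Driver-side free memory is
// sampled as a crude stand-in for the "get/check free memory" idea; executor
// memory would need a different hook.
class BatchHealthListener(batchWindowMs: Long) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val schedulingDelayMs = info.schedulingDelay.getOrElse(0L)
    val processingDelayMs = info.processingDelay.getOrElse(0L)

    if (schedulingDelayMs > batchWindowMs)
      println(s"WARN: scheduling delay $schedulingDelayMs ms exceeds batch window $batchWindowMs ms")
    if (processingDelayMs > batchWindowMs)
      println(s"WARN: processing time $processingDelayMs ms exceeds batch window $batchWindowMs ms")

    val freeMemMb = Runtime.getRuntime.freeMemory() / (1024 * 1024)
    println(s"INFO: driver free memory $freeMemMb MB")
  }
}

// Registration, assuming an existing StreamingContext `ssc` with a 10s window:
//   ssc.addStreamingListener(new BatchHealthListener(batchWindowMs = 10000L))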
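And on the relaunch-with-more-receivers idea: since the receiver topology is fixed once the StreamingContext starts (as Gerard notes), "more receivers" in practice means resubmitting the job with a larger receiver count. A rough sketch assuming the receiver-based KafkaUtils.createStream API; the topic, group and ZooKeeper quorum values are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Creates N parallel Kafka receivers (each occupies one core) and unions them
// into a single DStream. N is taken from the command line so the same jar can
// be stopped and relaunched with a different receiver count.
object RelaunchableKafkaJob {
  def main(args: Array[String]): Unit = {
    val numReceivers = args.headOption.map(_.toInt).getOrElse(2)
    val conf = new SparkConf().setAppName("relaunchable-kafka-job")
    val ssc = new StreamingContext(conf, Seconds(10))

    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)
    unified.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}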
On Thu, May 28, 2015 at 1:21 AM, dgoldenberg <dgoldenberg...@gmail.com> wrote:

Hi,

I'm trying to understand if there are design patterns for autoscaling Spark (add/remove slave machines to the cluster) based on the throughput.

Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to generate the metrics on the number of new messages and the rate they are piling up? This perhaps is more of a Kafka question; I see a pretty sparse javadoc with the Metric interface and not much else...

What are some of the ways to expand/contract the Spark cluster? Someone has mentioned Mesos...

I see some info on Spark metrics in the Spark monitoring guide <https://spark.apache.org/docs/latest/monitoring.html>. Do we want to perhaps implement a custom sink that would help us autoscale up or down based on the throughput?
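On the custom-sink / throughput-metrics question above: before writing a custom sink, you can already poll the driver's metrics endpoint that Gerard mentions and feed the numbers into whatever logic adds or removes workers. A minimal polling sketch; the host, port, interval and gauge names are assumptions and depend on your deployment and Spark version:

import scala.io.Source

// Periodically pulls the JSON metrics snapshot from the Spark driver.
// The streaming gauges (e.g. something like lastCompletedBatch_schedulingDelay)
// can be parsed out of this JSON and compared against the batch window to
// drive an autoscaling decision.
object DriverMetricsPoller {
  def main(args: Array[String]): Unit = {
    val metricsUrl = "http://spark-driver-host:4040/metrics/json"  // placeholder host/port
    while (true) {
      val json = Source.fromURL(metricsUrl).mkString
      println(s"metrics snapshot (${json.length} chars); parse and decide whether to scale here")
      Thread.sleep(30000L)  // poll every 30 seconds
    }
  }
}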