Thank you, Gerard.

We're looking at the receiver-less (direct) setup for Kafka with Spark
Streaming, so I'm not sure how your comments apply to that case (we don't
have to use the receiver-less approach, but it seems to offer some
advantages over the receiver-based one).
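
For reference, here's a minimal sketch of what the direct approach looks
like on our end (Spark 1.3+; the broker list and topic name are just
placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-kafka-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// placeholder broker addresses and topic name
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

// no long-running receiver; one RDD partition per Kafka partition
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, topics)

stream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()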

As far as "the number of Kafka receivers is fixed for the lifetime of your
DStream" -- this may be OK to start with. What I'm researching is the
ability to add worker nodes to the Spark cluster when needed and remove
them when no longer needed.  Do I understand correctly that a single
receiver may cause work to be farmed out to multiple 'slave'
machines/worker nodes?  If that's the case, we're less concerned with
multiple receivers; we're concerned with the worker node cluster itself.
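
For example (a rough sketch, reusing the ssc from the snippet above, with
a placeholder ZooKeeper quorum and consumer group): even with a single
receiver-based stream, I'm assuming the per-batch processing can be spread
across the rest of the cluster by repartitioning:

import org.apache.spark.streaming.kafka.KafkaUtils

// the single receiver occupies one core on one executor; repartition()
// spreads the processing of each batch across the other workers
val receiverStream = KafkaUtils.createStream(
  ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))

receiverStream
  .repartition(16)
  .map { case (_, msg) => msg.length }
  .count()
  .print()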

If we use the ConsumerOffsetChecker class in Kafka that Rajesh mentioned
and instrument dynamic addition/removal of machines, my follow-up
questions are: a) will Spark sense the addition of a new node, or is it
sufficient that the cluster manager is aware of it and work simply starts
flowing there?  and  b) what would be a way to gracefully remove a worker
node when the load subsides, so that no currently running Spark job is
killed?

- Dmitry

On Thu, May 28, 2015 at 7:36 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi,
>
> tl;dr At the moment (with a BIG disclaimer *), elastic scaling of Spark
> Streaming processes is not supported.
>
>
> *Longer version.*
>
> I assume that you are talking about Spark Streaming, as the discussion is
> about handling Kafka streaming data.
>
> Then you have two things to consider: the Streaming receivers and the
> Spark processing cluster.
>
> Currently, the receiving topology is static. One receiver is allocated
> for each DStream instantiated, and it will use one core in the cluster.
> Once the StreamingContext is started, this topology cannot be changed;
> therefore the number of Kafka receivers is fixed for the lifetime of your
> DStream. What we do is calculate the cluster capacity and use that as a
> fixed upper bound (with a margin) for the receiver throughput.
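>
> For illustration, a minimal sketch of how such a cap can be expressed
> (the numbers are placeholders -- derive them from your own capacity
> calculation):
>
> import org.apache.spark.SparkConf
>
> // receiver-based streams: max records/sec per receiver;
> // direct streams: max records/sec per Kafka partition
> val conf = new SparkConf()
>   .setAppName("capped-ingest")
>   .set("spark.streaming.receiver.maxRate", "10000")
>   .set("spark.streaming.kafka.maxRatePerPartition", "2000")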
>
> There's work in progress to add a reactive model to the receiver, where
> backpressure can be applied to handle overload conditions. See
> https://issues.apache.org/jira/browse/SPARK-7398
>
> Once the data is received, it will be processed in a 'classical' Spark
> pipeline, so previous posts on Spark resource scheduling might apply.
>
> Regarding metrics, the standard metrics subsystem of Spark will report
> streaming job performance. Check the driver's metrics endpoint to peruse
> the available metrics:
>
> <driver>:<ui-port>/metrics/json
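>
> For example, a quick way to pull those programmatically (sketch; the
> host name is a placeholder, and the driver UI port defaults to 4040):
>
> import scala.io.Source
>
> // fetch the driver's metrics registry as JSON
> val metricsJson =
>   Source.fromURL("http://driver-host:4040/metrics/json").mkString
> println(metricsJson)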
>
> -kr, Gerard.
>
>
> (*) Spark is a project that moves so fast that statements might be
> invalidated by new work every minute.
>
> On Thu, May 28, 2015 at 1:21 AM, dgoldenberg <dgoldenberg...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to understand whether there are design patterns for
>> autoscaling Spark (adding/removing slave machines to/from the cluster)
>> based on the throughput.
>>
>> Assuming we can throttle Spark consumers, the respective Kafka topics we
>> stream data from would start growing.  What are some of the ways to
>> generate metrics on the number of new messages and the rate at which
>> they are piling up? This is perhaps more of a Kafka question; I see a
>> pretty sparse javadoc for the Metric interface and not much else...
>>
>> What are some of the ways to expand/contract the Spark cluster? Someone
>> has mentioned Mesos...
>>
>> I see some info on Spark metrics in the Spark monitoring guide
>> <https://spark.apache.org/docs/latest/monitoring.html>.  Do we perhaps
>> want to implement a custom sink that would help us autoscale up or down
>> based on the throughput?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tp23062.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
