What are the options for running a Spark cluster for streaming workloads?
Which is better: standalone, Mesos, or YARN?

Please share your cluster details, the size of your data, and the type of
processing (multiple processing points, architecture, or similar).

I see folks using YARN clusters for streaming purposes.
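
For reference, here is a minimal sketch of how the same application could be
pointed at each of the three cluster managers through spark-submit. The host
names, ports, class name, and jar path are placeholders I made up, and the
helper function itself is hypothetical. The quoted message below uses the
older yarn-client / yarn-cluster names; the sketch uses the equivalent
`--master yarn --deploy-mode ...` form, so check which one your Spark release
expects.

```python
# Sketch only: builds spark-submit argument lists, it does not launch anything.
# All host names, ports, the class name and the jar path are assumptions.

MASTERS = {
    "standalone": "spark://master-host:7077",  # hypothetical standalone master
    "mesos": "mesos://mesos-host:5050",        # hypothetical Mesos master
    "yarn": "yarn",                            # cluster located via the Hadoop config
}


def spark_submit_args(manager, app_jar, main_class, deploy_mode="client"):
    """Return the spark-submit argument list for one cluster manager."""
    if manager not in MASTERS:
        raise ValueError("unknown cluster manager: %s" % manager)
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", MASTERS[manager],
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Print one command line per cluster manager for comparison.
    for name in ("standalone", "mesos", "yarn"):
        args = spark_submit_args(name, "streaming-app.jar",
                                 "com.example.StreamingJob")
        print(" ".join(args))
```

Only the `--master` value changes between the three; the application jar and
its code stay the same, which is why the choice is mostly about scheduling,
security, and operations rather than about the application itself.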

Regards,
Deepak

On Thu, Aug 13, 2015 at 9:12 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> I am trying to decide what is best for my production-grade Spark
> application(s).
>
> YARN
> =====
>
>    1. YARN supports security. When Spark is run over YARN the
>    communication between processes can use secure authentication through
>    Kerberos.
>    2. Spark standalone cluster can only run Spark jobs and nothing else.
>    With YARN you can have different kinds of jobs like M/R, or Spark.
>    3. The YARN scheduler has multiple features such as queues, hierarchical
>    queues with pluggable policies, automatic placement of apps into queues,
>    easy installation, and ACLs for queues. These are missing from the
>    standalone Spark scheduler. With YARN, resources are used more
>    intelligently and dynamically.
>    4. The Spark standalone scheduler requires each application to run an
>    executor on every node in the cluster, whereas with YARN you can run
>    executor(s) on a subset of nodes. I haven't tested this.
>    5. On YARN, Spark supports running the driver on the client machine itself
>    (yarn-client), which requires the client application to run for the
>    lifetime of the application. In yarn-cluster mode, the Spark driver runs
>    in the Application Master, so the client program can either exit or do
>    something else. This feature might be missing from the standalone scheduler.
>    6. YARN provides finer control of resources such as CPU cores. The number
>    of executors per node is configurable with YARN, depending on the number
>    of CPUs present on the node. This is missing on Mesos; it might become
>    available in future releases.
>    7. Standalone mode requires management of daemon services. It also
>    requires a ZooKeeper setup, because the Spark master node needs to be
>    highly available to avoid a single point of failure.
>    8. Most existing users of Hadoop clusters have large amounts of data
>    (TBs/PBs) residing on the cluster, so Spark applications can make use of
>    data locality when running on a YARN cluster. Making that data available
>    on a standalone cluster might be a challenge.
>    9. I do not think that running a Spark application on YARN, Mesos, or a
>    standalone cluster affects its performance, but this might require a test.
>    10. Mesos and Spark were both developed at AMPLab, so they might be
>    more compatible with each other. However, I do not have any working
>    knowledge of Mesos.
>
>
>
> I was wondering what the advantages and disadvantages are of running Spark
> over Mesos and Spark over a standalone cluster. This will help me (and others
> on the verge of using Spark systems) decide which direction to go.
>
> Regards,
> Deepak
>
> On Wed, 12 Aug 2015 at 10:28 PM Tim Chen <t...@mesosphere.io> wrote:
>
>> I'm not sure what you're looking for, since you can't really compare
>> Standalone with YARN or Mesos: Standalone assumes the Spark
>> workers/master own the cluster, whereas YARN/Mesos try to share the
>> cluster among different applications/frameworks.
>>
>> And when you refer to resource utilization, what exactly does it mean to
>> you? Is it the ability to maximize the usage of your resources with
>> multiple applications in mind, or just how much configuration Spark allows
>> you in each mode?
>>
>> Tim
>>
>> On Wed, Aug 12, 2015 at 2:16 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>> wrote:
>>
>>> Do we have any comparisons, in terms of resource utilization and
>>> scheduling, of running Spark in the three modes below?
>>> 1) Standalone
>>> 2) over YARN
>>> 3) over Mesos
>>>
>>> Can someone share resources (thoughts/URLs) on this area?
>>>
>>>
>>> --
>>> Deepak
>>>
>>>
>>


-- 
Deepak
