What are the options for running a Spark cluster for streaming purposes? Which is better: standalone, Mesos, or YARN?
Please share cluster details, size of data, and type of processing (multiple processing points, architecture, or similar).

I see folks using a YARN cluster for streaming purposes.

Regards,
Deepak

On Thu, Aug 13, 2015 at 9:12 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> I am looking to decide what is best for my production-grade Spark
> application(s).
>
> YARN
> =====
>
> 1. YARN supports security. When Spark is run over YARN, the
>    communication between processes can use secure authentication through
>    Kerberos.
> 2. A Spark standalone cluster can only run Spark jobs and nothing else.
>    With YARN you can have different kinds of jobs, like M/R or Spark.
> 3. The YARN scheduler has multiple features like queues, hierarchical
>    queues with pluggable policies, auto placement of apps into queues, easy
>    installation, and ACLs for queues. These are missing with the standalone
>    Spark scheduler. Resources are more intelligently and dynamically used.
> 4. The Spark standalone scheduler requires each application to run an
>    executor on every node in the cluster, whereas with YARN you can run
>    executor(s) on a subset of nodes. I haven't tested it.
> 5. On YARN, Spark supports running the driver on the client machine itself
>    (yarn-client), which requires the client application to run for the
>    lifetime of the application. With yarn-cluster mode, the Spark driver
>    runs on the Application Master, and hence the client program can either
>    exit or do something else. This feature might be missing with the
>    standalone scheduler.
> 6. YARN provides finer control of resources like CPU cores. The number of
>    executors per node is configurable with YARN depending on the number of
>    CPUs present on the node; this is missing on Mesos. It might be
>    available with future releases.
> 7. Standalone mode requires management of daemon services. It also
>    requires a ZooKeeper setup, as the Spark master node needs to be highly
>    available to avoid a single point of failure.
> 8. Most of the existing users of Hadoop clusters have large amounts of
>    data (TBs/PBs) residing on the cluster, and hence Spark applications can
>    make use of data locality when running on a YARN cluster. Making it
>    available on a standalone cluster might be a challenge.
> 9. I do not think there are performance impacts of running a Spark
>    application on a YARN/Mesos/standalone cluster. Might require a test.
> 10. Mesos and Spark were both developed at AMPLab, so both might be
>    better compatible with each other. However, I do not have any working
>    knowledge of Mesos.
>
> I was wondering what the advantages and disadvantages are of running Spark
> over Mesos and Spark over a standalone cluster. This will help me (and
> others on the verge of using Spark systems) to decide which direction to go.
>
> Regards,
> Deepak
>
> On Wed, 12 Aug 2015 at 10:28 PM Tim Chen <t...@mesosphere.io> wrote:
>
>> I'm not sure what you're looking for, since you can't really compare
>> Standalone with YARN or Mesos, as Standalone assumes the Spark
>> workers/master own the cluster, and YARN/Mesos is trying to share the
>> cluster among different applications/frameworks.
>>
>> And when you refer to resource utilization, what exactly does it mean to
>> you? Is it the ability to maximize the usage of your resources with
>> multiple applications in mind, or just how much configuration Spark allows
>> you in each mode?
>>
>> Tim
>>
>> On Wed, Aug 12, 2015 at 2:16 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>> wrote:
>>
>>> Do we have any comparisons in terms of resource utilization and
>>> scheduling of running Spark in the below three modes?
>>> 1) Standalone
>>> 2) over YARN
>>> 3) over Mesos
>>>
>>> Can someone share resources (thoughts/URLs) on this area?
>>>
>>> --
>>> Deepak

--
Deepak
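
For anyone comparing the three modes hands-on, here is a minimal sketch of how the same application might be submitted to each cluster manager. It only builds the `spark-submit` command line and does not run Spark. The host names, jar name, class name, and queue name are hypothetical placeholders, and the ports are the usual defaults (7077 for a standalone master, 5050 for a Mesos master), which may differ on your cluster. Newer Spark releases spell the YARN modes as `--master yarn` plus `--deploy-mode client|cluster`; the `yarn-client` / `yarn-cluster` names used in the thread are the older spelling.

```python
import shlex


def master_url(manager, host=None):
    """Return the --master value for a cluster manager.

    `host` is a hypothetical master host name; YARN does not take one
    because the ResourceManager address comes from the Hadoop
    configuration (HADOOP_CONF_DIR / YARN_CONF_DIR).
    """
    if manager == "standalone":
        return f"spark://{host}:7077"   # default standalone master port
    if manager == "mesos":
        return f"mesos://{host}:5050"   # default Mesos master port
    if manager == "yarn":
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


def spark_submit_command(manager, app_jar, main_class, host=None,
                         deploy_mode="client", queue=None,
                         num_executors=None, total_executor_cores=None):
    """Build a spark-submit argument list for the given cluster manager."""
    cmd = [
        "spark-submit",
        "--master", master_url(manager, host),
        "--deploy-mode", deploy_mode,
        "--class", main_class,
    ]
    if manager == "yarn":
        # Queues and an explicit executor count are YARN-side options.
        if queue:
            cmd += ["--queue", queue]
        if num_executors:
            cmd += ["--num-executors", str(num_executors)]
    elif total_executor_cores:
        # Standalone and Mesos cap an application by total cores instead.
        cmd += ["--total-executor-cores", str(total_executor_cores)]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # Hypothetical application and hosts, for illustration only.
    jar, cls = "streaming-app.jar", "com.example.StreamingJob"
    examples = [
        spark_submit_command("standalone", jar, cls, host="spark-master",
                             total_executor_cores=8),
        spark_submit_command("mesos", jar, cls, host="mesos-master",
                             total_executor_cores=8),
        spark_submit_command("yarn", jar, cls, deploy_mode="cluster",
                             queue="streaming", num_executors=4),
    ]
    for cmd in examples:
        print(shlex.join(cmd))
```

Keeping the master URL out of the application code and passing it at submit time means the same streaming jar can be tried on all three managers, which makes the side-by-side test mentioned in point 9 easier to run.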