Hi Ishaaq, answers inline from what I know; I'd like to be corrected, though.
On Tue, Apr 15, 2014 at 5:58 PM, ishaaq <ish...@gmail.com> wrote:

> Hi all,
> I am evaluating Spark to use here at my work.
>
> We have an existing Hadoop 1.x install which I am planning to upgrade to
> Hadoop 2.3.

This is not really a requirement for Spark; if you are doing it for some
other reason, great!

> I am trying to work out whether I should install YARN or simply just set
> up a Spark standalone cluster. We already use ZooKeeper so it isn't a
> problem to set up HA. I am puzzled however as to how the Spark nodes can
> coordinate on data locality - i.e., assuming I install the nodes on the
> same machines as the DFS data nodes, I don't understand how Spark can
> work out which nodes should get which splits of the jobs?

This happens exactly the same way Hadoop's MapReduce figures out data
locality: Spark reads data through Hadoop's InputFormats, which carry the
information about how the data is partitioned and on which hosts each
block lives, and the scheduler uses that to place tasks. So having Spark
workers share the same nodes as your DFS is a good idea. There is a small
sketch at the end of this mail showing how to inspect this.

> Anyway, my bigger question remains: YARN or standalone? Which is the more
> stable option currently? Which is the more future-proof option?

I think standalone is stable enough for all purposes, and Spark's YARN
support has been keeping up with the latest Hadoop versions too. It mostly
comes down to this: if you are already running YARN and don't want the
hassle of operating another cluster manager, you can prefer YARN;
otherwise a standalone cluster is the simpler setup. The second sketch at
the end shows how the choice surfaces in your application config.

> Thanks,
> Ishaaq
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/standalone-vs-YARN-tp4271.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
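P.S. A minimal sketch of the locality point above; the hostname and HDFS
path are placeholders you would replace with your own. textFile goes
through Hadoop's TextInputFormat, so each partition knows which datanodes
hold its block, and preferredLocations shows where Spark will try to run
the corresponding task:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("locality-demo"))

        // textFile uses Hadoop's TextInputFormat, so every partition carries
        // the block locations that the HDFS NameNode reported for it.
        val rdd = sc.textFile("hdfs://namenode:8020/data/events.log")

        // preferredLocations lists the hosts the scheduler will try first
        // for each partition's task, i.e. the datanodes holding that block.
        rdd.partitions.take(3).foreach { p =>
          println(s"partition ${p.index} -> " +
            rdd.preferredLocations(p).mkString(", "))
        }

        sc.stop()
      }
    }

If your workers are colocated with the datanodes, the hosts it prints are
exactly your worker nodes, which is why colocation pays off.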
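And a sketch of how the standalone-vs-YARN choice shows up in your app;
the master hostnames and app name are made up. With ZooKeeper HA the
standalone master URL can list several masters:

    import org.apache.spark.{SparkConf, SparkContext}

    object SubmitDemo {
      def main(args: Array[String]): Unit = {
        // Standalone with ZooKeeper HA: list the masters; the driver talks
        // to whichever is the current leader and fails over if it dies.
        val standalone = new SparkConf()
          .setAppName("my-app")
          .setMaster("spark://master1:7077,master2:7077")

        // YARN (client mode): no host in the URL; the ResourceManager
        // address comes from the Hadoop config (HADOOP_CONF_DIR) on the
        // classpath.
        val onYarn = new SparkConf()
          .setAppName("my-app")
          .setMaster("yarn-client")

        val sc = new SparkContext(standalone) // or: new SparkContext(onYarn)
        println(sc.master)
        sc.stop()
      }
    }

Everything else in the job is identical either way, which is part of why
switching cluster managers later is not a big deal.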