Sander, I eventually solved this problem via the --[no-]switch_user flag on the Mesos slaves, which defaults to true. I set it to false, so the job runs as the user that owns the slave process. Otherwise it was my username (scarman) running the job, which failed because that user obviously didn't exist on the slaves. When run as root, it ran fine with no problems whatsoever.
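In case it helps, this is roughly how the flag can be set on each slave. The exact mechanism depends on how Mesos was installed; the /etc/mesos-slave file below assumes the Mesosphere package layout, so treat the paths and the restart command as assumptions rather than a recipe:

  # pass the flag directly when starting the slave
  mesos-slave --master=<master-ip>:5050 --no-switch_user

  # or, with the Mesosphere packages (file/path convention assumed), drop a
  # config file that the init wrapper turns into --switch_user=false
  echo false | sudo tee /etc/mesos-slave/switch_user
  sudo service mesos-slave restart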
Hopefully this works for you too,
Steve

> On May 13, 2015, at 11:45 AM, Sander van Dijk <sgvand...@gmail.com> wrote:
>
> Hey all,
>
> I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz prebuilt package, and Mesos installed from the Mesosphere repositories. I have been running Spark standalone successfully for a while and am now trying to set up Mesos. Mesos is up and running, and the UI at port 5050 reports all slaves alive. I then run the Spark shell with `spark-shell --master mesos://1.1.1.1:5050` (with 1.1.1.1 the master's IP address), which starts up fine, with output:
>
> I0513 15:02:45.340287 28804 sched.cpp:448] Framework registered with 20150512-150459-2618695596-5050-3956-0009
> 15/05/13 15:02:45 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20150512-150459-2618695596-5050-3956-0009
>
> and the framework shows up in the Mesos UI. Then trying to run something (e.g. `val rdd = sc.textFile("path"); rdd.count`) fails with lost executors. In /var/log/mesos-slave.ERROR on the slave instances there are entries like:
>
> E0513 14:57:01.198995 13077 slave.cpp:3112] Container 'eaf33d36-dde5-498a-9ef1-70138810a38c' for executor '20150512-145720-2618695596-5050-3082-S10' of framework '20150512-150459-2618695596-5050-3956-0009' failed to start: Failed to execute mesos-fetcher: Failed to chown work directory
>
> From what I can find, the work directory is in /tmp/mesos, where indeed I see a directory structure with executor and framework IDs, with stdout and stderr files of size 0 at the leaves. Everything there is owned by root, but I assume the processes are also run by root, so any chowning in there should be possible.
>
> I was thinking maybe it fails to fetch the Spark package for the executor? I uploaded spark-1.2.1-bin-hadoop2.4.tgz to HDFS, SPARK_EXECUTOR_URI is set in spark-env.sh, and in the Environment section of the web UI I see this picked up in the spark.executor.uri parameter. I checked and the URI is reachable by the slaves: an `hdfs dfs -stat $SPARK_EXECUTOR_URI` is successful.
>
> Any pointers?
>
> Many thanks,
> Sander
>
> On Fri, May 1, 2015 at 8:35 AM Tim Chen <t...@mesosphere.io> wrote:
> Hi Stephen,
>
> It looks like the Mesos slave was most likely not able to launch some Mesos helper processes (the fetcher, probably?).
>
> How did you install Mesos? Did you build from source yourself?
>
> Please install Mesos through a package, or if you build from source, actually run make install and run from the installed binary.
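(For reference, the source-install route Tim describes looks roughly like this; the install prefix mentioned is just the autotools default, not something specified in the thread:

  # from an unpacked Mesos release source tree
  ./configure
  make
  sudo make install   # installs mesos-master/mesos-slave under /usr/local/sbin by default

and then run the slaves from the installed binaries rather than from the build tree.)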
> Tim
>
> On Mon, Apr 27, 2015 at 11:11 AM, Stephen Carman <scar...@coldlight.com> wrote:
> So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves; I basically got the pre-built package from the Spark website.
>
> I placed those compiled Spark installs on each slave at /opt/spark.
>
> My Spark properties seem to be getting picked up on my side fine.
>
> [Screenshot attachment: Screen Shot 2015-04-27 at 10.30.01 AM.png]
>
> The framework is registered in Mesos and shows up just fine. It doesn't matter whether I turn off the executor URI or not, I always get the same error:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> These boxes are totally open to one another, so they shouldn't have any firewall issues. Everything seems to show up in Mesos and Spark just fine, but actually running stuff totally blows up.
>
> There is nothing in stderr or stdout; it downloads the package and untars it but doesn't seem to do much after that. Any insights?
>
> Steve
>
>> On Apr 24, 2015, at 5:50 PM, Yang Lei <genia...@gmail.com> wrote:
>>
>> SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST
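For the record, the executor URI setup described in the quoted messages boils down to something like the following on the driver machine; the HDFS path and the libmesos location are placeholders, not the exact values used in the thread:

  # conf/spark-env.sh (MESOS_NATIVE_LIBRARY is the variable the Spark 1.2 docs use)
  export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
  export SPARK_EXECUTOR_URI=hdfs:///tmp/spark-1.2.1-bin-hadoop2.4.tgz

  # sanity-check that the slaves can reach the package, then start the shell
  hdfs dfs -stat "$SPARK_EXECUTOR_URI"
  spark-shell --master mesos://1.1.1.1:5050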