Hey all,

I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with
Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz
prebuilt package, and Mesos installed from the Mesosphere repositories. I
have been running Spark standalone successfully for a while and am now
trying to set up Mesos. Mesos is up and running, and the UI at port 5050
reports all slaves alive. I then run the Spark shell with `spark-shell
--master mesos://1.1.1.1:5050` (where 1.1.1.1 is the master's IP address),
which starts up fine, with output:

    I0513 15:02:45.340287 28804 sched.cpp:448] Framework registered with 20150512-150459-2618695596-5050-3956-0009
    15/05/13 15:02:45 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20150512-150459-2618695596-5050-3956-0009

and the framework shows up in the Mesos UI. But then running anything
(e.g. `val rdd = sc.textFile("path"); rdd.count`) fails with lost
executors. In /var/log/mesos-slave.ERROR on the slave instances there are
entries like:

    E0513 14:57:01.198995 13077 slave.cpp:3112] Container 'eaf33d36-dde5-498a-9ef1-70138810a38c' for executor '20150512-145720-2618695596-5050-3082-S10' of framework '20150512-150459-2618695596-5050-3956-0009' failed to start: Failed to execute mesos-fetcher: Failed to chown work directory

From what I can find, the work directory is in /tmp/mesos, where I indeed
see a directory structure of framework and executor IDs, with zero-size
stdout and stderr files at the leaves. Everything there is owned by root,
but I assume the slave processes also run as root, so any chown in there
should be possible.
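If it helps, this is the kind of quick ownership audit I mean (the
/tmp/mesos default and the sandbox file layout are assumptions on my part):

```shell
# Sketch: report owner, group and mode of every stdout/stderr file under
# the Mesos work directory (the /tmp/mesos default is an assumption).
list_sandbox_owners() {
  work_dir=${1:-/tmp/mesos}
  # Suppress errors so a missing directory just yields empty output.
  find "$work_dir" \( -name stdout -o -name stderr \) \
    -exec stat -c '%U:%G %a %n' {} \; 2>/dev/null
}
list_sandbox_owners || true  # empty when the directory is absent
```

Comparing that output with the user the slave actually runs as (e.g.
`ps -o user= -C mesos-slave`) should show whether a chown inside the
sandbox could legitimately fail.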

I was thinking maybe it fails to fetch the Spark executor package? I
uploaded spark-1.2.1-bin-hadoop2.4.tgz to HDFS, SPARK_EXECUTOR_URI is set
in spark-env.sh, and in the Environment section of the web UI I see it
picked up in the `spark.executor.uri` parameter. I checked that the URI is
reachable from the slaves: `hdfs dfs -stat $SPARK_EXECUTOR_URI` succeeds.
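For reference, a minimal sketch of the spark-env.sh entries involved (the
namenode address and HDFS path below are placeholders, not my actual
values):

```shell
# conf/spark-env.sh -- Mesos-related settings (placeholder values).
# URI the executors download the Spark distribution from; this is what
# shows up in the web UI as spark.executor.uri.
export SPARK_EXECUTOR_URI=hdfs://namenode:8020/spark/spark-1.2.1-bin-hadoop2.4.tgz
# Path to the Mesos native library (location depends on how Mesos was installed).
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
```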

Any pointers?

Many thanks,
Sander

On Fri, May 1, 2015 at 8:35 AM Tim Chen <t...@mesosphere.io> wrote:

> Hi Stephen,
>
> It looks like the Mesos slave was most likely unable to launch some Mesos
> helper processes (probably the fetcher?).
>
> How did you install Mesos? Did you build from source yourself?
>
> Please install Mesos through a package, or, if building from source, run
> `make install` and run from the installed binaries.
>
> Tim
>
> On Mon, Apr 27, 2015 at 11:11 AM, Stephen Carman <scar...@coldlight.com>
> wrote:
>
>>  So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves;
>> I basically just got the pre-built package from the Spark website…
>>
>>  I placed those compiled spark installs on each slave at /opt/spark
>>
>>  My spark properties seem to be getting picked up on my side fine…
>>
>>  The framework is registered in Mesos and shows up just fine. It doesn’t
>> matter whether I turn off the executor URI or not; I always get the same
>> error…
>>
>>  org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage
>> 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor
>> 20150424-104711-1375862026-5050-20113-S1 lost)
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at scala.Option.foreach(Option.scala:236)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>>  These boxes are totally open to one another, so there shouldn’t be any
>> firewall issues. Everything seems to show up in Mesos and Spark just fine,
>> but actually running anything totally blows up.
>>
>>  There is nothing in the stderr or stdout, it downloads the package and
>> untars it but doesn’t seem to do much after that. Any insights?
>>
>>  Steve
>>
>>
>>  On Apr 24, 2015, at 5:50 PM, Yang Lei <genia...@gmail.com> wrote:
>>
>> SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST
>>
>>
>>
>
>
