Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a mvn repo, I believe the RC2 artifacts are here:
https://repository.apache.org/content/repositories/orgapachespark-1104/

A few experiments you might try:

1. Does spark-shell work? It might start fine, but make sure you can create
an RDD and use it, e.g., something like:

    val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
    rdd foreach println

2. Try coarse-grained mode, which has different logic for executor
management. You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:

    spark.mesos.coarse  true

Or, from this page
<http://spark.apache.org/docs/latest/running-on-mesos.html>, set the
property in a SparkConf object used to construct the SparkContext:

    conf.set("spark.mesos.coarse", "true")

(A fuller standalone sketch of this is inlined further down, after Iulian's
question.)

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups <sp...@orbit-x.de> wrote:

> Hello,
>
> I assume I am running Spark in fine-grained mode, since I haven't changed
> the default here.
>
> One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I
> could use for my project config? (I know that I have to download the
> source and run make-distribution for the executor as well.)
>
> thanks
> reinis
>
>
> On 25.05.2015 17:07, Iulian Dragoș wrote:
>
> On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
>
>> Hello,
>>
>> I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with ZooKeeper,
>> running on a cluster with 3 nodes on 64-bit Ubuntu.
>>
>> My application is compiled with Spark 1.3.1 (apparently with a Mesos
>> 0.21.0 dependency), Hadoop 2.5.1-mapr-1503 and Akka 2.3.10. Only with
>> this combination have I succeeded in running Spark jobs on Mesos at all.
>> Other versions cause class loader issues.
>>
>> I am submitting Spark jobs with spark-submit with mesos://zk://.../mesos.
>
> Are you using coarse-grained or fine-grained mode?
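
Related to Iulian's question above and to suggestion 2: here is a minimal
sketch of a standalone app that enables coarse-grained mode via SparkConf.
The app name and the zk:// master URL are placeholders for your own values,
and spark.executor.uri points at the tarball visible in the fetcher log
below; adapt as needed.

import org.apache.spark.{SparkConf, SparkContext}

object CoarseGrainedCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("coarse-grained-check")              // placeholder app name
      .setMaster("mesos://zk://dev-zk01:2181/mesos")   // placeholder ZooKeeper master URL
      .set("spark.mesos.coarse", "true")               // switch from fine- to coarse-grained mode
      .set("spark.executor.uri",                       // tarball the slaves should fetch
        "hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz")

    val sc = new SparkContext(conf)
    try {
      // Same sanity check as suggestion 1: build an RDD and act on it.
      sc.parallelize(Seq(1, 2, 3, 4, 5, 6)).foreach(println)
    } finally {
      sc.stop()
    }
  }
}

The same two settings can also go into spark-defaults.conf or be passed via
--conf on the spark-submit command line instead.
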
>> The sandbox log of slave-node app01 (the one that stalls) shows the following:
>>
>> 10:01:25.815506 35409 fetcher.cpp:214] Fetching URI
>> 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
>> 10:01:26.497764 35409 fetcher.cpp:99] Fetching URI
>> 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
>> 10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from
>> 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to
>> '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
>> 10:01:32.877717 35409 fetcher.cpp:78] Extracted resource
>> '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
>> into
>> '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
>> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
>> 10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
>> 10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
>> *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
>> 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave
>> 20150511-150924-3410235146-5050-1903-S3
>> 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID
>> 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
>
> It looks like an inconsistent state on the Mesos scheduler. It tries to
> launch a task on a given slave before the executor has registered. This
> code was improved/refactored in 1.4, could you try 1.4.0-RC1?
>
> Yes, and note the second message after the error you highlighted; that's
> when the executor would be registered with Mesos and the local object
> created.
>
> iulian
>
>> 10:01:34 INFO SecurityManager: Changing view acls to...
>> 10:01:35 INFO Slf4jLogger: Slf4jLogger started
>> 10:01:35 INFO Remoting: Starting remoting
>> 10:01:35 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkExecutor@app01:xxx]
>> 10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
>> 10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker:
>> akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
>> 10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster:
>> akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
>> 10:01:36 INFO DiskBlockManager: Created local directory at
>> /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
>> 10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
>> 10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for
>> your platform...
>> using builtin-java classes where applicable
>> 10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator:
>> akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
>> 10:01:36 INFO Executor: Starting executor ID
>> 20150511-150924-3410235146-5050-1903-S3 on host app01
>> 10:01:36 INFO NettyBlockTransferService: Server created on XXX
>> 10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
>> 10:01:36 INFO BlockManagerMaster: Registered BlockManager
>> 10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver:
>> akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver
>>
>> As soon as the spark-driver is aborted, the following log entries are added
>> to the sandbox log of slave-node app01:
>>
>> 10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
>> 10:17:29 WARN ReliableDeliverySupervisor: Association with remote system
>> [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for
>> [5000] ms. Reason is: [Disassociated]
>>
>> A successful job shows instead the following in the spark-driver log:
>>
>> 08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage
>> 1.0 (TID 7) in 1688 ms on app01 (1/4)
>> 08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage
>> 1.0 (TID 4) in 1700 ms on app03 (2/4)
>> 08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage
>> 1.0 (TID 5) in 1703 ms on app02 (3/4)
>> 08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage
>> 1.0 (TID 6) in 1706 ms on app02 (4/4)
>> 08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1
>> (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
>> 08:03:19,878 INFO o.a.s.s.TaskSchedulerImpl - Removed TaskSet 1.0, whose
>> tasks have all completed, from pool
>> 08:03:19,886 INFO o.a.s.s.DAGScheduler - Job 0 finished:
>> saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90, took 16.946405 s
>>
>> This corresponds nicely to the sandbox logs of the slave-nodes:
>>
>> 08:03:19 INFO Executor: Finished task 3.0 in stage 1.0 (TID 7). 872 bytes
>> result sent to driver
>> 08:03:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 872 bytes
>> result sent to driver
>> 08:03:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 872 bytes
>> result sent to driver
>> 08:03:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID 6). 872 bytes
>> result sent to driver
>> 08:03:20 WARN ReliableDeliverySupervisor: Association with remote system
>> [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for
>> [5000] ms. Reason is: [Disassociated].
>
>
> --
> Iulian Dragos
>
> ------
> Reactive Apps on the JVM
> www.typesafe.com
>
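
For what it's worth, here is a small, self-contained sketch of the race Iulian
describes (illustrative Scala only, not the actual Spark or Mesos source; the
class and method names are made up): the executor object is only created when
the registration callback fires, so a launch request that arrives first has
nothing to hand the task to and can only log an error.

object ExecutorRegistrationRace extends App {
  // Stand-in for Spark's per-slave executor backend.
  class BackendSketch {
    @volatile private var executor: Option[String] = None

    // In the real backend, this is where the Executor object gets built,
    // once Mesos confirms registration on the slave.
    def registered(executorId: String): Unit =
      executor = Some(executorId)

    def launchTask(taskId: Long): Unit = executor match {
      case None =>
        // The situation behind "ERROR MesosExecutorBackend: Received
        // launchTask but executor was null" in the app01 sandbox log.
        Console.err.println(s"Received launchTask $taskId but executor was null")
      case Some(id) =>
        println(s"executor $id running task $taskId")
    }
  }

  val backend = new BackendSketch
  backend.launchTask(1L)    // launch arrives before registration: task is lost
  backend.registered("S3")  // registration (the "second message" in the log)
  backend.launchTask(2L)    // now the task can actually run
}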