I've finally resolved my issue! It turned out that it was not related to driver-master-worker connectivity settings.

The problem was caused by an MLlib jar version mismatch:
I noticed that I was using the build.sbt from the AMPCamp example, which referenced mllib v0.9.0, while I was running it on Spark 0.9.2. SBT downloaded the mllib 0.9.0 jar during packaging, but I did not pay attention to it.

However, it looks like the mllib 0.9.0 jar does not work correctly on Spark 0.9.2. When I changed the mllib version in the dependency list, the application worked perfectly.
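
For anyone hitting the same problem, here is a minimal sketch of the kind of build.sbt change involved. It is not the actual AMPCamp file: the project name and version are inferred from the jar name in the sbt output quoted below, and the scalaVersion and the exact dependency lines are assumptions. The only point is that the spark-core and spark-mllib versions should match the Spark version running on the cluster (0.9.2 here).

```scala
// build.sbt -- an illustrative sketch, not the original AMPCamp build file.
// sbt 0.12.x expects a blank line between settings in .sbt files.

// Assumed from the packaged jar name (movielens-als_2.10-0.0.jar).
name := "movielens-als"

version := "0.0"

// Assumption: any Scala 2.10.x release; Spark 0.9.x is built for Scala 2.10.
scalaVersion := "2.10.4"

// Keep both Spark artifacts at the same version as the cluster (0.9.2 here).
// The mismatch described above came from spark-mllib being pinned to 0.9.0.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "0.9.2",
  "org.apache.spark" %% "spark-mllib" % "0.9.2"
)
```

After a change like this, rerunning sbt/sbt package and checking which spark-mllib jar sbt downloads is a quick way to confirm that the versions line up.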

Thanks anyway for your time and willingness to help!

Irina

On 04.10.14 00:17, Yana Kadiyska wrote:
I don't think it's a red herring... (btw, spark.driver.host needs to be
set to the IP or FQDN of the machine where you're running the program).

I am running 0.9.2 on CDH4 and the beginning of my executor log looks
like the one below (I've obfuscated the IP -- this is the log from executor
a100-2-200-245). My driver is running on a100-2-200-238. I am not
specifically setting spark.driver.host or the port, but depending on how
your machine is set up you might need to. Here is the log:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/10/03 18:14:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:48 INFO Remoting: Starting remoting
14/10/03 18:14:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@a100-2-200-245:56760]
14/10/03 18:14:48 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@a100-2-200-245:56760]
**14/10/03 18:14:48 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@a100-2-200-238:61505/user/CoarseGrainedScheduler**
14/10/03 18:14:48 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
14/10/03 18:14:48 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
**14/10/03 18:14:49 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver**
14/10/03 18:14:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:49 INFO Remoting: Starting remoting

If you look at the lines marked with **, this is where the executor
successfully connects to the driver, and at this point you should see your
app show up in the UI under "Running applications"... The worker log you're
posting -- is that the log that is stored under
work/app-<id>/<executor-id>/stderr? The first line you show in that log is

  INFO worker.Worker: Executor app-20141002131901-0002/9 finished with state FAILED

but I imagine something prior to that would say why the executor failed?

On Fri, Oct 3, 2014 at 2:56 PM, Irina Fedulova <fedul...@gmail.com> wrote:

    Yana, many thanks for looking into this!

    I am not running spark-shell in local mode; I really am starting
    spark-shell with --master spark://master:7077 and running in cluster mode.

    Second, I tried setting "spark.driver.host" to "master" both in the
    Scala app when creating the context and in the conf/spark-defaults.conf
    file, but this did not make any difference. The worker logs still have
    the same messages:
    14/10/03 13:37:30 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@host2:51414] -> [akka.tcp://sparkExecutor@host2:53851]: Error [Association failed with [akka.tcp://sparkExecutor@host2:53851]] [
    akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@host2:53851]
    Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: host2/xxx.xx.xx.xx:53851
    ]

    Note that host1, host2, etc. are slave hostnames, and each slave has
    an error message about itself: host1:<some random port> cannot connect
    to host1:<some random port>.

    However, I noticed that after the SparkPi app runs successfully, its
    log is also populated with similar "connection refused" messages, but
    this does not lead to application death... So these worker logs are
    probably a false clue.



    On 03.10.14 19:37, Yana Kadiyska wrote:

        When you're running spark-shell and the example, are you actually
        specifying --master spark://master:7077 as shown here:
        http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark

        because if you're not, your spark-shell is running in local mode and
        not actually connecting to the cluster. Also, if you run spark-shell
        against the cluster, you'll see it listed under the Running
        applications in the master UI. It would be pretty odd for spark-shell
        to connect successfully to the cluster but for your app to not
        connect... (which is why I suspect that you're running spark-shell
        in local mode)

        Another thing to check: the executors need to connect back to your
        driver, so it could be that you have to set the driver host or driver
        port... In fact, looking at your executor log, this seems fairly
        likely: is host1/xxx.xx.xx.xx:45542 the machine where your driver is
        running? Is that host/port reachable from the worker machines?

        On Fri, Oct 3, 2014 at 5:32 AM, Irina Fedulova <fedul...@gmail.com> wrote:

             Hi,

             I have set up a Spark 0.9.2 standalone cluster using CDH5 and the
             pre-built Spark distribution archive for Hadoop 2. I was not using
             the spark-ec2 scripts because I am not on the EC2 cloud.

             Spark-shell seems to be working properly -- I am able to perform
             simple RDD operations, and e.g. the SparkPi standalone example
             works well when run via `run-example`. The web UI shows all workers
             connected.

             However, my standalone Scala application gets "connection refused"
             messages. I think this has something to do with configuration,
             because spark-shell and SparkPi work well. I verified that
             .setMaster and .setSparkHome are properly assigned within the
             Scala app.

             Is there anything else in the configuration of a standalone Scala
             app on Spark that I am missing?
             I would very much appreciate any clues.

             Namely, I am trying to run the MovieLensALS.scala example from the
             AMPCamp big data mini course
             (http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).

             Here is the error I get when I try to run the compiled jar:
             ---------------
             root@master:~/machine-learning/scala# sbt/sbt package "run /movielens/medium"
             Launching sbt from sbt/sbt-launch-0.12.4.jar
             [info] Loading project definition from /root/training/machine-learning/scala/project
             [info] Set current project to movielens-als (in build file:/root/training/machine-learning/scala/)
             [info] Compiling 1 Scala source to /root/training/machine-learning/scala/target/scala-2.10/classes...
             [warn] there were 2 deprecation warning(s); re-run with -deprecation for details
             [warn] one warning found
             [info] Packaging /root/training/machine-learning/scala/target/scala-2.10/movielens-als_2.10-0.0.jar ...
             [info] Done packaging.
             [success] Total time: 6 s, completed Oct 2, 2014 1:19:00 PM
             [info] Running MovieLensALS /movielens/medium
             master = spark://master:7077
             log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
             log4j:WARN Please initialize the log4j system properly.
             log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
             14/10/02 13:19:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
             HERE
             THERE
             14/10/02 13:19:02 INFO FileInputFormat: Total input paths to process : 1
             14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 0 on host2: remote Akka client disassociated
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
             14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 4 on host5: remote Akka client disassociated
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
             14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 1 on host4: remote Akka client disassociated
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 4 (task 0.0:1)
             14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 3 on host3: remote Akka client disassociated
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
             14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 2 on host1: remote Akka client disassociated
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
             14/10/02 13:19:03 WARN TaskSetManager: Lost TID 7 (task 0.0:0)
             14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 6 on host4: remote Akka client disassociated
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 8 (task 0.0:0)
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 9 (task 0.0:1)
             14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 5 on host2: remote Akka client disassociated
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 10 (task 0.0:1)
             14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 7 on host5: remote Akka client disassociated
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 11 (task 0.0:0)
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 12 (task 0.0:1)
             14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 8 on host3: remote Akka client disassociated
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 13 (task 0.0:1)
             14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 9 on host1: remote Akka client disassociated
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 14 (task 0.0:0)
             14/10/02 13:19:04 WARN TaskSetManager: Lost TID 15 (task 0.0:1)
             14/10/02 13:19:05 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
             14/10/02 13:19:05 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
             14/10/02 13:19:06 ERROR TaskSchedulerImpl: Lost executor 11 on host5: remote Akka client disassociated
             14/10/02 13:19:06 WARN TaskSetManager: Lost TID 17 (task 0.0:0)
             14/10/02 13:19:06 WARN TaskSetManager: Lost TID 16 (task 0.0:1)
             ---------------

             And this is the error log on one of the workers:
             ---------------
             14/10/02 13:19:05 INFO worker.Worker: Executor app-20141002131901-0002/9 finished with state FAILED message Command exited with code 1 exitStatus 1
             14/10/02 13:19:05 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40xxx.xx.xx.xx%3A57719-15#1504298502] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
             14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@host1:47421] -> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed with [akka.tcp://sparkExecutor@host1:45542]] [
             akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@host1:45542]
             Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: host1/xxx.xx.xx.xx:45542
             ]
             14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@host1:47421] -> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed with [akka.tcp://sparkExecutor@host1:45542]] [
             akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@host1:45542]
             Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: host1/xxx.xx.xx.xx:45542
             ]
             14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@host1:47421] -> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed with [akka.tcp://sparkExecutor@host1:45542]] [
             akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@host1:45542]
             Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: host1/xxx.xx.xx.xx:45542
             ---------------

             Thanks!
             Irina


        
             ---------------------------------------------------------------------
             To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
             For additional commands, e-mail: user-h...@spark.apache.org




