Re: Spark standalone network configuration problems

Shannon Quinn Fri, 27 Jun 2014 06:08:27 -0700

I put the settings as you specified in spark-env.sh for the master. When I run start-all.sh, the web UI shows both the worker on the master (machine1) and the slave worker (machine2) as ALIVE and ready, with the master URL at spark://192.168.1.101. However, when I run spark-submit, it immediately crashes with

py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]

akka.remote.RemoteTransportException: Startup failed
[...]

org.jboss.netty.channel.ChannelException: Failed to bind to /192.168.1.101:5060

[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable to bind to the exact IP:port set for the master?
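
For reference, here is what "Address already in use" means at the socket level. The sketch below uses plain Python sockets, not Spark code, and the helper name `bind_error` is made up for illustration. It starts a listener (a stand-in for the master) and then tries to bind a second socket to the same address and port. It only reproduces the generic error condition; it does not establish which Spark setting leads to the second bind in this thread.

```python
import errno
import socket


def bind_error(host, port):
    """Try to bind a TCP socket to (host, port); return the errno name or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


# Stand-in for the master: a listener on an OS-chosen free port on loopback.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
taken_port = listener.getsockname()[1]

# A second bind to the same address:port is refused while the listener is alive.
second_attempt = bind_error("127.0.0.1", taken_port)
print(second_attempt)

listener.close()
```

If a second process is told to listen on the address and port where the master is already listening, a bind failure of this kind would be the expected outcome.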


On 6/27/14, 1:54 AM, Akhil Das wrote:

Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system is not able to bind to the 192.168.1.101 address. What is the spark:// master URL that you are seeing in the web UI? (It should be spark://192.168.1.101:7077 in your case).




Thanks
Best Regards

On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squ...@gatech.edu> wrote:


    In the interest of completeness, this is how I invoke spark:

    [on master]

    > sbin/start-all.sh
    > spark-submit --py-files extra.py main.py

    iPhone'd

    On Jun 26, 2014, at 17:29, Shannon Quinn <squ...@gatech.edu> wrote:

    My *best guess* (please correct me if I'm wrong) is that the
    master (machine1) is sending the command to the worker (machine2)
    with the localhost argument as-is; that is, machine2 isn't doing
    any weird address conversion on its end.

    Consequently, I've been focusing on the settings of the
    master/machine1. But I haven't found anything to indicate where
    the localhost argument could be coming from. /etc/hosts lists
    only 127.0.0.1 as localhost; spark-defaults.conf lists
    spark.master as the full IP address (not 127.0.0.1); spark-env.sh
    on the master also lists the full IP under SPARK_MASTER_IP. The
    *only* place on the master where it's associated with localhost
    is SPARK_LOCAL_IP.

    In looking at the logs of the worker spawned on master, it's also
    receiving a "spark://localhost:5060" argument, but since it
    resides on the master that works fine. Is it possible that the
    master is, for some reason, passing
    "spark://{SPARK_LOCAL_IP}:5060" to the workers?

    That was my motivation behind commenting out SPARK_LOCAL_IP;
    however, that's when the master crashes immediately due to the
    address already being in use.

    Any ideas? Thanks!

    Shannon

    On 6/26/14, 10:14 AM, Akhil Das wrote:

    Can you paste your spark-env.sh file?

    Thanks
    Best Regards


    On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
    <squ...@gatech.edu> wrote:

        Both /etc/hosts files have each other's IP addresses in them.
        Telnetting from machine2 to machine1 on port 5060 works just
        fine.

        Here's the output of lsof:

        user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
        COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
        java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
        java    23985 user   40u  IPv6 11099560      0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
        java    23985 user   52u  IPv6 11100405      0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
        java    24157 user   40u  IPv6 11092413      0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)

        Ubuntu seems to recognize 5060 as the standard port for
        "sip"; it's not actually running anything there besides
        Spark, it just does a s/5060/sip/g.
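
The 5060-to-"sip" substitution comes from the system services database (typically /etc/services), which tools such as lsof consult by default; `lsof -P -i:5060` should print the numeric port instead. A minimal Python sketch of the same lookup follows. The helper name `port_label` is hypothetical, and the result depends on what the local services database contains.

```python
import socket


def port_label(port, proto="tcp"):
    """Return the service name registered for a port, or the number as text."""
    try:
        return socket.getservbyport(port, proto)
    except OSError:
        # No entry in the local services database for this port/protocol.
        return str(port)


print(port_label(5060))
```

On a machine whose services database lists 5060/tcp, this prints the registered name rather than the number, which is all the "sip" label in the lsof output reflects.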

        Is there something to the fact that every time I comment out
        SPARK_LOCAL_IP in spark-env, it crashes immediately upon
        spark-submit due to the "address already being in use"? Or
        am I barking up the wrong tree on that one?

        Thanks again for all your help; I hope we can knock this one
        out.

        Shannon


        On 6/26/14, 9:13 AM, Akhil Das wrote:

        Do you have "<ip> machine1" in your worker's /etc/hosts
        also? If so, try telnetting from machine2 to machine1 on
        port 5060. Also make sure nothing else is running on port
        5060 other than Spark (lsof -i:5060).

        Thanks
        Best Regards


        On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
        <squ...@gatech.edu> wrote:

            Still running into the same problem. /etc/hosts on the
            master says

            127.0.0.1    localhost
            <ip> machine1

            <ip> is the same address set in spark-env.sh for
            SPARK_MASTER_IP. Any other ideas?
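
The /etc/hosts check discussed here can be scripted. The sketch below parses hosts-file text and reports whether "localhost" is listed under a given address. The function names are hypothetical, and 192.0.2.10 is a placeholder from the documentation address range standing in for the elided <ip>, not the real value.

```python
def names_by_ip(hosts_text):
    """Map each IP in /etc/hosts-style text to the names listed for it."""
    mapping = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        mapping.setdefault(ip, []).extend(names)
    return mapping


def localhost_conflicts(hosts_text, master_ip):
    """True if 'localhost' is listed under master_ip."""
    return "localhost" in names_by_ip(hosts_text).get(master_ip, [])


# Placeholder address; substitute the value used for SPARK_MASTER_IP.
sample = """\
127.0.0.1    localhost
192.0.2.10   machine1
"""
print(localhost_conflicts(sample, "192.0.2.10"))
```

To check a real machine, the same function can be given the contents of /etc/hosts read from disk.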


            On 6/26/14, 3:11 AM, Akhil Das wrote:

            Hi Shannon,

            It is probably a configuration issue; check your
            /etc/hosts and make sure localhost is not associated
            with the SPARK_MASTER_IP you provided.

            Thanks
            Best Regards


            On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
            <squ...@gatech.edu> wrote:

                Hi all,

                I have a 2-machine Spark network I've set up: a
                master and worker on machine1, and worker on
                machine2. When I run 'sbin/start-all.sh',
                everything starts up as it should. I see both
                workers listed on the UI page. The logs of both
                workers indicate successful registration with the
                Spark master.

                The problems begin when I attempt to submit a job:
                I get an "address already in use" exception that
                crashes the program. It says "Failed to bind to "
                and lists the exact port and address of the master.

                At this point, the only items I have set in my
                spark-env.sh are SPARK_MASTER_IP and
                SPARK_MASTER_PORT (non-standard, set to 5060).

                The next step I took, then, was to explicitly set
                SPARK_LOCAL_IP on the master to 127.0.0.1. This
                allows the master to successfully send out the
                jobs; however, it ends up canceling the stage
                after running this command several times:

                14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
                14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
                14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now RUNNING
                14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
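
To follow how each executor ID moves through its states, the "Executor updated" messages above can be tallied with a short script. This is a sketch that assumes the message format shown in this log excerpt. The regular expression and function name are illustrative, not part of Spark.

```python
import re
from collections import Counter

# Matches e.g. "Executor updated: app-20140625210032-0000/8 is now FAILED"
UPDATE_RE = re.compile(r"Executor updated: (\S+)/(\d+) is now (\w+)")


def executor_states(log_text):
    """Return {executor number: [states in order]} from driver log text."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    states = {}
    for _app_id, exec_id, state in UPDATE_RE.findall(flat):
        states.setdefault(int(exec_id), []).append(state)
    return states


sample = """\
14/06/25 21:00:47 INFO AppClient$ClientActor:
Executor updated: app-20140625210032-0000/8 is now
RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor:
Executor updated: app-20140625210032-0000/8 is now
FAILED (Command exited with code 1)
"""
states = executor_states(sample)
final = Counter(seq[-1] for seq in states.values())
print(states, final)
```

Run over the full driver log, `final` gives a count of executors by last reported state, which makes a repeated RUNNING-then-FAILED pattern easy to spot.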

                The "/8" started at "/1", eventually becomes "/9",
                and then "/10", at which point the program
                crashes. The worker on machine2 shows similar
                messages in its logs. Here are the last bunch:

                14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished with state FAILED message Command exited with code 1 exitStatus 1
                14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-0000/10 for app_name
                Spark assembly has been built with Hive, including Datanucleus jars on classpath
                14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
                "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
                "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
                "org.apache.spark.executor.CoarseGrainedExecutorBackend"
                "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
                "10" "machine2" "8"
                "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
                14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 finished with state FAILED message Command exited with code 1 exitStatus 1

                I highlighted the part that seemed strange to me:
                that's the master port number (I set it to 5060),
                and yet it's referencing localhost. Is this the
                reason why machine2 can't seem to give a
                confirmation to the master once the job is
                submitted? (The logs from the worker on the master
                node indicate that it's running just fine.)
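
A URL whose host is localhost resolves to whichever machine evaluates it, so an executor on machine2 given such a URL would look for the scheduler on machine2 itself. The sketch below scans a quoted launch command for akka.tcp:// or spark:// URLs that point at localhost or a loopback address. The regular expression and function names are illustrative assumptions, and the optional asterisks only account for the emphasis markers used in this message.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Quoted akka.tcp:// or spark:// arguments, optionally wrapped in *...*
URL_RE = re.compile(r'"\*?((?:akka\.tcp|spark)://[^"*]+)\*?"')


def _is_loopback_ip(host):
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname, not an IP literal


def loopback_urls(launch_command):
    """Return URLs in the command whose host is localhost or a loopback IP."""
    flagged = []
    for url in URL_RE.findall(launch_command):
        host = urlsplit(url).hostname or ""
        if host == "localhost" or _is_loopback_ip(host):
            flagged.append(url)
    return flagged


cmd = (
    '"org.apache.spark.executor.CoarseGrainedExecutorBackend" '
    '"*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" '
    '"10" "machine2" "8" '
    '"akka.tcp://sparkWorker@machine2:53597/user/Worker"'
)
print(loopback_urls(cmd))
```

Applied to the launch command above, only the CoarseGrainedScheduler URL is flagged; the worker URL names machine2 explicitly.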

                I appreciate any assistance you can offer!

                Regards,
                Shannon Quinn
