Hello Spark users,

(Apologies if a duplicate of this message just came through)

I am testing the behavior of remote job submission with ec2/spark_ec2.py in 
the Spark 1.5.2 distribution.  I submit SparkPi via spark-submit to a remote 
EC2 instance over the "standalone mode" (spark://) protocol.  Connecting to 
the master via ssh works, but the submission fails.  The server logs report:

Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed

Use case: use Zeppelin to develop and test, keeping the code on my local 
machine and running against local Spark, but intermittently connecting to an 
EC2 cluster to scale out.  Having to ssh to the master first for job 
submission is therefore not acceptable.
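
For concreteness, the workflow I am after looks roughly like this (an
illustrative sketch only; SparkPi stands in for real Zeppelin code):

    # develop/test against local spark on my machine
    $SPARK_HOME/bin/spark-submit --master "local[*]" \
        --class org.apache.spark.examples.SparkPi \
        $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar

    # scale out by pointing the same submission at the EC2 master
    $SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 \
        --class org.apache.spark.examples.SparkPi \
        $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar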

Please find below a reproduction.

My questions:

1) Is this kind of remote submission over standalone mode port 7077 supported?
2) What is the root cause of the protocol failure?
3) Is there a spark-env.sh or other server-side setting that would make the 
remote submission work?  (A sketch of the kind of setting I mean follows below.)
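
For question 3, the kind of setting I have in mind (purely illustrative,
taken from the spark-env.sh template; I don't know whether any of these is
the right knob) would be something like:

    # conf/spark-env.sh on the master -- illustrative guesses only
    export SPARK_MASTER_IP=. . .     # address the master advertises/binds to
    export SPARK_PUBLIC_DNS=. . .    # public hostname advertised instead of the private one
    export SPARK_LOCAL_IP=. . .      # IP to bind listening sockets to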

Regards,


Jeff Henrikson



    # reproduction shown with:
        - "jq" JSON query tool
        - pyenv, virtualenv
        - awscli

    # set configuration

        export VPC=. . .                                   # your VPC
        export SPARK_HOME=. . ./spark-1.5.2-bin-hadoop2.6  # just the binary Spark 1.5.2 distribution
        export IP4_SOURCE=. . .                            # the IP of the gateway for internet access
        export KP=. . .                                    # the name of a keypair
        # throughout, cluster is named "cluster2"
        # region is us-west-2
        # keypair given is ~/.ssh/$KP and registered in us-west-2 as $KP
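
        # (assumed prerequisite) AWS credentials; as I understand it,
        # spark_ec2.py reads them from the environment, and the aws cli can
        # use the same variables:
        export AWS_ACCESS_KEY_ID=. . .
        export AWS_SECRET_ACCESS_KEY=. . .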

    # setup python/virtualenv

        pushd $SPARK_HOME
        pyenv local 2.7.6
        cd $SPARK_HOME/ec2

        virtualenv ../venv

        ../venv/bin/pip install awscli
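
        # (optional) record the exact tool versions used, for reproducibility
        jq --version
        pyenv --version
        ../venv/bin/pip --version
        ../venv/bin/aws --version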

    # launch cluster
        ../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 \
            --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP launch cluster2
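
        # (optional cross-check) spark_ec2.py can also report the master
        # hostname directly; it should agree with the describe-instances
        # lookup further down
        ../venv/bin/python spark_ec2.py --region=us-west-2 get-master cluster2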

    # authorize firewall port 7077

        SG_MASTER=$(../venv/bin/aws ec2 describe-security-groups | jq -r \
            '.SecurityGroups[] | select(.["GroupName"] == "cluster2-master") | .GroupId')
        ../venv/bin/aws ec2 authorize-security-group-ingress --group-id $SG_MASTER \
            --protocol tcp --port 7077 --cidr $IP4_SOURCE/32
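
        # (optional) confirm the ingress rule is present on the master security group
        ../venv/bin/aws ec2 describe-security-groups --group-ids $SG_MASTER \
            | jq '.SecurityGroups[0].IpPermissions'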

    # locate ec2 public dns name
        export DNS_MASTER=$(../venv/bin/aws ec2 describe-instances | jq -r \
            '.Reservations[].Instances[] | select(.SecurityGroups[].GroupName == "cluster2-master") | .PublicDnsName')

    # verify connectivity to master port 7077
        nc -v $DNS_MASTER 7077
            ec2-. . . 7077 open

    # submit job
        $SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 \
            --driver-memory 1g --executor-memory 1g --executor-cores 1 \
            --class org.apache.spark.examples.SparkPi \
            $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar


    # expected result:
        Pi is approximately . . .

    # actual result:
        # logs on client:

            16/02/23 12:28:36 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160223201742-0000/21 on hostPort 172.31.13.146:40392 with 1 cores, 1024.0 MB RAM
            16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now LOADING
            16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now LOADING
            16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now RUNNING
            16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now RUNNING
            16/02/23 12:28:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
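
        # (diagnostic sketch, untested) the executor hostPort above is a
        # private VPC address, and the driver address in the server log below
        # is my LAN address behind NAT, so I suspect neither side can reach
        # the other.  For example, checking from the master whether my driver
        # port (58498 in this run) is reachable at my public IP:
        ssh -i ~/.ssh/$KP root@$DNS_MASTER "nc -v -w 5 $IP4_SOURCE 58498"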

        # tail logs on server:

        ssh -i ~/.ssh/$KP root@$DNS_MASTER
        sudo tail -f -n0 /root/spark/logs/*

            16/02/23 20:42:42 INFO Master: 192.168.0.4:58498 got disassociated, removing it.
            16/02/23 20:42:42 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
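
    # untested idea, related to questions 2/3
        # If the failure is the cluster being unable to reach back to the
        # driver at 192.168.0.4, then pinning the driver's advertised address
        # and port (spark.driver.host and spark.driver.port are documented
        # Spark properties) and opening that port toward my machine might
        # help.  I have not verified this -- hence question 3.  Port 51000
        # below is an arbitrary example.
        $SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 \
            --conf spark.driver.host=$IP4_SOURCE \
            --conf spark.driver.port=51000 \
            --driver-memory 1g --executor-memory 1g --executor-cores 1 \
            --class org.apache.spark.examples.SparkPi \
            $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar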

    # destroy cluster
        ../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 \
            --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP destroy cluster2


