Hello Spark users, (Apologies if a duplicate of this message just came through)
I am testing the behavior of remote job submission with ec2/spark_ec2.py in the Spark 1.5.2 distribution. I submit SparkPi with spark-submit to a remote EC2 instance over the "standalone mode" (spark://) protocol. Connecting to the master via ssh works, but submission fails. The server logs report:

    Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed

Use case: run Zeppelin to develop, test, and save code on the local machine with Spark in local mode, but intermittently connect to an EC2 cluster to scale out. Having to ssh to the master for every job submission is therefore not acceptable. Please find a reproduction below.

My questions:

1) Is this kind of remote submission over the standalone-mode port 7077 supported?
2) What is the root cause of the protocol failure?
3) Is there a spark-env.sh or other server-side setting that will make the remote submission work?

Regards,

Jeff Henrikson


# reproduction shown with:
#   - "jq" json query
#   - pyenv, virtualenv
#   - awscli

# set configuration
export VPC=. . .                                     # your VPC
export SPARK_HOME=. . ./spark-1.5.2-bin-hadoop2.6    # just the binary Spark 1.5.2 distribution
export IP4_SOURCE=. . .                              # the IP of the gateway for internet access
export KP=. . .                                      # the name of a keypair

# throughout, the cluster is named "cluster2"
# region is us-west-2
# the keypair is ~/.ssh/$KP, registered in us-west-2 as $KP

# set up python/virtualenv
pushd $SPARK_HOME
pyenv local 2.7.6
cd $SPARK_HOME/ec2
virtualenv ../venv
../venv/bin/pip install awscli

# launch cluster
../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP launch cluster2

# authorize firewall port 7077
SG_MASTER=$(../venv/bin/aws ec2 describe-security-groups | jq -r '.SecurityGroups[] | select(.["GroupName"] == "cluster2-master") | .GroupId')
../venv/bin/aws ec2 authorize-security-group-ingress --group-id $SG_MASTER --protocol tcp --port 7077 --cidr $IP4_SOURCE/32

# locate ec2 public dns name of the master
export DNS_MASTER=$(../venv/bin/aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | select(.SecurityGroups[].GroupName == "cluster2-master") | .PublicDnsName')

# verify connectivity to master port 7077
nc -v $DNS_MASTER 7077
# output: ec2-. . . 7077 open

# submit job
$SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 --driver-memory 1g --executor-memory 1g --executor-cores 1 --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar

# expected result:
#   Pi is approximately . . .

# actual result:

# logs on client:
16/02/23 12:28:36 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160223201742-0000/21 on hostPort 172.31.13.146:40392 with 1 cores, 1024.0 MB RAM
16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now LOADING
16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now LOADING
16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now RUNNING
16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now RUNNING
16/02/23 12:28:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

# tail logs on server:
ssh -i ~/.ssh/$KP root@$DNS_MASTER sudo tail -f -n0 /root/spark/logs/*
16/02/23 20:42:42 INFO Master: 192.168.0.4:58498 got disassociated, removing it.
16/02/23 20:42:42 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
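In case it helps narrow down the root cause: 192.168.0.4 is the private LAN address of my submitting machine, which the EC2 instances presumably cannot route back to. If the failure is the master/executors connecting back to the driver, the variant below is what I would try next. It is only a sketch: spark.driver.host, spark.driver.port, and spark.blockManager.port are standard Spark properties, but I have not verified that this fixes the problem, and the addresses and port numbers are placeholders for my setup.

# untested variant: pin the driver to an externally reachable address and fixed ports
$SPARK_HOME/bin/spark-submit \
  --master spark://$DNS_MASTER:7077 \
  --conf spark.driver.host=$IP4_SOURCE \
  --conf spark.driver.port=51000 \
  --conf spark.blockManager.port=51010 \
  --driver-memory 1g --executor-memory 1g --executor-cores 1 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar
# the chosen driver ports would also have to be forwarded on my gateway and allowed
# outbound from the cluster so the master and workers can reach the driver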
# destroy cluster
../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP destroy cluster2
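P.S. Regarding question 3: the only spark-env.sh settings I am aware of that affect addressing are the ones below (they appear in conf/spark-env.sh.template). I do not know whether either of them is relevant here; the assignments are placeholders, not a confirmed fix.

# on the master, e.g. in /root/spark/conf/spark-env.sh, followed by a restart
# via sbin/stop-all.sh && sbin/start-all.sh
export SPARK_MASTER_IP=$DNS_MASTER    # hostname/IP the master binds to
export SPARK_PUBLIC_DNS=$DNS_MASTER   # public DNS name to advertise (per spark-env.sh.template)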