*TL;DR: More than 50% of the time I can't SSH into my master or slave nodes
and have to terminate all the machines and restart the EC2 cluster setup
process.*

Hello,

I'm trying to set up a Spark cluster on Amazon EC2, and I am finding the
setup script to be fragile and unpredictable about reliably allowing SSH
logins to the master and all of the slaves. For instance (I'm running Spark
0.9.1-hadoop1, since I intend to use Shark), I call this command to
provision a 32-slave cluster using spot instances:

$ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s 32 --instance-type=m1.medium launch cluster_name


After waiting for the instances to be provisioned, I get the following output:

> All 32 slaves granted
> Launched master in us-east-1e, regid = r-f8444a89
> Waiting for instances to start up...
> Waiting 120 more seconds...
> Generating cluster's SSH key on master...
> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22: Connection
> refused
> Error executing remote command, retrying after 30 seconds: Command
> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n      [ -f
> ~/.ssh/id_rsa ] ||\n        (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> &&\n         cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n    "]'
> returned non-zero exit status 255
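
(As a sanity check on the "Connection refused" part: that error means
nothing is listening on port 22 yet, so a plain socket probe would show
when sshd actually comes up. Rough sketch in Python; the hostname is a
placeholder:)

    import socket
    import time

    HOST = 'ecMASTER.compute-1.amazonaws.com'  # placeholder: master's public DNS name

    # Poll port 22 until sshd accepts a TCP connection, to see how long
    # after "running" the machine actually becomes reachable.
    start = time.time()
    for attempt in range(60):  # give up after ~10 minutes
        try:
            sock = socket.create_connection((HOST, 22), timeout=5)
            sock.close()
            print("sshd reachable after %d seconds" % (time.time() - start))
            break
        except (socket.timeout, socket.error):
            time.sleep(10)
    else:
        print("port 22 never opened")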


I have redacted the machine and key names to 'MASTER' and 'key'. The same
error repeats for a few more cycles, and finally, after 3 attempts, the
script quits with this message:


> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22: Connection
> refused
> Error:
> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
> Please check that you have provided the correct --identity-file and
> --key-pair parameters and try again.

So, YES, the .pem file is correct: I am currently running a smaller
cluster and can provision other machines on EC2 using that same file.
Secondly, the node it can't seem to connect to is the MASTER node itself. I
have also gone into the EC2 console and verified that all the machines are
using the "key" pair that corresponds to "key.pem". (And in any case,
"Connection refused" is a TCP-level failure that happens before any key is
checked, so a wrong identity file wouldn't produce it.)

I have tried this command twice, and also on a friend's machine, with no
success. However, I was able to provision a 15-machine cluster using
m1.large instances.

Now I PAUSE for some period of time, 2-3 minutes (to write this email), and
call the same command again with the "--resume" flag. This time it logs
into the master node just fine and begins distributing SSH keys to the
slaves, but then it fails on a particular slave:

> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> Connection refused
> Traceback (most recent call last):
>   File "./spark_ec2.py", line 806, in <module>
>     main()
>   File "./spark_ec2.py", line 799, in main
>     real_main()
>   File "./spark_ec2.py", line 684, in real_main
>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>   File "./spark_ec2.py", line 423, in setup_cluster
>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
>   File "./spark_ec2.py", line 640, in ssh_write
>     raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
> RuntimeError: ssh_write failed with error 255

So I log into the EC2 console, TERMINATE that specific machine, and resume
once more. Now it finally appears to be installing software on the machines.
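
(Terminating the stuck slave can also be done without the console; rough
sketch with boto, where the instance id is a placeholder:)

    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Kill the one slave that never opened port 22, then re-run
    # spark-ec2 with --resume against the remaining machines.
    conn.terminate_instances(instance_ids=['i-xxxxxxxx'])  # placeholder id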

Any ideas why certain machines refuse SSH connections, or why the master
refuses connections for several minutes and then accepts them?

Thanks.

-William
