*TL;DR: more than 50% of the time I can't SSH into my master or slave nodes, and I have to terminate all the machines and restart the EC2 cluster setup process.*
Hello,

I'm trying to set up a Spark cluster on Amazon EC2, and I'm finding the setup script delicate and unpredictable in terms of reliably allowing SSH logins to the master and all of the slaves. I'm running Spark 0.9.1-hadoop1, since I intend to use Shark. I call this command to provision a 32-slave cluster using spot instances:

    $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s 32 --instance-type=m1.medium launch cluster_name

After waiting for the instances to provision, I get the following output:

> All 32 slaves granted
> Launched master in us-east-1e, regid = r-f8444a89
> Waiting for instances to start up...
> Waiting 120 more seconds...
> Generating cluster's SSH key on master...
> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22: Connection refused
> Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem', '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n [ -f ~/.ssh/id_rsa ] ||\n (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n "]' returned non-zero exit status 255

(I have replaced the key and machine names with 'key' and 'MASTER'.)

I wait through a few more cycles of the error message, and finally, after 3 attempts, the script quits with this message:

> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22: Connection refused
> Error:
> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
> Please check that you have provided the correct --identity-file and --key-pair parameters and try again.

So, YES, the .pem file is correct: I am currently running a smaller cluster and can provision other machines on EC2 using that file. Secondly, the node it can't seem to connect to is the MASTER node. I have also gone into the EC2 console and verified that all the machines are using the "key" that corresponds to "key.pem".
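In the meantime, a workaround I'm considering is polling port 22 myself before letting the setup proceed, rather than relying on the script's fixed retry schedule. A minimal sketch (the hostname is a placeholder, and this only tests TCP reachability, not key authentication):

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=300, interval=15):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # A completed TCP handshake means sshd's port is at least open.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Example (placeholder hostname, not a real cluster address):
# wait_for_ssh("ecMASTER.compute-1.amazonaws.com")
```

If this returned True before `spark-ec2` ran its first remote command, the "Connection refused" retries on the master might never be hit.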
I have tried this command twice, and on a friend's machine, with no success. However, I was able to provision a 15-machine cluster using m1.larges.

Now I PAUSE for some period of time, 2-3 minutes (to write this email), and call the same command with the "--resume" flag. This time it logs into the master node just fine and begins giving the slaves SSH keys, and then it fails on a particular slave:

> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22: Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22: Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22: Connection refused
> Error 255 while executing remote command, retrying after 30 seconds
> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22: Connection refused
> Traceback (most recent call last):
>   File "./spark_ec2.py", line 806, in <module>
>     main()
>   File "./spark_ec2.py", line 799, in main
>     real_main()
>   File "./spark_ec2.py", line 684, in real_main
>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>   File "./spark_ec2.py", line 423, in setup_cluster
>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
>   File "./spark_ec2.py", line 640, in ssh_write
>     raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
> RuntimeError: ssh_write failed with error 255

So I log into the EC2 console, TERMINATE that specific machine, and resume again. Now it finally appears to be installing software on the machines.

Any ideas why certain machines refuse SSH connections, or why the master refuses for several minutes and then allows them?

Thanks,
-William
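To avoid hunting through the console for the one bad slave each time, I've been thinking of checking all the public DNS names at once and terminating only the nodes that still refuse SSH before re-running with --resume. A sketch (the host list is a placeholder; as above, this only checks TCP reachability on port 22):

```python
import socket

def unreachable(hosts, port=22, timeout=5):
    """Return the subset of hosts that refuse a TCP connection on port."""
    bad = []
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connected fine; this node is reachable
        except OSError:
            bad.append(host)
    return bad

# Example with placeholder names pulled from the EC2 console:
# unreachable(["ec2-54-237-6-95.compute-1.amazonaws.com", ...])
```

Anything this returns would be a candidate for termination before resuming, instead of waiting for spark_ec2.py to crash on it.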