William,

The error you are seeing is misleading. There is no need to terminate the
cluster and start over. Just re-run your launch command with the additional
--resume option tacked on the end. As Akhil explained, this happens because
AWS is not starting up the instances as quickly as the script expects, so
increasing the wait time (-w) will also mitigate the problem.
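With your original command, that would look something like this (the -w
value of 400 is just an illustration from the 300-400 second range Akhil
suggested):

    $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
        -s 32 --instance-type=m1.medium -w 400 launch cluster_name --resume

With --resume, the script skips launching new instances and retries the
setup steps against the machines you already have running, so you don't
pay for a fresh set of instances.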
Nick

On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You need to increase the wait time (-w); the default is 120 seconds, and
> you may set it to a higher number like 300-400. The problem is that EC2
> takes some time to initiate the machine (which is > 120 seconds
> sometimes).
>
> Thanks
> Best Regards
>
>
> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
> <william....@distilnetworks.com> wrote:
>
>> *TL;DR: >50% of the time I can't SSH into either my master or slave
>> nodes and have to terminate all the machines and restart the EC2
>> cluster setup process.*
>>
>> Hello,
>>
>> I'm trying to set up a Spark cluster on Amazon EC2. I am finding the
>> setup script to be delicate and unpredictable in terms of reliably
>> allowing SSH logins to all of the slaves and the master. For instance
>> (I'm running Spark 0.9.1-hadoop1, since I intend to use Shark), I call
>> this command to provision a 32-slave cluster using spot instances:
>>
>>> $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s 32 --instance-type=m1.medium launch cluster_name
>>
>> After waiting for the instances to provision, I get the following
>> output:
>>
>>> All 32 slaves granted
>>> Launched master in us-east-1e, regid = r-f8444a89
>>> Waiting for instances to start up...
>>> Waiting 120 more seconds...
>>> Generating cluster's SSH key on master...
>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Error executing remote command, retrying after 30 seconds: Command
>>> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
>>> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n      [ -f
>>> ~/.ssh/id_rsa ] ||\n        (ssh-keygen -q -t rsa -N '' -f
>>> ~/.ssh/id_rsa &&\n         cat ~/.ssh/id_rsa.pub >>
>>> ~/.ssh/authorized_keys)\n    "]' returned non-zero exit status 255
>>
>> I have replaced the key and machine names with 'MASTER' and 'key'. I
>> wait through a few more cycles of the error message and finally, after
>> 3 attempts, the script quits with this message:
>>
>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Error:
>>> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
>>> Please check that you have provided the correct --identity-file and
>>> --key-pair parameters and try again.
>>
>> So, YES, the .pem file is correct - I am currently running a smaller
>> cluster and can provision other machines on EC2 using that file.
>> Secondly, the node it can't seem to connect to is the MASTER node. I
>> have also gone into the EC2 console and verified that all the machines
>> are using the "key" that corresponds to "key.pem".
>>
>> I have tried this command twice, and on a friend's machine, with no
>> success. However, I was able to provision a 15-machine cluster using
>> m1.large instances.
>>
>> Now I PAUSE for some period of time - 2-3 minutes (to write this
>> email) - and I call the same command with the "--resume" flag. This
>> time it logs into the master node just fine and begins to give the
>> slaves SSH keys, and then it fails on a certain slave:
>>
>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Error 255 while executing remote command, retrying after 30 seconds
>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Error 255 while executing remote command, retrying after 30 seconds
>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Error 255 while executing remote command, retrying after 30 seconds
>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>> Connection refused
>>> Traceback (most recent call last):
>>>   File "./spark_ec2.py", line 806, in <module>
>>>     main()
>>>   File "./spark_ec2.py", line 799, in main
>>>     real_main()
>>>   File "./spark_ec2.py", line 684, in real_main
>>>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>>>   File "./spark_ec2.py", line 423, in setup_cluster
>>>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
>>>   File "./spark_ec2.py", line 640, in ssh_write
>>>     raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
>>> RuntimeError: ssh_write failed with error 255
>>
>> So I log into the EC2 console, TERMINATE that specific machine, and run
>> --resume again. Now it finally appears to be installing software on the
>> machines.
>>
>> Any ideas why certain machines refuse SSH connections, or why the
>> master refuses for several minutes and then allows them?
>>
>> Thanks.
>>
>> -William
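On your last question: a node that refuses connections for a few minutes
and then accepts them is almost always still booting - sshd simply isn't
up yet - which is the same root cause as the master timeout above. Before
terminating a slave, you can probe it by hand with roughly the same
invocation the script uses (host name taken from your log; spark-ec2
connects as root by default):

    $ ssh -o StrictHostKeyChecking=no -i ~/key.pem \
        root@ec2-54-237-6-95.compute-1.amazonaws.com 'echo ok'

If that starts succeeding after a minute or two, another pass with
--resume should pick up where the script left off, without terminating
anything.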