Ah, thanks for the help! That worked great.
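In case anyone else hits this, the relaunch that did the trick for me was roughly the original launch command with a longer wait and --resume tacked on the end (the key name, key path, and cluster name below are the same placeholders as in my original message):

    $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
        -s 32 --instance-type=m1.medium --wait 600 launch cluster_name --resume

The longer --wait just gives EC2 more time to bring the instances up before the script starts trying to SSH into them.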
On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang <zonghen...@gmail.com> wrote:
> To add to this: for this many (>= 20) machines I usually use at least
> --wait 600.
>
> On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > William,
> >
> > The error you are seeing is misleading. There is no need to terminate the
> > cluster and start over.
> >
> > Just re-run your launch command, but with the additional --resume option
> > tacked on the end.
> >
> > As Akhil explained, this happens because AWS is not starting up the
> > instances as quickly as the script expects. You can increase the wait
> > time to mitigate this problem.
> >
> > Nick
> >
> > On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das <ak...@sigmoidanalytics.com>
> > wrote:
> >>
> >> You need to increase the wait time (-w); the default is 120 seconds, and you
> >> may set it to a higher number like 300-400. The problem is that EC2 takes
> >> some time to initiate the machine (which is sometimes more than 120 seconds).
> >>
> >> Thanks
> >> Best Regards
> >>
> >> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
> >> <william....@distilnetworks.com> wrote:
> >>>
> >>> TL;DR: >50% of the time I can't SSH into either my master or slave nodes
> >>> and have to terminate all the machines and restart the EC2 cluster setup
> >>> process.
> >>>
> >>> Hello,
> >>>
> >>> I'm trying to set up a Spark cluster on Amazon EC2. I am finding the setup
> >>> script to be delicate and unpredictable in terms of reliably allowing SSH
> >>> logins to all of the slaves and the master. For instance, I'm running Spark
> >>> 0.9.1-hadoop1, since I intend to use Shark. I call this command to provision
> >>> a 32-slave cluster using spot instances:
> >>>
> >>>> $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s
> >>>> 32 --instance-type=m1.medium launch cluster_name
> >>>
> >>> After waiting for the instances to provision I get the following output:
> >>>
> >>>> All 32 slaves granted
> >>>> Launched master in us-east-1e, regid = r-f8444a89
> >>>> Waiting for instances to start up...
> >>>> Waiting 120 more seconds...
> >>>> Generating cluster's SSH key on master...
> >>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error executing remote command, retrying after 30 seconds: Command
> >>>> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
> >>>> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n      [ -f
> >>>> ~/.ssh/id_rsa ] ||\n        (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> >>>> &&\n         cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n    "]'
> >>>> returned non-zero exit status 255
> >>>
> >>> I have replaced the key and machine names with 'key' and 'MASTER'. I wait
> >>> through a few more cycles of the error message and finally, after 3 attempts,
> >>> the script quits with this message:
> >>>
> >>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error:
> >>>> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
> >>>> Please check that you have provided the correct --identity-file and
> >>>> --key-pair parameters and try again.
> >>>
> >>> So, YES, the .pem file is correct - I am currently running a smaller
> >>> cluster and can provision other machines on EC2 using that same file. Secondly,
> >>> the node it can't seem to connect to is the MASTER node. I have also gone
> >>> into the EC2 console and verified that all the machines are using the "key"
> >>> that corresponds to "key.pem".
> >>>
> >>> I have tried this command twice, and on a friend's machine, with no success.
> >>> However, I was able to provision a 15-machine cluster using m1.larges.
> >>>
> >>> Now I PAUSE for some period of time - 2-3 minutes (to write this email) -
> >>> and I call the same command with the "--resume" flag. This time it logs into
> >>> the master node just fine and begins to give the slaves SSH keys, and it
> >>> fails on a certain slave:
> >>>
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Traceback (most recent call last):
> >>>>   File "./spark_ec2.py", line 806, in <module>
> >>>>     main()
> >>>>   File "./spark_ec2.py", line 799, in main
> >>>>     real_main()
> >>>>   File "./spark_ec2.py", line 684, in real_main
> >>>>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
> >>>>   File "./spark_ec2.py", line 423, in setup_cluster
> >>>>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
> >>>>   File "./spark_ec2.py", line 640, in ssh_write
> >>>>     raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
> >>>> RuntimeError: ssh_write failed with error 255
> >>>
> >>> So I log into the EC2 console, TERMINATE that specific machine, and
> >>> re-resume. Now it finally appears to be installing software on the machines.
> >>>
> >>> Any ideas why certain machines refuse SSH connections, or why the master
> >>> refuses for several minutes and then starts accepting them?
> >>>
> >>> Thanks.
> >>>
> >>> -William
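
P.S. For anyone who lands on this thread later: before re-running with --resume, a quick sanity check that the nodes are actually accepting SSH can save a retry cycle. This is just a sketch, not part of spark-ec2 - the hostnames are placeholders you would replace with the public DNS names from the EC2 console, and it assumes the default root login that spark-ec2 uses:

    # Check that each node is reachable over SSH before resuming the launch.
    for host in ec2-XX-XXX-X-XX.compute-1.amazonaws.com \
                ec2-YY-YYY-Y-YY.compute-1.amazonaws.com; do
        if ssh -i ~/key.pem -o StrictHostKeyChecking=no -o ConnectTimeout=10 \
               root@"$host" true 2>/dev/null; then
            echo "$host: SSH is up"
        else
            echo "$host: port 22 not accepting connections yet"
        fi
    done

Any node that still refuses connections here is one the script will also fail on, so it's worth waiting a bit longer (or terminating that instance, as I did above) before resuming.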