To add to this: for this many machines (>= 20), I usually use at least --wait 600.
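For example, taking William's launch command from below, that would look
something like this (just a sketch; every flag other than --wait is copied
from his original command):

    $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
        -s 32 --instance-type=m1.medium --wait 600 launch cluster_name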
On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> William,
>
> The error you are seeing is misleading. There is no need to terminate the
> cluster and start over.
>
> Just re-run your launch command, but with the additional --resume option
> tacked on the end.
>
> As Akhil explained, this happens because AWS is not starting up the
> instances as quickly as the script expects. You can increase the wait
> time to mitigate this problem.
>
> Nick
>
> On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>>
>> You need to increase the wait time (-w); the default is 120 seconds, and
>> you may need to set it to a higher number like 300-400. The problem is
>> that EC2 sometimes takes more than 120 seconds to initiate a machine.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
>> <william....@distilnetworks.com> wrote:
>>>
>>> TL;DR: >50% of the time I can't SSH into either my master or slave
>>> nodes and have to terminate all the machines and restart the EC2
>>> cluster setup process.
>>>
>>> Hello,
>>>
>>> I'm trying to set up a Spark cluster on Amazon EC2. I am finding the
>>> setup script to be delicate and unpredictable in terms of reliably
>>> allowing SSH logins to all of the slaves and the master. For instance
>>> (I'm running Spark 0.9.1-hadoop1, since I intend to use Shark), I call
>>> this command to provision a 32-slave cluster using spot instances:
>>>
>>>> $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
>>>>     -s 32 --instance-type=m1.medium launch cluster_name
>>>
>>> After waiting for the instances to provision, I get the following
>>> output:
>>>
>>>> All 32 slaves granted
>>>> Launched master in us-east-1e, regid = r-f8444a89
>>>> Waiting for instances to start up...
>>>> Waiting 120 more seconds...
>>>> Generating cluster's SSH key on master...
>>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error executing remote command, retrying after 30 seconds: Command
>>>> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
>>>> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n [ -f
>>>> ~/.ssh/id_rsa ] ||\n (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
>>>> &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n "]'
>>>> returned non-zero exit status 255
>>>
>>> I have replaced the actual key and machine names with 'key' and
>>> 'MASTER'. I wait through a few more cycles of the error message, and
>>> finally, after 3 attempts, the script quits with this message:
>>>
>>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error:
>>>> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
>>>> Please check that you have provided the correct --identity-file and
>>>> --key-pair parameters and try again.
>>>
>>> So, YES, the .pem file is correct - I am currently running a smaller
>>> cluster and can provision other machines on EC2 with that file.
>>> Secondly, the node it can't seem to connect to is the MASTER node. I
>>> have also gone into the EC2 console and verified that all the machines
>>> are using the "key" that corresponds to "key.pem".
>>>
>>> I have tried this command twice, and once on a friend's machine, with
>>> no success. However, I was able to provision a 15-machine cluster using
>>> m1.larges.
>>>
>>> Now I PAUSE for some period of time - 2-3 minutes (to write this
>>> email) - and I call the same command with the "--resume" flag.
>>> This time it logs into the master node just fine and begins to give
>>> the slaves SSH keys, but it fails on a certain slave:
>>>
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Traceback (most recent call last):
>>>>   File "./spark_ec2.py", line 806, in <module>
>>>>     main()
>>>>   File "./spark_ec2.py", line 799, in main
>>>>     real_main()
>>>>   File "./spark_ec2.py", line 684, in real_main
>>>>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>>>>   File "./spark_ec2.py", line 423, in setup_cluster
>>>>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
>>>>   File "./spark_ec2.py", line 640, in ssh_write
>>>>     raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
>>>> RuntimeError: ssh_write failed with error 255
>>>
>>> So I log into the EC2 console, TERMINATE that specific machine, and
>>> resume again. Now it finally appears to be installing software on the
>>> machines.
>>>
>>> Any ideas why certain machines refuse SSH connections, or why the
>>> master refuses for several minutes and then allows a connection?
>>>
>>> Thanks.
>>>
>>> -William
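P.S. To be concrete about Nick's suggestion: resuming is just the original
launch command re-run with --resume tacked on the end, e.g. (a sketch
reusing William's flags, and assuming the script accepts the option after
the positional arguments, per Nick's "tacked on the end"):

    $ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
        -s 32 --instance-type=m1.medium --wait 600 launch cluster_name --resume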