Ah, thanks for the help! That worked great.

On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang <zonghen...@gmail.com>
wrote:

> To add to this: for this many (>= 20) machines I usually use at least
> --wait 600.
>
> On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > William,
> >
> > The error you are seeing is misleading. There is no need to terminate the
> > cluster and start over.
> >
> > Just re-run your launch command, but with the additional --resume option
> > tacked on the end.
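> > For example, with the launch command from William's original email, the
> > re-run would look like this (a sketch; the key name, zone, and cluster
> > name are his placeholders):

```shell
./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
  -s 32 --instance-type=m1.medium launch cluster_name --resume
```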
> >
> > As Akhil explained, this happens because AWS is not starting up the
> > instances as quickly as the script is expecting. You can increase the
> > wait time to mitigate this problem.
> >
> > Nick
> >
> >
> >
> > On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das <ak...@sigmoidanalytics.com>
> > wrote:
> >>
> >> You need to increase the wait time (-w); the default is 120 seconds, and you
> >> may set it to a higher number like 300-400. The problem is that EC2 takes
> >> some time to initiate the machine (which is > 120 seconds sometimes).
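> >> For example (a sketch; the key file and cluster name are placeholders,
> >> and the other launch options would stay as in your original command):

```shell
./spark-ec2 -k key -i ~/key.pem -w 400 launch cluster_name
```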
> >>
> >> Thanks
> >> Best Regards
> >>
> >>
> >> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
> >> <william....@distilnetworks.com> wrote:
> >>>
> >>> TL;DR: >50% of the time I can't SSH into either my master or slave nodes
> >>> and have to terminate all the machines and restart the EC2 cluster setup
> >>> process.
> >>>
> >>> Hello,
> >>>
> >>> I'm trying to set up a Spark cluster on Amazon EC2. I am finding the setup
> >>> script to be delicate and unpredictable in terms of reliably allowing SSH
> >>> logins to all of the slaves and the master. For instance, I'm running Spark
> >>> 0.9.1-hadoop1, since I intend to use Shark. I call this command to provision
> >>> a 32-slave cluster using spot instances:
> >>>
> >>>> $./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s
> >>>> 32 --instance-type=m1.medium launch cluster_name
> >>>
> >>>
> >>> After waiting for the instances to provision, I get the following output:
> >>>
> >>>> All 32 slaves granted
> >>>> Launched master in us-east-1e, regid = r-f8444a89
> >>>> Waiting for instances to start up...
> >>>> Waiting 120 more seconds...
> >>>> Generating cluster's SSH key on master...
> >>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error executing remote command, retrying after 30 seconds: Command
> >>>> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
> >>>> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n      [ -f
> >>>> ~/.ssh/id_rsa ] ||\n        (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> >>>> &&\n         cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n    "]'
> >>>> returned non-zero exit status 255
> >>>
> >>>
> >>> I have replaced the key and machine names with 'MASTER' and 'key'. I wait
> >>> through a few more cycles of the error message and finally, after 3 attempts,
> >>> the script quits with this message:
> >>>
> >>>
> >>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error:
> >>>> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
> >>>> Please check that you have provided the correct --identity-file and
> >>>> --key-pair parameters and try again.
> >>>
> >>> So, YES, the .pem file is correct - I am currently running a smaller
> >>> cluster and can provision other machines on EC2 and use that file. Secondly,
> >>> the node it can't seem to connect to is the MASTER node. I have also gone
> >>> into the EC2 console and verified that all the machines are using the "key"
> >>> that corresponds to "key.pem".
> >>>
> >>> I have tried this command twice, and on a friend's machine, with no success.
> >>> However, I was able to provision a 15-machine cluster using m1.larges.
> >>>
> >>> Now I PAUSE for some period of time - 2-3 minutes (to write this email) -
> >>> and I call the same command with the "--resume" flag. This time it logs into
> >>> the master node just fine and begins to give the slaves SSH keys, and it
> >>> fails on a certain slave:
> >>>>
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Error 255 while executing remote command, retrying after 30 seconds
> >>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
> >>>> Connection refused
> >>>> Traceback (most recent call last):
> >>>>   File "./spark_ec2.py", line 806, in <module>
> >>>>     main()
> >>>>   File "./spark_ec2.py", line 799, in main
> >>>>     real_main()
> >>>>   File "./spark_ec2.py", line 684, in real_main
> >>>>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
> >>>>   File "./spark_ec2.py", line 423, in setup_cluster
> >>>>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
> >>>>   File "./spark_ec2.py", line 640, in ssh_write
> >>>>     raise RuntimeError("ssh_write failed with error %s" %
> >>>> proc.returncode)
> >>>> RuntimeError: ssh_write failed with error 255
> >>>
> >>> So I log into the EC2 console and TERMINATE that specific machine, and
> >>> re-resume. Now it finally appears to be installing software on the machines.
> >>>
> >>> Any ideas why certain machines refuse SSH connections, or why the master
> >>> refuses connections for several minutes and then accepts them?
> >>>
> >>> Thanks.
> >>>
> >>> -William
> >>>
> >>>
> >>
> >
>
