To add to this: for this many machines (>= 20) I usually use at least
--wait 600.
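
For example, applied to the launch command from the thread below (a sketch
reusing William's placeholder key and cluster name):

$ ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
    -s 32 --instance-type=m1.medium --wait 600 launch cluster_name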

On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> William,
>
> The error you are seeing is misleading. There is no need to terminate the
> cluster and start over.
>
> Just re-run your launch command, but with the additional --resume option
> tacked on the end.
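>
> For example, with the command from your message (a sketch with the same
> placeholders):
>
>   ./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
>     -s 32 --instance-type=m1.medium launch cluster_name --resume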
>
> As Akhil explained, this happens because AWS is not starting up the
> instances as quickly as the script is expecting. You can increase the wait
> time to mitigate this problem.
>
> Nick
>
>
>
> On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>>
>> You need to increase the wait time (-w). The default is 120 seconds; you
>> may set it to a higher number like 300-400. The problem is that EC2 takes
>> some time to initialize the machines (which is sometimes more than 120
>> seconds).
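>>
>> For example (a sketch; keep the rest of your launch options as they were):
>>
>>   ./spark-ec2 -w 400 <your other options> launch <cluster name>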
>>
>> Thanks
>> Best Regards
>>
>>
>> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
>> <william....@distilnetworks.com> wrote:
>>>
>>> TL;DR >50% of the time I can't SSH into either my master or slave nodes
>>> and have to terminate all the machines and restart the EC2 cluster setup
>>> process.
>>>
>>> Hello,
>>>
>>> I'm trying to set up a Spark cluster on Amazon EC2, and I am finding the
>>> setup script to be delicate and unpredictable in terms of reliably allowing
>>> SSH logins to all of the slaves and the master. (I'm running Spark
>>> 0.9.1-hadoop1, since I intend to use Shark.) For instance, I call this
>>> command to provision a 32-slave cluster using spot instances:
>>>
>>>> $./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s
>>>> 32 --instance-type=m1.medium launch cluster_name
>>>
>>>
>>> After waiting for the instances to provision, I get the following output:
>>>
>>>> All 32 slaves granted
>>>> Launched master in us-east-1e, regid = r-f8444a89
>>>> Waiting for instances to start up...
>>>> Waiting 120 more seconds...
>>>> Generating cluster's SSH key on master...
>>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error executing remote command, retrying after 30 seconds: Command
>>>> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
>>>> '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n      [ -f
>>>> ~/.ssh/id_rsa ] ||\n        (ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
>>>> &&\n         cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n    "]'
>>>> returned non-zero exit status 255
>>>
>>>
>>> I have replaced the key and machine names with 'key' and 'MASTER'. I wait
>>> through a few more cycles of the error message, and finally, after 3
>>> attempts, the script quits with this message:
>>>
>>>
>>>> ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error:
>>>> Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
>>>> Please check that you have provided the correct --identity-file and
>>>> --key-pair parameters and try again.
>>>
>>> So, YES, the .pem file is correct - I am currently running a smaller
>>> cluster and can provision other machines on EC2 with that file. Secondly,
>>> the node it can't seem to connect to is the MASTER node. I have also gone
>>> into the EC2 console and verified that all the machines are using the "key"
>>> that corresponds to "key.pem".
>>>
>>> I have tried this command twice, and on a friend's machine, with no
>>> success. However, I was able to provision a 15-machine cluster using
>>> m1.large instances.
>>>
>>> Now I PAUSE for some period of time - 2-3 minutes (to write this email) -
>>> and call the same command with the "--resume" flag. This time it logs into
>>> the master node just fine and begins to distribute SSH keys to the slaves,
>>> but then it fails on a certain slave:
>>>>
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Error 255 while executing remote command, retrying after 30 seconds
>>>> ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
>>>> Connection refused
>>>> Traceback (most recent call last):
>>>>   File "./spark_ec2.py", line 806, in <module>
>>>>     main()
>>>>   File "./spark_ec2.py", line 799, in main
>>>>     real_main()
>>>>   File "./spark_ec2.py", line 684, in real_main
>>>>     setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>>>>   File "./spark_ec2.py", line 423, in setup_cluster
>>>>     ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
>>>>   File "./spark_ec2.py", line 640, in ssh_write
>>>>     raise RuntimeError("ssh_write failed with error %s" %
>>>> proc.returncode)
>>>> RuntimeError: ssh_write failed with error 255
>>>
>>> So I log into the EC2 console, TERMINATE that specific machine, and resume
>>> again. Now it finally appears to be installing software on the machines.
>>>
>>> Any ideas why certain machines refuse SSH connections, or why the master
>>> refuses connections for several minutes and then accepts them?
>>>
>>> Thanks.
>>>
>>> -William
>>>
>>>
>>
>
