[jira] [Commented] (LIBCLOUD-532) deploy_node(..) occasionally fails on EC2

Matthew Lehman (JIRA) Fri, 28 Mar 2014 09:56:09 -0700

    [ 
https://issues.apache.org/jira/browse/LIBCLOUD-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951010#comment-13951010
 ]


Matthew Lehman commented on LIBCLOUD-532:
-----------------------------------------

This is also an issue with Rackspace, where it appears to happen much more 
frequently. Paramiko's exception and the server logs are not much help here 
either, because all you get is:

fatal: No supported key exchange algorithms [preauth]
Received signal 15; terminating.

on the server side and:

DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:Traceback (most recent call last):
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1426, in run
DEBUG:paramiko.transport:    ptype, m = self.packetizer.read_message()
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/packet.py", line 335, in read_message
DEBUG:paramiko.transport:    header = self.read_all(self.__block_size_in, check_rekey=True)
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/packet.py", line 230, in read_all
DEBUG:paramiko.transport:    raise EOFError()

from Paramiko. We extended the wait_period manually, but adding retry logic 
here for the connect and for each of the sftp calls makes sense. Otherwise, 
retrying deploy_node externally just leaves you with multiple nodes that hit 
the exact same race condition.
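
A minimal sketch of the kind of retry wrapper meant here, assuming a plain 
Python helper. retry_call and FlakyConnect are hypothetical names used only 
for illustration; neither is part of libcloud or Paramiko. FlakyConnect stands 
in for a connect call that raises EOFError until the server is ready.

```python
import time


def retry_call(func, retryable, attempts=5, delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper, not part of libcloud or Paramiko.
    retryable is a tuple of exception classes worth retrying on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)  # give the server time to finish booting


class FlakyConnect(object):
    """Stand-in for an SSH connect that fails a few times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise EOFError()
        return "connected"
```

With Paramiko, the wrapped call would presumably be the client's connect and 
each sftp operation, and retryable would include EOFError along with whichever 
Paramiko exceptions the failed handshake raises. That mapping is an assumption 
and is not tested here.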

> deploy_node(..) occasionally fails on EC2
> -----------------------------------------
>
>                 Key: LIBCLOUD-532
>                 URL: https://issues.apache.org/jira/browse/LIBCLOUD-532
>             Project: Libcloud
>          Issue Type: Bug
>          Components: Compute
>         Environment: apache-libcloud 0.14.1, Windows 7
>            Reporter: Stefan Müller
>
> h2. Observed behaviour:
> When I'm starting EC2 nodes with {{deploy_node(ssh_key=...)}} I occasionally 
> (about 50% of the time) get an error message indicating that my key is not 
> a valid DSA key.
> This seems a bit odd, since I'm using an RSA key. 
> h2. Cause
> Turns out the cause is somewhere else:
> When starting a node, there is a short time during which the SSH daemon is 
> already up and running, but the public-key has not yet been put into the 
> {{authorized_keys}} file. Apparently the SSH daemon is started before Amazon's 
> key-injection magic has finished.
> During this short time (I'd guess about a second) SSH is rejecting the 
> private key, with an authentication error.
> libcloud then tries some other means of authentication during which it 
> apparently tries to parse the key as a DSA key, causing the reported error.
> Note that the extra-long timeout used for the SSH connection attempt is not 
> helping in this case, since the SSH server is replying already.
> h2. Suggested Fix
> I suggest reacting to a failed authentication with a few retries, with a 
> delay of a second or two between them, similar to {{wait_until_running()}}.
> h2. Workaround
> {code}
> deploy_node(...,ssh_alternate_usernames=["root" for _ in range(10)])
> {code}
> This causes libcloud to make several authentication attempts, which is slow 
> enough to delay until the public key is in place. It solves the problem 
> reliably, but not elegantly :)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
