[ https://issues.apache.org/jira/browse/LIBCLOUD-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951010#comment-13951010 ]
Matthew Lehman commented on LIBCLOUD-532:
-----------------------------------------

This is also an issue with Rackspace, and it appears to happen much more frequently there. Paramiko's exception and the server logs leave you a bit lost here as well, because all you get on the server side is:

fatal: No supported key exchange algorithms [preauth]
Received signal 15; terminating.

and from Paramiko:

DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:Traceback (most recent call last):
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1426, in run
DEBUG:paramiko.transport:    ptype, m = self.packetizer.read_message()
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/packet.py", line 335, in read_message
DEBUG:paramiko.transport:    header = self.read_all(self.__block_size_in, check_rekey=True)
DEBUG:paramiko.transport:  File "/usr/local/lib/python2.7/dist-packages/paramiko/packet.py", line 230, in read_all
DEBUG:paramiko.transport:    raise EOFError()

We extended the wait_period manually, but adding retry logic here for the connect and for each of the sftp calls makes sense. Otherwise the external deploy_node retry just results in multiple nodes hitting the exact same race condition.

> deploy_node(..) occasionally fails on EC2
> -----------------------------------------
>
>                 Key: LIBCLOUD-532
>                 URL: https://issues.apache.org/jira/browse/LIBCLOUD-532
>             Project: Libcloud
>          Issue Type: Bug
>          Components: Compute
>         Environment: apache-libcloud 0.14.1, Windows 7
>            Reporter: Stefan Müller
>
> h2. Observed behaviour:
> When I'm starting EC2 nodes with {{deploy_node(ssh_key=...)}} I occasionally
> (about 50% of the time) get an error message indicating that my key is not
> a valid DSA key.
> This seems a bit odd, since I'm using an RSA key.
> h2. Cause
> It turns out the cause is somewhere else:
> When starting a node, there is a short time during which the SSH daemon is
> already up and running, but the public key has not yet been put into the
> {{authorized_keys}} file. Apparently the SSH daemon is started before Amazon's
> key-injection magic has finished.
> During this short time (I'd guess about a second) SSH rejects the
> private key with an authentication error.
> libcloud then tries some other means of authentication, during which it
> apparently tries to parse the key as a DSA key, causing the reported error.
> Note that the extra-long timeout used for the SSH connection attempt does not
> help in this case, since the SSH server is already replying.
> h2. Suggested Fix
> I suggest reacting to a failed authentication with a few retries, with a
> delay of a second or two between them, similar to {{wait_until_running()}}.
> h2. Workaround
> {code}
> deploy_node(..., ssh_alternate_usernames=["root" for _ in range(10)])
> {code}
> This causes libcloud to make several authentication attempts. It is slow
> enough to delay things until the public key is in place. It solves the problem
> reliably, but not elegantly :)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
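
For illustration, the retry-with-delay idea suggested in this issue could look roughly like the sketch below. This is not libcloud's or Paramiko's code: retry(), FakeAuthError and make_flaky_connect() are hypothetical names made up for the example, and the fake connect only stands in for an SSH connect that is rejected until the key has been injected.

```python
import time


def retry(operation, retry_on, attempts=5, delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    retry_on is the exception type (or tuple of types) treated as
    transient; any other exception propagates immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)  # give the node time to finish key injection


class FakeAuthError(Exception):
    """Hypothetical stand-in for an SSH authentication failure."""


def make_flaky_connect(failures):
    """Build a fake connect() that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def connect():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeAuthError("key not yet in authorized_keys")
        return "connected"

    return connect, state


if __name__ == "__main__":
    connect, state = make_flaky_connect(failures=2)
    print(retry(connect, FakeAuthError, attempts=5, delay=0.1))
    print("attempts used:", state["calls"])
```

With a real client, retry_on would presumably be something like (paramiko.AuthenticationException, EOFError), given the EOFError in the traceback above, but which exceptions are actually safe to retry is an assumption that would need checking. Retrying only the connect (and each sftp call) avoids repeating the whole deploy_node call, which would create additional nodes.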