[Yahoo-eng-team] [Bug 1349617] Re: SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

Salvatore Orlando Thu, 04 Sep 2014 17:22:24 -0700

So this is what I found out.

Instance log from a failing istance [1]. The important bit there is
"cirros-apply-local already run per instance", and not "no userdata for
datasource" as initially thought. That was just me being stupid and
thinking the public key was part of user data. That was really silly.

"cirros-apply-local already run per instance" seems to appear in the console
log for all SSH protocol banner failures [2]. the presence of duplicates makes
it difficult to prove correlation with SSH protocol banner failures.
However, they key here is that local testing revealing that when the SSH
connection fails there is no authorized_keys file in /home/cirros/.ssh. This
obviously explains the authentication failure. Whether the subsequent SSH
protocol banner errors are due to the cited MTU problems or else it has to be
clarified yet.
What is certain is that cirros processes the data source containing the public
SSH key before starting sshd. So the auth failures cannot be due to the init
process not being yet complete.

The cirros initialization process executes a set of steps on an instance basis.
This steps include setting public ssh keys.
"On an instance basis " means that these steps are not executed at each boot
but once per instance.

cirros-apply local [3] is the step which processes, among other things, ssh
public keys.
It is called by the cirros-per scripts [4], which at the end of its execution
writes a marker file [5]. The cirros-per process will terminate if when
executed the marker file is already present [6]

During the failing test it has been observed the following:

from the console log:
[ 3.696172] rtc_cmos 00:01: setting system clock to 2014-09-04 19:05:27 UTC
(1409857527)

from the cirros-apply marker directory:
$ ls -le /var/lib/cirros/sem/
total 3
-rw-r--r-- 1 root root 35 Thu Sep 4 13:06:28 2014
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.check-version
-rw-r--r-- 1 root root 22 Thu Sep 4 13:05:07 2014
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.cirros-apply-local
-rw-r--r-- 1 root root 24 Thu Sep 4 13:06:31 2014
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.userdata

as cirros defaults to MDT (UTC -6), this means the apply-local marker has been
applied BEFORE instance boot.
This is consistent with the situation we're seeing where the failure always
occur after events such as resize or stop.
The ssh public key should be applied in the first boot of the VM. When it's
restarted the process is skipped as the key should already be there.
Unfortunately the key isn't there, which is a bit of a mystery, especially
since the instance is powered off in a graceful way thanks to [7].

Nevertheless when an instance receives a shutdown signal it sends a TERM signal
to all processes. Meaning that the apply-local spawned by cirros-per at [4] can
be killed before it actually writes the key.
However, cirros-per even if it retrieves the return code it writes the marker
in any case [5].
This creates the conditions for a situation where the marker can be present
without having actually completed the apply-local phase. As a result it is
possible to have guests without SSH public key which manifest the failure
reported in this bug.

Why is this happening only recently.
It seems a paradox, but [7] might be the reason.
This patch (and a few similar others) introduced soft instance shutdown. Soft
instance shutdown avoid the abrupt shutdown which can actually leave the
cirros-init process incomplete.
However, since cirros-per writes the marker regardless whether the process it
called terminated with 0 or other code, it does not guarantee a successful
completion.

On the other hand, introducing [7] added a delay before stopping the
instance. For instance in the case [8] it took 13 seconds. Previously
tempest was just immediately powering off the instance, not giving it a
chance to run cirros-init. Now, with the added dealy, the cirros-init
process is executed and this might be the reason that this failure,
which was previously occasional, has recently become the biggest gate
breaker.

What are the possible solutions?
1) fix cirros-per to not write the marker if the called process returned a
non-zero value. The feasibility of this depends on wheter cirros-apply can be
considered idempotent
2) adjust tempest to wait a little after the instance become ACTIVE. It could
wait a fixed amount of time just to ensure instance initialization is completed.
3) Your proposal here.

[1] http://paste.openstack.org/show/106049/
[2]
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2lycm9zLWFwcGx5LWxvY2FsIGFscmVhZHkgcnVuIHBlciBpbnN0YW5jZVwiICBBTkQgdGFnczpjb25zb2xlIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA5ODc0NDg4NDM0fQ==
[3]
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/sbin/cirros-apply
[4]
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L102
[5]
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L106
[6]
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L92
[7]
http://git.openstack.org/cgit/openstack/nova/diff/nova/virt/libvirt/driver.py?id=c07ed15415c0ec3c5862f437f440632eff1e94df
[8]
http://logs.openstack.org/95/117695/5/gate/gate-tempest-dsvm-neutron-full/d28fe19/logs/screen-n-cpu.txt.gz#_2014-09-02_16_36_04_542

** Also affects: cirros
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1349617

Title:
SSHException: Error reading SSH protocol banner[Errno 104] Connection
reset by peer

Status in CirrOS a tiny cloud guest:
New
Status in Grenade - OpenStack upgrade testing:
New
Status in OpenStack Neutron (virtual network service):
New
Status in OpenStack Compute (Nova):
Confirmed
Status in Tempest:
New

Bug description:
Noticed a drop in categorized bugs on grenade jobs, so looking at
latest I see this:

http://logs.openstack.org/63/108363/5/gate/gate-grenade-dsvm-partial-
ncpu/1458072/console.html

Running this query:

message:"Failed to establish authenticated ssh connection to cirros@"
AND message:"(Error reading SSH protocol banner[Errno 104] Connection
reset by peer). Number attempts: 18. Retry after 19 seconds." AND
tags:"console"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmFpbGVkIHRvIGVzdGFibGlzaCBhdXRoZW50aWNhdGVkIHNzaCBjb25uZWN0aW9uIHRvIGNpcnJvc0BcIiBBTkQgbWVzc2FnZTpcIihFcnJvciByZWFkaW5nIFNTSCBwcm90b2NvbCBiYW5uZXJbRXJybm8gMTA0XSBDb25uZWN0aW9uIHJlc2V0IGJ5IHBlZXIpLiBOdW1iZXIgYXR0ZW1wdHM6IDE4LiBSZXRyeSBhZnRlciAxOSBzZWNvbmRzLlwiIEFORCB0YWdzOlwiY29uc29sZVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTkwMTEwMzMyLCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

I get 28 hits in 7 days, and it seems to be very particular to grenade
jobs.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cirros/+bug/1349617/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1349617] Re: SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

Reply via email to