So this is what I found out.

Instance log from a failing istance [1]. The important bit there is
"cirros-apply-local already run per instance", and not "no userdata for
datasource" as initially thought. That was just me being stupid and
thinking the public key was part of user data. That was really silly.

"cirros-apply-local already run per instance" seems to appear in the console 
log for all SSH protocol banner failures [2]. the presence of duplicates makes 
it difficult to prove correlation with SSH protocol banner failures.
However, they key here is that local testing revealing that when the SSH 
connection fails there is no authorized_keys file in /home/cirros/.ssh. This 
obviously explains the authentication failure. Whether the subsequent SSH 
protocol banner errors are due to the cited MTU problems or else it has to be 
clarified yet.
What is certain is that cirros processes the data source containing the public 
SSH key before starting sshd. So the auth failures cannot be due to the init 
process not being yet complete.

The cirros initialization process executes a set of steps on an instance basis. 
This steps include setting public ssh keys.
"On an instance basis " means that these steps are not executed at each boot 
but once per instance.

cirros-apply local [3] is the step which processes, among other things, ssh 
public keys.
It is called by the cirros-per scripts [4], which at the end of its execution 
writes a marker file [5]. The cirros-per process will terminate if when 
executed the marker file is already present [6]

During the failing test it has been observed the following:

from the console log: 
[    3.696172] rtc_cmos 00:01: setting system clock to 2014-09-04 19:05:27 UTC 
(1409857527)

from the cirros-apply marker directory:
$ ls -le /var/lib/cirros/sem/
total 3
-rw-r--r--    1 root     root            35 Thu Sep  4 13:06:28 2014 
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.check-version
-rw-r--r--    1 root     root            22 Thu Sep  4 13:05:07 2014 
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.cirros-apply-local
-rw-r--r--    1 root     root            24 Thu Sep  4 13:06:31 2014 
instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.userdata

as cirros defaults to MDT (UTC -6), this means the apply-local marker has been 
applied BEFORE instance boot.
This is consistent with the situation we're seeing where the failure always 
occur after events such as resize or stop.
The ssh public key should be applied in the first boot of the VM. When it's 
restarted the process is skipped as the key should already be there. 
Unfortunately the key isn't there, which is a bit of a mystery, especially 
since the instance is powered off in a graceful way thanks to [7].

Nevertheless when an instance receives a shutdown signal it sends a TERM signal 
to all processes. Meaning that the apply-local spawned by cirros-per at [4] can 
be killed before it actually writes the key.
However, cirros-per even if it retrieves the return code it writes the marker 
in any case [5]. 
This creates the conditions for a situation where the marker can be present 
without having actually completed the apply-local phase. As a result it is 
possible to have guests without SSH public key which manifest the failure 
reported in this bug.

Why is this happening only recently.
It seems a paradox, but [7] might be the reason.
This patch (and a few similar others) introduced soft instance shutdown. Soft 
instance shutdown avoid the abrupt shutdown which can actually leave the 
cirros-init process incomplete.
However, since cirros-per writes the marker regardless whether the process it 
called terminated with 0 or other code, it does not guarantee a successful 
completion.

On the other hand, introducing [7] added a delay before stopping the
instance. For instance in the case [8] it took 13 seconds. Previously
tempest was just immediately powering off the instance, not giving it a
chance to run cirros-init. Now, with the added dealy, the cirros-init
process is executed and this might be the reason that this failure,
which was previously occasional, has recently become the biggest gate
breaker.

What are the possible solutions?
1) fix cirros-per to not write the marker if the called process returned a 
non-zero value. The feasibility of this depends on wheter cirros-apply can be 
considered idempotent
2) adjust tempest to wait a little after the instance become ACTIVE. It could 
wait a fixed amount of time just to ensure instance initialization is completed.
3) Your proposal here.


[1] http://paste.openstack.org/show/106049/
[2] 
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2lycm9zLWFwcGx5LWxvY2FsIGFscmVhZHkgcnVuIHBlciBpbnN0YW5jZVwiICBBTkQgdGFnczpjb25zb2xlIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA5ODc0NDg4NDM0fQ==
[3] 
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/sbin/cirros-apply
[4] 
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L102
[5] 
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L106
[6]  
http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L92
[7] 
http://git.openstack.org/cgit/openstack/nova/diff/nova/virt/libvirt/driver.py?id=c07ed15415c0ec3c5862f437f440632eff1e94df
[8] 
http://logs.openstack.org/95/117695/5/gate/gate-tempest-dsvm-neutron-full/d28fe19/logs/screen-n-cpu.txt.gz#_2014-09-02_16_36_04_542

** Also affects: cirros
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1349617

Title:
  SSHException: Error reading SSH protocol banner[Errno 104] Connection
  reset by peer

Status in CirrOS a tiny cloud guest:
  New
Status in Grenade - OpenStack upgrade testing:
  New
Status in OpenStack Neutron (virtual network service):
  New
Status in OpenStack Compute (Nova):
  Confirmed
Status in Tempest:
  New

Bug description:
  Noticed a drop in categorized bugs on grenade jobs, so looking at
  latest I see this:

  http://logs.openstack.org/63/108363/5/gate/gate-grenade-dsvm-partial-
  ncpu/1458072/console.html

  Running this query:

  message:"Failed to establish authenticated ssh connection to cirros@"
  AND message:"(Error reading SSH protocol banner[Errno 104] Connection
  reset by peer). Number attempts: 18. Retry after 19 seconds." AND
  tags:"console"

  
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmFpbGVkIHRvIGVzdGFibGlzaCBhdXRoZW50aWNhdGVkIHNzaCBjb25uZWN0aW9uIHRvIGNpcnJvc0BcIiBBTkQgbWVzc2FnZTpcIihFcnJvciByZWFkaW5nIFNTSCBwcm90b2NvbCBiYW5uZXJbRXJybm8gMTA0XSBDb25uZWN0aW9uIHJlc2V0IGJ5IHBlZXIpLiBOdW1iZXIgYXR0ZW1wdHM6IDE4LiBSZXRyeSBhZnRlciAxOSBzZWNvbmRzLlwiIEFORCB0YWdzOlwiY29uc29sZVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTkwMTEwMzMyLCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

  I get 28 hits in 7 days, and it seems to be very particular to grenade
  jobs.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cirros/+bug/1349617/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

Reply via email to