Re: [Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler

Alec Hothan (ahothan) Wed, 10 Jan 2018 11:53:55 -0800

Matt,

Thanks for clarifying the logs, the older release I was using did not have much 
information in the scheduler log. I’ll double check on the Newton release to 
see how they look like.
As you mention, a simple pass/fail result may not always explain why it fails 
but it is definitely good to know which filter failed.
I still think that a VM failure to launch should be related to 1 ERROR log 
rather than 1 INFO log. In the example you provide, it is fine to have race 
conditions that result in a rejection and an ERROR log.


The main problem is that the nova API does not return sufficient detail on the 
reason for a NoValidHostFound and perhaps that should be fixed at that level. 
Extending the API to return a reason field which is a json dict that is 
returned by the various filters  (with more meaningful filter-specific info) 
will help tremendously (no more need to go through the log to find out why).

Regards,

  Alec



From: Matt Riedemann <mriede...@gmail.com>
Date: Wednesday, January 10, 2018 at 11:33 AM
To: "openstack-operators@lists.openstack.org" 
<openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [openstack-operators][nova] Verbosity of 
nova scheduler

On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
+1 on the “no valid host found”, this one should be at the very top of
the to-be-fixed list.
Very difficult to troubleshoot filters in lab testing (let alone in
production) as there can be many of them. This will get worst with more
NFV optimization filters with so many combinations it gets really
complex to debug when a VM cannot be launched with NFV optimized
flavors. With the scheduler filtering engine, there should be a
systematic way to log the reason for not finding a valid host - at the
very least the error should display which filter failed as an ERROR (and
not as DEBUG).
We really need to avoid deploying with DEBUG log level but unfortunately
this is the only way to troubleshoot. Too many debug-level logs are for
development debug (meaning pretty much useless in any circumstances –
developer forgot to remove before commit of the feature), many errors
that should be logged as ERROR but have been logged as DEBUG only.

The scheduler logs do print which filter returned 0 hosts for a given
request at INFO level.

For example, I was debugging this NoValidHost failure in a CI run:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999

And tracing the request ID to the scheduler logs, and filtering on just
INFO level logging to simulate what you'd have for the default in
production, I found the filter that kicked it out here:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582

And there is a summary like this:

Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073
nova-scheduler[8932]: INFO nova.filters [None
req-737984ae-3ae8-4506-a5f9-6655a4ebf206
tempest-ServersAdminTestJSON-787960229
tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for
the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'.
Filter results: ['RetryFilter: (start: 1, end: 1)',
'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1,
end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)',
'ImagePropertiesFilter: (start: 1, end: 1)',
'ServerGroupAntiAffinityFilter: (start: 1, end: 1)',
'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter:
(start: 1, end: 0)']

"end: 0" means that's the filter that rejected the request. Without
digging into the actual code, the descriptions for the filters is here:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html

Now just why this request failed requires a bit of understanding of why
my environment looks like (this CI run is using the CachingScheduler),
so there isn't a simple "ERROR: SameHostFilter rejected request because
you're using the CachingScheduler which is racy by design and you
created the instances in separate requests". You'd have a ton of false
negative ERRORs in the logs because of valid reasons for rejected a
request based on the current state of the system, which is going to make
debugging real issues that much harder.

--

Thanks,

Matt

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org<mailto:OpenStack-operators@lists.openstack.org>
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Re: [Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler

Reply via email to