Matt, Thanks for clarifying the logs, the older release I was using did not have much information in the scheduler log. I’ll double check on the Newton release to see how they look like. As you mention, a simple pass/fail result may not always explain why it fails but it is definitely good to know which filter failed. I still think that a VM failure to launch should be related to 1 ERROR log rather than 1 INFO log. In the example you provide, it is fine to have race conditions that result in a rejection and an ERROR log.
The main problem is that the nova API does not return sufficient detail on the reason for a NoValidHostFound and perhaps that should be fixed at that level. Extending the API to return a reason field which is a json dict that is returned by the various filters (with more meaningful filter-specific info) will help tremendously (no more need to go through the log to find out why). Regards, Alec From: Matt Riedemann <mriede...@gmail.com> Date: Wednesday, January 10, 2018 at 11:33 AM To: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org> Subject: Re: [Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote: +1 on the “no valid host found”, this one should be at the very top of the to-be-fixed list. Very difficult to troubleshoot filters in lab testing (let alone in production) as there can be many of them. This will get worst with more NFV optimization filters with so many combinations it gets really complex to debug when a VM cannot be launched with NFV optimized flavors. With the scheduler filtering engine, there should be a systematic way to log the reason for not finding a valid host - at the very least the error should display which filter failed as an ERROR (and not as DEBUG). We really need to avoid deploying with DEBUG log level but unfortunately this is the only way to troubleshoot. Too many debug-level logs are for development debug (meaning pretty much useless in any circumstances – developer forgot to remove before commit of the feature), many errors that should be logged as ERROR but have been logged as DEBUG only. The scheduler logs do print which filter returned 0 hosts for a given request at INFO level. For example, I was debugging this NoValidHost failure in a CI run: http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999 And tracing the request ID to the scheduler logs, and filtering on just INFO level logging to simulate what you'd have for the default in production, I found the filter that kicked it out here: http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582 And there is a summary like this: Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073 nova-scheduler[8932]: INFO nova.filters [None req-737984ae-3ae8-4506-a5f9-6655a4ebf206 tempest-ServersAdminTestJSON-787960229 tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter: (start: 1, end: 0)'] "end: 0" means that's the filter that rejected the request. Without digging into the actual code, the descriptions for the filters is here: https://docs.openstack.org/nova/latest/user/filter-scheduler.html Now just why this request failed requires a bit of understanding of why my environment looks like (this CI run is using the CachingScheduler), so there isn't a simple "ERROR: SameHostFilter rejected request because you're using the CachingScheduler which is racy by design and you created the instances in separate requests". You'd have a ton of false negative ERRORs in the logs because of valid reasons for rejected a request based on the current state of the system, which is going to make debugging real issues that much harder. -- Thanks, Matt _______________________________________________ OpenStack-operators mailing list OpenStack-operators@lists.openstack.org<mailto:OpenStack-operators@lists.openstack.org> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
_______________________________________________ OpenStack-operators mailing list OpenStack-operators@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators