On 08/02/2018 06:18 PM, Michael Glasgow wrote:
On 08/02/18 15:04, Chris Friesen wrote:
On 08/02/2018 01:04 PM, melanie witt wrote:
The problem is an infamous one, which is, your users are trying to boot
instances and they get "No Valid Host" and an instance in ERROR
state. They contact support, and now support is trying to determine
why NoValidHost happened. In the past, they would turn on DEBUG log
level on the nova-scheduler, try another request, and take a look at
the scheduler logs.
At a previous Summit[1] there were some operators that said they just
always ran nova-scheduler with debug logging enabled in order to deal
with this issue, but that it was a pain [...]
I would go a bit further and say it's likely to be unacceptable on a
large cluster. It's expensive to deal with all those logs and to
manually comb through them for troubleshooting this issue type, which
can happen frequently with some setups. Secondarily there are
performance and security concerns with leaving debug on all the time.
As to "defining the problem", I think it's what Melanie said. It's
about asking for X and the system saying, "sorry, can't give you X" with
no further detail or even means of discovering it.
More generally, any time a service fails to deliver a resource which it
is primarily designed to deliver, it seems to me at this stage that this
should probably be taken a bit more seriously than just "check the log
file, maybe there's something in there?" From the user's perspective,
if nova fails to produce an instance, or cinder fails to produce a
volume, or neutron fails to build a subnet, that's kind of a big deal,
right?
In such cases, would it be possible to generate a detailed exception
object which contains all the necessary info to ascertain why that
specific failure occurred?
It's not an exception. It's the normal course of events. NoValidHost
means there were no compute nodes that met the requested resource amounts.
There are plenty of ways the operator can get usage and trait information
and determine if there are providers that meet the requested amounts and
required/forbidden traits.
What we're talking about here is debugging information, plain and simple.
If a SELECT statement against an Oracle DB returns 0 rows, is that an
exception? No. Would an operator need to re-send the SELECT statement
prefixed with EXPLAIN PLAN FOR in order to get information about what
indexes were used to winnow the result set (to zero)? Yes. Either that,
or the
operator would need to gradually re-execute smaller SELECT statements
containing fewer filters in order to determine which join or predicate
caused a result set to contain zero rows.
That's exactly what we're talking about here. It's not an exception.
It's debugging information.
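
As a rough illustration of that analogy (this is not nova's scheduler
code), here is a minimal Python sketch that uses the standard-library
sqlite3 module in place of Oracle. The hosts table, its columns, the
sample rows, and the helper names count_after_each_predicate and
first_eliminating_predicate are all hypothetical and exist only for this
example. It re-runs the same query with one more predicate each time and
records how many rows survive, which is the manual "fewer filters" loop
described above.

```python
import sqlite3


def count_after_each_predicate(conn, table, predicates):
    """Apply predicates cumulatively; return (predicate, surviving_rows) pairs.

    The table name and predicates are interpolated into SQL as-is, so they
    must be trusted strings. That is acceptable for a debugging sketch only.
    """
    results = []
    applied = []
    for predicate in predicates:
        applied.append(predicate)
        where = " AND ".join(applied)
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {where}"
        ).fetchone()
        results.append((predicate, count))
    return results


def first_eliminating_predicate(results):
    """Return the first predicate that left zero rows, or None."""
    for predicate, count in results:
        if count == 0:
            return predicate
    return None


# Hypothetical inventory, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts ("
    "name TEXT, free_vcpus INTEGER, free_ram_mb INTEGER, has_ssd INTEGER)"
)
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?)",
    [
        ("node1", 8, 16384, 0),
        ("node2", 2, 65536, 1),
        ("node3", 16, 4096, 1),
    ],
)

predicates = ["free_vcpus >= 4", "free_ram_mb >= 8192", "has_ssd = 1"]
results = count_after_each_predicate(conn, "hosts", predicates)
culprit = first_eliminating_predicate(results)

for predicate, count in results:
    print(f"{predicate!r}: {count} row(s) remain")
print("first predicate leaving zero rows:", culprit)
```

The result depends on the order in which predicates are applied: the
predicate reported is only the one that emptied the set given the ones
applied before it. That is the same caveat that applies when reading
per-filter debug output from a scheduler.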
Best,
-jay
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)