On 07/07/2017 01:37 PM, Peter Doherty wrote:
Thanks. I had wrongly assumed it was being run automatically; with that
out of my mind, it didn't take long to figure out what was triggering
it. I'm running the Datadog agent, which is the source. It generated
enough noise in a week that I ended up with a million rows in the
nova.instance_faults table, the memory footprint of nova-api got very
large, and instance list queries took multiple minutes to respond.
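In case anyone else hits this, here is roughly how I confirmed the buildup. This is only a minimal sketch, assuming direct read-only access to the nova database and the pymysql driver; the host, credentials and user below are placeholders:

# Minimal sketch: count fault rows per instance to spot runaway growth.
# Assumes read access to the nova DB via pymysql; connection details
# are placeholders, adjust for your deployment.
import pymysql

conn = pymysql.connect(host='db.example.com', user='nova_ro',
                       password='secret', database='nova')
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM instance_faults")
        print("total fault rows:", cur.fetchone()[0])

        # Top offenders: instances accumulating the most fault records.
        cur.execute(
            "SELECT instance_uuid, COUNT(*) AS faults "
            "FROM instance_faults "
            "GROUP BY instance_uuid ORDER BY faults DESC LIMIT 10")
        for uuid, count in cur.fetchall():
            print(uuid, count)
finally:
    conn.close()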
Heh, yes, that performance issue when listing instances with a large
instance_faults table has come up before. We fixed that in Ocata,
though:
https://bugs.launchpad.net/nova/+bug/1632247
I can open a bug report about the log messages. I think it may be in
the nova/compute/manager.py code, which doesn't seem to gracefully
handle get_diagnostics being called on an instance that isn't running,
and ends up producing a lot of useless rows in the instance_faults table.
@wrap_instance_fault
def get_diagnostics(self, context, instance):
    """Retrieve diagnostics for an instance on this host."""
    current_power_state = self._get_power_state(context, instance)
    if current_power_state == power_state.RUNNING:
        LOG.info(_LI("Retrieving diagnostics"), context=context,
                 instance=instance)
        return self.driver.get_diagnostics(instance)
    else:
        raise exception.InstanceInvalidState(
            attr='power_state',
            instance_uuid=instance.uuid,
            state=instance.power_state,
            method='get_diagnostics')
Yep, that's the code that emits the exception. We should be just
returning an error to the user instead of raising an exception. And, we
should not be adding a record to the instance_faults table (which is
what that @wrap_instance_fault decorator does when it sees an exception
raised like that).
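Just to illustrate the direction (a sketch only, not the actual fix): the decorator could re-raise "expected" errors such as InstanceInvalidState without persisting them, the same way it already skips InstanceNotFound, so the caller still gets an error response but no instance_faults row is written.

# Illustrative sketch only, not the upstream fix: widen the early
# re-raise in wrap_instance_fault so "expected" exceptions propagate
# to the caller without writing an instance_faults row.
import functools
import sys

from oslo_utils import excutils

from nova import exception
from nova.compute import utils as compute_utils


def wrap_instance_fault(function):
    """Record an instance fault only for unexpected exceptions."""
    @functools.wraps(function)
    def decorated_function(self, context, *args, **kwargs):
        try:
            return function(self, context, *args, **kwargs)
        except (exception.InstanceNotFound,
                exception.InstanceInvalidState):
            # Expected conditions: return an error to the caller, but
            # do not persist a fault record or log a traceback here.
            raise
        except Exception as e:
            # Recover 'instance' from positional args if needed, then
            # record the fault and re-raise.
            kwargs.update(dict(zip(function.__code__.co_varnames[2:], args)))
            with excutils.save_and_reraise_exception():
                compute_utils.add_instance_fault_from_exc(
                    context, kwargs['instance'], e, sys.exc_info())
    return decorated_function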
If you could create a bug on LP for that, I'd very much appreciate it.
All the best,
-jay
Thanks Jay!
-Peter
On Fri, Jul 7, 2017 at 12:50 PM, Jay Pipes <jaypi...@gmail.com> wrote:
On 07/07/2017 12:30 PM, Peter Doherty wrote:
Hi,
If I'm interpreting this correctly, nova-compute is calling
get_diagnostics on all instances, including ones currently in a
shutdown state. It then throws an exception and adds an entry
to the instance_faults table in the database.
nova-compute logs this message:
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     executor_callback))
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     executor_callback)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 89, in wrapped
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     payload)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 195, in __exit__
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 72, in wrapped
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 378, in decorated_function
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 195, in __exit__
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 366, in decorated_function
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4089, in get_diagnostics
2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     method='get_diagnostics')
2017-07-07 16:30:10.017 23077 ERROR oslo_messaging.rpc.dispatcher InstanceInvalidState: Instance 6ab60005-ccbf-4bc2-95ac-7daf31716754 in power_state 4. Cannot get_diagnostics while the instance is in this state.
I don't think it should be trying to gather diagnostics on shutdown
instances, and even if it did, it shouldn't create a never-ending
stream of errors.
If anyone has any info on whether this might be a bug that is fixed
in the latest release, or whether I can turn off this behavior, it
would be appreciated.
get_diagnostics() doesn't run automatically. Something is triggering
a call to get_diagnostics() for each instance on the box (the
internal compute manager only has a get_diagnostics(instance) call
that takes one instance at a time). Not sure what is triggering that...
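For example (purely illustrative, not necessarily what your environment is running), a monitoring script that polls the os-server-diagnostics API for every server via python-novaclient would produce exactly this one-RPC-per-instance pattern. The auth details below are placeholders:

# Illustrative example of an external caller that triggers one
# get_diagnostics RPC per instance: a monitoring loop polling the
# os-server-diagnostics API via python-novaclient. Auth values are
# placeholders for this sketch.
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://keystone.example.com:5000/v3',
    username='monitoring',
    password='secret',
    project_name='admin',
    user_domain_name='Default',
    project_domain_name='Default')
sess = session.Session(auth=auth)
nova = client.Client('2', session=sess)

for server in nova.servers.list(search_opts={'all_tenants': 1}):
    # Each call becomes a get_diagnostics RPC to the compute node
    # hosting the server, regardless of the server's power state.
    nova.servers.diagnostics(server)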
I agree with you that ERRORs shouldn't be spewed into the
nova-compute logs like the above, though. That should be fixed.
Would you mind submitting a bug for that on Launchpad, Peter?
Thank you!
-jay
--
Peter Doherty
Systems Engineer, Systems Engineering
Brightcove Inc.
290 Congress St., 4th Floor, Boston, MA 02210
_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack