On 07/07/2017 01:37 PM, Peter Doherty wrote:
Thanks. I had wrongly assumed it was being run automatically; once I ruled that out, it didn't take long to figure out what was triggering it. The source is the Datadog agent I'm running. In a week it generated enough noise that I ended up with a million rows in the nova.instance_faults table, the memory footprint of nova-api got very large, and instance list queries started taking multiple minutes.

Heh, yes, that performance issue when listing instances with a large instance_faults table has come up before. We fixed that in Ocata, though:

https://bugs.launchpad.net/nova/+bug/1632247

I can open a bug report about the log messages. I think the problem is in the nova/compute/manager.py code, which doesn't gracefully handle get_diagnostics being called on an instance that isn't running, and ends up producing a lot of useless rows in the instance_faults table.

    @wrap_instance_fault
    def get_diagnostics(self, context, instance):
        """Retrieve diagnostics for an instance on this host."""
        current_power_state = self._get_power_state(context, instance)
        if current_power_state == power_state.RUNNING:
            LOG.info(_LI("Retrieving diagnostics"), context=context,
                     instance=instance)
            return self.driver.get_diagnostics(instance)
        else:
            raise exception.InstanceInvalidState(
                attr='power_state',
                instance_uuid=instance.uuid,
                state=instance.power_state,
                method='get_diagnostics')

Yep, that's the code that emits the exception. We should just be returning an error to the user instead of raising an exception, and we should not be adding a record to the instance_faults table (which is what that @wrap_instance_fault decorator does when it sees an exception raised like that).
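
For reference, that decorator boils down to roughly the following. This is a simplified sketch pieced together from the traceback in your original mail below, not the verbatim nova source, and the import paths are my assumption:

    # Simplified, illustrative sketch of what wrap_instance_fault does;
    # not the exact nova code. Import paths are assumed for illustration.
    import functools
    import sys

    from oslo_utils import excutils

    from nova.compute import utils as compute_utils


    def wrap_instance_fault(function):
        @functools.wraps(function)
        def decorated_function(self, context, *args, **kwargs):
            try:
                return function(self, context, *args, **kwargs)
            except Exception as e:
                # Any exception raised by the wrapped method -- including
                # InstanceInvalidState from get_diagnostics() -- gets
                # written to the instance_faults table before being
                # re-raised to the RPC layer.
                with excutils.save_and_reraise_exception():
                    compute_utils.add_instance_fault_from_exc(
                        context, kwargs['instance'], e, sys.exc_info())
        return decorated_function

So every polling call against a stopped instance means one more fault row, which is exactly the kind of growth you saw.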

If you could create a bug on LP for that, I'd very much appreciate it.

All the best,
-jay

Thanks Jay!

-Peter

On Fri, Jul 7, 2017 at 12:50 PM, Jay Pipes <jaypi...@gmail.com> wrote:

    On 07/07/2017 12:30 PM, Peter Doherty wrote:

        Hi,

        If I'm interpreting this correctly, nova-compute is calling
        get_diagnostics on all instances, including ones currently in a
        shutdown state, and then throws an exception and adds an entry
        to the instance_faults table in the database.

        nova-compute logs this message:

        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     executor_callback))
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     executor_callback)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 89, in wrapped
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     payload)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 195, in __exit__
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 72, in wrapped
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     return f(self, context, *args, **kw)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 378, in decorated_function
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 195, in __exit__
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 366, in decorated_function
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4089, in get_diagnostics
        2017-07-07 16:29:46.184 23077 ERROR oslo_messaging.rpc.dispatcher     method='get_diagnostics')

        2017-07-07 16:30:10.017 23077 ERROR oslo_messaging.rpc.dispatcher InstanceInvalidState: Instance 6ab60005-ccbf-4bc2-95ac-7daf31716754 in power_state 4. Cannot get_diagnostics while the instance is in this state.

        I don't think it should be trying to gather diagnostics on shutdown
        instances, and even if it does, it shouldn't just create a
        never-ending stream of errors.
        If anyone has any info on whether this might be a bug that is fixed
        in the latest release, or whether I can turn off this behavior, it
        would be appreciated.


    get_diagnostics() doesn't run automatically. Something is triggering
    a call to get_diagnostics() for each instance on the box (the
    internal compute manager only has a get_diagnostics(instance) call
    that takes one instance at a time). Not sure what is triggering that...
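
    Whatever is polling is most likely going through the os-diagnostics API,
    one instance at a time, e.g. something like the following with
    python-novaclient. This is purely illustrative -- the credentials and
    endpoint are placeholders:

        # Illustrative only: how an external monitoring agent might trigger
        # a get_diagnostics() call per instance. Credentials/URL are fake.
        from novaclient import client

        nova = client.Client('2', 'monitor', 'password', 'project',
                             'http://keystone.example.com:5000/v2.0')
        for server in nova.servers.list():
            # Each call becomes a get_diagnostics RPC to the compute host,
            # which raises InstanceInvalidState for non-running instances.
            print(nova.servers.diagnostics(server))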

    I agree with you that ERRORs shouldn't be spewed into the
    nova-compute logs like the above, though. That should be fixed.
    Would you mind submitting a bug for that on Launchpad, Peter?
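
    One way that kind of noise is usually kept out of the logs is to mark the
    exception as expected for the RPC layer, so oslo.messaging hands it back
    to the caller without dumping an ERROR traceback. The sketch below is just
    an illustration of that mechanism, not necessarily how the fix will land:

        # Sketch: flag InstanceInvalidState as an expected RPC exception so
        # the dispatcher does not log it as ERROR. Where/whether nova applies
        # this to get_diagnostics is not verified here.
        import oslo_messaging as messaging

        from nova import exception


        class ComputeManager(object):  # illustrative fragment only

            @messaging.expected_exceptions(exception.InstanceInvalidState)
            def get_diagnostics(self, context, instance):
                """Retrieve diagnostics for an instance on this host."""
                pass  # body unchanged from the snippet quoted above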

    Thank you!
    -jay





--
Peter Doherty
Systems Engineer, Systems Engineering
Brightcove Inc.
290 Congress St., 4th Floor, Boston, MA 02210



