A few weeks ago someone in the operators channel mentioned problems with ceph-backed nova-compute hosts hitting OSErrors for too many open files.

We have a bug report that sounds very similar:

https://bugs.launchpad.net/nova/+bug/1651526

During the periodic update_available_resource audit, the call to RBD to get disk usage fails with the EMFILE OSError. Because this happens in a periodic task, it doesn't cause any direct operations to fail, but it will cause issues with scheduling: that host is effectively down, yet nothing sets the service to down (disabled).

I had proposed a solution in the bug report that we could automatically disable the service for that host when this happens, and then automatically enable the service again if/when the next periodic task run is successful. Disabling the service would take that host out of contention for scheduling and may also trigger an alarm for the operator to investigate the failure (although if there are EMFILE errors from the ceph cluster I'm guessing alarms should already be going off).

Anyway, I wanted to see how hacky of an idea this is. We already automatically enable/disable the service from the libvirt driver, via an event callback, when the connection to libvirt itself drops. This would be similar, albeit less sophisticated: since it doesn't use an event listening mechanism, we'd have to maintain some local state in memory to know if we need to enable/disable the service again. It also seems pretty hacky/one-offish to handle this just for the RBD failure, but maybe we could handle it generically for any EMFILE error when collecting disk usage in the resource audit?
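
To make the idea concrete, here is a rough sketch of the in-memory state tracking I have in mind. It is not actual nova code: ResourceAuditGuard, set_service_enabled and collect_disk_usage are hypothetical names, standing in for whatever would toggle the compute service record and wrap the RBD disk usage call.

```python
import errno


class ResourceAuditGuard:
    """Hypothetical helper, not part of nova.

    Remembers in memory whether this process disabled the service, so
    that a later successful audit only re-enables what it disabled.
    """

    def __init__(self, set_service_enabled):
        # set_service_enabled: hypothetical callback taking a bool
        # (True to enable the compute service, False to disable it).
        self._set_service_enabled = set_service_enabled
        self._auto_disabled = False

    def run_audit(self, collect_disk_usage):
        """Run one audit pass; collect_disk_usage is a hypothetical callable."""
        try:
            usage = collect_disk_usage()
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                # Only "too many open files" is handled here.
                raise
            if not self._auto_disabled:
                # First failure: take the host out of scheduling.
                self._set_service_enabled(False)
                self._auto_disabled = True
            return None
        if self._auto_disabled:
            # Audit works again and we disabled the service: undo it.
            self._set_service_enabled(True)
            self._auto_disabled = False
        return usage
```

The flag means we only re-enable a service that this process disabled itself, so a host an operator disabled by hand is left alone. The state is lost when nova-compute restarts, which is part of why this feels hacky to me.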

--

Thanks,

Matt Riedemann


