[Openstack-operators] [nova] Automatically disabling compute service on RBD EMFILE failures

Matt Riedemann Sat, 07 Jan 2017 10:10:06 -0800

A few weeks ago someone in the operators channel was talking about problems with ceph-backed nova-compute hitting OSErrors for too many open files.


We have a bug report that sounds very similar:


https://bugs.launchpad.net/nova/+bug/1651526

During the periodic update_available_resource audit, the call to RBD to get disk usage fails with the EMFILE OSError. Since this is in a periodic task it doesn't cause any direct operations to fail, but it will cause issues with scheduling, as that host is really down. However, nothing sets the service to down (disabled).

I had proposed a solution in the bug report that we could automatically disable the service for that host when this happens, and then automatically enable the service again if/when the next periodic task run is successful. Disabling the service would take that host out of contention for scheduling and may also trigger an alarm for the operator to investigate the failure (although if there are EMFILE errors from the ceph cluster I'm guessing alarms should already be going off).

Anyway, I wanted to see how hacky of an idea this is. We already automatically enable/disable the service from the libvirt driver when the connection to libvirt itself drops, via an event callback. This would be similar, albeit less sophisticated: since it's not using an event listening mechanism, we'd have to maintain some local state in memory to know if we need to enable/disable the service again. And it seems pretty hacky/one-offish to handle this just for the RBD failure, but maybe we just generically handle it for any EMFILE error when collecting disk usage in the resource audit?
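
To make the "local state in memory" part concrete, here is a minimal sketch of the generic variant. It is not Nova code: the class name DiskUsageAuditGuard, the set_service_disabled callback and the collect_disk_usage callable are all hypothetical stand-ins for whatever the resource audit and the service API would actually provide. It assumes the failure surfaces as an OSError with errno set to EMFILE, as described in the bug.

```python
import errno


class DiskUsageAuditGuard:
    """Disable the service on EMFILE and re-enable it on the next success.

    Hypothetical helper for illustration only; not part of Nova.
    """

    def __init__(self, set_service_disabled):
        # Assumed callback: set_service_disabled(disabled, reason)
        self._set_service_disabled = set_service_disabled
        # Local in-memory state: True only if *we* disabled the service,
        # so a service disabled by the operator is never re-enabled here.
        self._disabled_by_guard = False

    def run(self, collect_disk_usage):
        """Run one audit pass; return the usage, or None on EMFILE."""
        try:
            usage = collect_disk_usage()
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise  # only the "too many open files" case is handled
            if not self._disabled_by_guard:
                self._set_service_disabled(
                    True, "EMFILE while collecting disk usage")
                self._disabled_by_guard = True
            return None
        # The audit succeeded: undo our own disable, if there was one.
        if self._disabled_by_guard:
            self._set_service_disabled(False, None)
            self._disabled_by_guard = False
        return usage
```

The flag is what keeps this from fighting with the operator: the guard only flips the service back to enabled if it was the one that disabled it, and repeated failures do not issue repeated disable calls. Since the state lives in memory, a restart of the compute service while disabled would lose it, which is one of the reasons this feels hacky.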


--

Thanks,

Matt Riedemann


_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
