Excerpts from Jay Pipes's message of 2015-12-23 10:32:27 -0800:
> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
> > I've been looking into the startup constraints involved when launching
> > Nova services with systemd using Type=notify (which causes systemd to
> > wait for an explicit notification from the service before considering
> > it to be "started"). Some services (e.g., nova-conductor) will happily
> > "start" even if the backing database is currently unavailable (and
> > will enter a retry loop waiting for the database).
> >
> > Other services -- specifically, nova-scheduler -- will block waiting
> > for the database *before* providing systemd with the necessary
> > notification.
> >
> > nova-scheduler blocks because it wants to initialize a list of
> > available aggregates (in scheduler.host_manager.HostManager.__init__),
> > which it gets by calling objects.AggregateList.get_all.
> >
> > Does it make sense to block service startup at this stage? The
> > database disappearing during runtime isn't a hard error -- we will
> > retry and reconnect when it comes back -- so should the same situation
> > at startup be a hard error? As an operator, I am more interested in
> > "did my configuration files parse correctly?" at startup, and would
> > generally prefer the service to start (and permit any dependent
> > services to start) even when the database isn't up (because that's
> > probably a situation of which I am already aware).
>
> If your configuration file parsed correctly but has the wrong database
> connection URI, what good is the service in an active state? It won't
> be able to do anything at all.
>
> This is why I think it's better to have hard checks like this for
> connections on startup and not have services active if they won't be
> able to do anything useful.
>
> > It would be relatively easy to have the scheduler lazy-load the list
> > of aggregates on first reference, rather than at __init__.
>
> Sure, but if the root cause of the issue is a problem due to a
> misconfigured connection string, then that lazy-load will just bomb out
> and the scheduler will be useless anyway. I'd rather have a
> fail-early/fast occur here than a fail-late.
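For concreteness, the lazy-load Lars mentions would look roughly like the
self-contained sketch below. The names and the stand-in fetch are
illustrative, not the actual nova code; the point is only that the database
is first touched on first use, not in __init__, so the service can report
"started" without it:

    def _fetch_aggregates():
        # Stand-in for the objects.AggregateList.get_all() call that
        # currently blocks in HostManager.__init__ when the DB is down.
        return ["agg1", "agg2"]

    class HostManager(object):
        def __init__(self):
            # Nothing is fetched at startup, so startup cannot block
            # on the database.
            self._aggregates = None

        @property
        def aggregates(self):
            if self._aggregates is None:
                # First reference: hit the database now, under whatever
                # retry/reconnect behaviour the service already has at
                # runtime.
                self._aggregates = _fetch_aggregates()
            return self._aggregates

    hm = HostManager()    # returns immediately, no DB access
    print(hm.aggregates)  # DB touched here, on first use

Whether deferring the failure like that is a good idea is the real
question.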
This is entirely philosophical, but we should think about when it is
appropriate to adopt which mode of operation. There are basically two ways
being discussed:

1) Fail fast.
2) Retry forever.

Fail fast pros: immediate feedback for problems, and no zombies to worry
about staying dormant and resurrecting because their configs accidentally
become right again. Much more determinism; debugging is much simpler. To
summarize: it's up and working, or down and not.

Fail fast cons: ripple effects. If you have a database or network blip
while services are starting, you must be aware of all of the downstream
dependencies and trigger them to start again, or have automation which
retries forever, giving up some of the benefits of fail-fast. Circular
dependencies require a special workflow to unroll (service1 aspect A
relies on aspect X of service2, while service2 aspect X relies on aspect B
of service1, which would start fine without service2). To summarize: this
moves the retry-forever problem to orchestration and complicates some
corner cases.

Retry forever pros: circular dependencies are cake. Blips auto-recover.
Bring-up orchestration is simpler (start everything, wait...). To
summarize: this makes orchestration simpler.

Retry forever cons: non-determinism. It's impossible to just look at the
thing from the outside and know whether it is ready to do useful work. It
may also hide intermittent problems, requiring more logging and indicators
in general to allow analysis.

I honestly think any distributed system needs both. The more complex the
dependencies inside the system get, the more I think you have to accept
the cons of retry-forever, even though that compounds the problem of
debugging the system. In designing systems, we should avoid complex
dependencies for this reason.

That said, the scheduler is, IMO, an _extremely_ complex piece of
OpenStack, with upstream and downstream dependencies on several levels
(which is why redesigning it gets debated so often on openstack-dev).
Making it fail fast would complicate the process of bringing up, and
keeping up, an OpenStack cloud. There are probably some benefits I haven't
thought of, but the main benefit you stated is that one would know when
their configuration tooling was wrong and was giving their scheduler the
wrong database information, which is not, IMO, a hard problem to detect
(one can read the config file, after all). But I'm sure we could think of
more if we tried hard. I hope I'm not being too vague here.

I *want* fail-fast on everything. However, I also don't think it can just
be a blanket policy without requiring everybody to deploy complex
orchestration on top.
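To make the two modes concrete, here is a rough, self-contained sketch
(the connect function is a stand-in, not nova code). The interesting
question in both cases is where the service would send its READY=1
notification to systemd relative to the first successful connect:

    import time

    class DatabaseUnavailable(Exception):
        pass

    def connect_to_database(uri):
        # Stand-in for a real driver connect; assume it raises when the
        # backend is unreachable.
        raise DatabaseUnavailable(uri)

    def start_fail_fast(uri):
        # Mode 1: fail fast. A bad connection URI (or a blip during
        # startup) raises immediately, the process exits non-zero, and
        # systemd never sees READY=1 -- the unit is visibly failed.
        conn = connect_to_database(uri)
        # READY=1 would be sent here, only after the connect succeeded,
        # so "active" really means "able to do useful work".
        return conn

    def start_retry_forever(uri, initial_delay=1, max_delay=60):
        # Mode 2: retry forever with capped exponential backoff. Whether
        # to send READY=1 before this loop ("my config parsed") or after
        # it ("I can actually reach my backend") is exactly the trade-off
        # described above.
        delay = initial_delay
        while True:
            try:
                return connect_to_database(uri)
            except DatabaseUnavailable:
                time.sleep(delay)
                delay = min(delay * 2, max_delay)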