On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes <jaypi...@gmail.com> wrote:
> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
>
>> I've been looking into the startup constraints involved when launching
>> Nova services with systemd using Type=notify (which causes systemd to
>> wait for an explicit notification from the service before considering
>> it to be "started"). Some services (e.g., nova-conductor) will happily
>> "start" even if the backing database is currently unavailable (and
>> will enter a retry loop waiting for the database).
>>
>> Other services -- specifically, nova-scheduler -- will block waiting
>> for the database *before* providing systemd with the necessary
>> notification.
>>
>> nova-scheduler blocks because it wants to initialize a list of
>> available aggregates (in scheduler.host_manager.HostManager.__init__),
>> which it gets by calling objects.AggregateList.get_all.
>>
>> Does it make sense to block service startup at this stage? The
>> database disappearing during runtime isn't a hard error -- we will
>> retry and reconnect when it comes back -- so should the same situation
>> at startup be a hard error? As an operator, I am more interested in
>> "did my configuration files parse correctly?" at startup, and would
>> generally prefer the service to start (and permit any dependent
>> services to start) even when the database isn't up (because that's
>> probably a situation of which I am already aware).
>>
>
> If your configuration file parsed correctly but has the wrong database
> connection URI, what good is the service in an active state? It won't be
> able to do anything at all.
>
> This is why I think it's better to have hard checks like for connections
> on startup and not have services active if they won't be able to do
> anything useful.
>

Are you advocating that the scheduler bails out and ceases to run, or that
it doesn't mark itself as active? I am in favour of the second scenario but
not the first.
There are cases where it would be nice to start the scheduler and have it
report "hey, I can't contact the DB" without marking itself active, while
continuing to run and retrying the connection (and reporting again) every
<interval>. It isn't clear which level of "hard check" you're advocating in
your response, and I want to clarify that for the sake of the conversation.

>> It would be relatively easy to have the scheduler lazy-load the list
>> of aggregates on first references, rather than at __init__.
>>
>
> Sure, but if the root cause of the issue is a problem due to misconfigured
> connection string, then that lazy-load will just bomb out and the scheduler
> will be useless anyway. I'd rather have a fail-early/fast occur here than a
> fail-late.
>
> Best,
> -jay
>
>> I'm not
>> familiar enough with the nova code to know if there would be any
>> undesirable implications of this behavior. We're already punting
>> initializing the list of instances to an asynchronous task in order to
>> avoid blocking service startup.
>>
>> Does it make sense to permit nova-scheduler to complete service
>> startup in the absence of the database (and then retry the connection
>> in the background)?
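
For the sake of discussion, here is a rough sketch of the second scenario
(keep running, report the failure, retry on an interval, and only then mark
the service active). This is not nova code: `wait_for_database`, `check_db`,
`notify_ready`, `report` and `DatabaseUnavailable` are all made-up names for
illustration, and the real DB layer would raise its own exception type.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical stand-in for whatever error the DB layer raises."""


def wait_for_database(check_db, notify_ready, report, interval=5.0,
                      max_attempts=None, sleep=time.sleep):
    """Stay alive while the DB is down; mark active only once it answers.

    check_db      -- callable that raises DatabaseUnavailable on failure
    notify_ready  -- callable that tells the service manager we are active
    report        -- callable that takes a log message
    max_attempts  -- None means retry forever

    Returns True once notify_ready() has been called, or False if
    max_attempts ran out first.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            check_db()
        except DatabaseUnavailable as exc:
            # Do not exit and do not mark ourselves active; just say so
            # and try again after the interval.
            report("attempt %d: can't contact the DB: %s" % (attempt, exc))
            sleep(interval)
            continue
        # The DB answered, so it is now honest to report ourselves active.
        notify_ready()
        return True
    return False
```

With Type=notify, `notify_ready` would be whatever sends "READY=1" to
systemd. Passing a finite `max_attempts` gets you something closer to
fail-fast, so the same loop could cover both positions depending on how it
is configured.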
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev