Excerpts from Macdonald-Wallace, Matthew's message of 2014-06-26 04:13:31 -0700:
> Hi all,
>
> I've been working more and more with TripleO recently and whilst it does seem
> to solve a number of problems well, I have found a couple of idiosyncrasies
> that I feel would be easy to address.
>
> My primary concern lies in the fact that os-refresh-config does not run on
> every boot/reboot of a system. Surely a reboot *is* a configuration change
> and therefore we should ensure that the box has come up in the expected state
> with the correct config?
>
> This is easily fixed through the addition of an "@reboot" entry in
> /etc/crontab to run o-r-c or (less easily) by re-designing o-r-c to run as a
> service.
>
> My secondary concern is that through not running os-refresh-config on a
> regular basis by default (i.e. every 15 minutes or something in the same
> style as chef/cfengine/puppet), we leave ourselves exposed to someone trying
> to make a "quick fix" to a production node and taking that node offline the
> next time it reboots because the config was still left as broken owing to a
> lack of updates to HEAT (I'm thinking a "quick change" to allow root access
> via SSH during a major incident that is then left unchanged for months
> because no-one updated HEAT).
>
> There are a number of options to fix this including Modifying
> os-collect-config to auto-run os-refresh-config on a regular basis or setting
> os-refresh-config to be its own service running via upstart or similar that
> triggers every 15 minutes
>
> I'm sure there are other solutions to these problems, however I know from
> experience that claiming this is solved through "education of users" or (more
> severely!) via HR is not a sensible approach to take as by the time you
> realise that your configuration has been changed for the last 24 hours it's
> often too late!
>

So I see two problems highlighted above.

1) We don't re-assert ephemeral state set by o-r-c scripts.

You're right, and we've been talking about it for a while. The right
thing to do is have os-collect-config re-run its command on boot. I don't
think a cron job is the right way to go; we should just have a file in
/var/run that is placed there only on a successful run of the command. If
that file does not exist, then we run the command.

I've just opened this bug in response:

https://bugs.launchpad.net/os-collect-config/+bug/1334804

2) We don't re-assert any state on a regular basis.

One reason we haven't focused on this is that we have a stretch goal of
running with a readonly root partition. It's gotten lost in a lot of the
craziness of "just get it working", but with rebuilds now blowing away
root, and with it anything not on the state drive (/mnt currently),
there's a good chance that this will work relatively well.

Now, since people get root, they can always override the readonly root
and make changes. <golem>we hates thiss!</golem>.

I'm open to ideas; however, os-refresh-config is definitely not the place
to solve this. It is intended as a non-resident command to be called when
it is time to assert state. os-collect-config is intended to gather
configurations and expose them to a command that it runs, and thus should
be the mechanism by which os-refresh-config is run.

I'd like to keep this conversation separate from one in which we discuss
more mechanisms to make os-refresh-config robust. There are a bunch of
things we can do, but I think we should focus just on "how do we re-assert
state?".

Because we're able to say right now that it is only for running when
config changes, we can wave our hands and say it's ok that we restart
everything on every run. As Jan alluded to, that won't work so well if we
run it every 20 minutes.

So, I wonder if we can introduce a config version into os-collect-config.
Basically os-collect-config would keep a version along with its cache.
Whenever a new version is detected, os-collect-config would set a value
in the environment that informs the command "this is a new version of
config". From that, scripts can do things like this:

    if [ -n "$OS_CONFIG_NEW_VERSION" ] ; then
      service X restart
    elif ! service X status ; then
      service X start
    fi

This would lay the groundwork for future abilities to compare old/new so
we can take shortcuts by diffing the two config versions. For instance, if
we look at old vs. new and we don't see that any of the keys we care about
have changed, we can skip restarting.
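
To make the version idea a bit more concrete, here is a rough sketch of what the calling side might do. This is not existing os-collect-config code: `run_with_version`, `config_version`, `STATE_DIR` and its default path are names made up for illustration, and using a checksum of the cached config as the "version" is just one assumption about how a version could be derived.

```shell
#!/bin/sh
# Hypothetical sketch only: these names and paths are illustrative assumptions,
# not part of os-collect-config.

# Where the last successfully applied version is remembered (assumed location).
STATE_DIR="${STATE_DIR:-/var/tmp/occ-version-sketch}"

# Derive a "version" for a config file. Assumption: a checksum of the
# contents is good enough to tell "changed" from "unchanged".
config_version() {
    cksum < "$1"
}

# run_with_version CONFIG_FILE COMMAND [ARGS...]
# Always runs COMMAND, but exports OS_CONFIG_NEW_VERSION only when the config
# differs from the last version that was applied successfully.
run_with_version() {
    config="$1"
    shift
    mkdir -p "$STATE_DIR" || return 1
    new="$(config_version "$config")" || return 1
    old=""
    if [ -f "$STATE_DIR/version" ]; then
        old="$(cat "$STATE_DIR/version")"
    fi
    if [ "$new" != "$old" ]; then
        # New config: tell the command, and record the version only on success
        # so that a failed run is retried as "new" next time.
        ( OS_CONFIG_NEW_VERSION="$new"; export OS_CONFIG_NEW_VERSION; "$@" ) || return $?
        printf '%s\n' "$new" > "$STATE_DIR/version"
    else
        # Unchanged config: a plain re-assert, so scripts can skip restarts.
        ( unset OS_CONFIG_NEW_VERSION; "$@" )
    fi
}
```

With something like this, a periodic call such as `run_with_version /path/to/cached-config os-refresh-config` (the path is a placeholder) would only trigger restarts when the config actually changed. Recording the version only after a successful run mirrors the /var/run marker idea from point 1.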