On Tue, Dec 19, 2006 at 06:23:16AM -0700, Clint Pachl wrote:
> >A pull-only system assumes that the clients actually pull. What if
> >they don't? How do you know when their last successful pull was?
>
> If you implement a "push" system, how do you know if something was
> actually pushed? What if something was pushed, how do you know the
> "pushee" did the right thing with what it was given? This argument
> goes both ways, but is solved simply. A system should report what it
> does after it pushes or pulls. The other end should also report. So
> if the results show someone is pushing but no one is pulling, or vice
> versa, you have a problem. This system could be implemented using
> mail or central syslog.
>
> A good argument for "pull" systems:
> http://www.infrastructures.org/bootstrap/pushpull.shtml
>
> What do others think about push vs pull management systems? What
> tools are you using to implement your push/pull management system?
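The "both ends should report" idea above could be sketched as a wrapper
around whatever command does the actual transfer, so that success and
failure both end up on the central loghost. A toy example (the rsync
invocation and host name in the comment are made up; logger(1) is the
only assumption):

```shell
#!/bin/sh
# report_pull: run the pull command given as arguments and report the
# outcome via syslog either way, so that *silence* from a host - no
# "ok" and no "FAILED" line - is itself an alarm on the loghost.
report_pull() {
    if "$@"; then
        logger -t cfgpull "ok: $*" 2>/dev/null || true
        echo "ok: $*"
    else
        logger -t cfgpull "FAILED: $*" 2>/dev/null || true
        echo "FAILED: $*"
        return 1
    fi
}

# e.g. from cron on each client, against a hypothetical gold server:
# report_pull rsync -a rsync://gold.example.com/configs/ /etc/configs/
```

A report-summarising script on the loghost could then flag any machine
that hasn't logged an "ok" line recently.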
An orthogonal issue, which I don't think has been explicitly mentioned
so far, is whether you make config changes on the central repository
(and replicate them out to the target), or locally on the target system
(and replicate them back to the central repository).

From infrastructures.org:

  We have developed a rule that works very well in practice and saves
  us a lot of heartache: "Never log into a machine to change anything
  on it. Always make the change on the gold server and let the change
  propagate out."

That makes a lot of sense, but enforcing that policy might be
difficult. It matters if you're relying on your gold server for
disaster recovery purposes: if the target machines had changes made
which nobody remembers and which were never reflected on the gold
server, then any freshly-built machines will be non-functional.

You could have some Tripwire-like system monitor periodically for
unauthorised changes, so you can slap the wrist of anyone who breaks
the policy - and, more importantly, bring the central repository back
into sync with what was done. Or you could block root logins entirely,
but then you need to carefully select a list of sudo actions which are
needed for (e.g.) restarting daemons and diagnosing and correcting
common problems.

The alternative is to allow changes to be made on the target machines,
and to check them into a central repository later as a record after the
fact. This makes it harder to make identical changes to a large number
of machines in a cluster, and again the real machines and the
repository can drift out of sync if the procedures are not properly
followed.

A similar issue occurs with init scripts, interface configuration, and
starting and stopping daemons. On many occasions I have come across
problems where a box had been running perfectly for 2 years, but when
it was rebooted for some reason, it stopped working.
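The Tripwire-like monitoring can be reduced to a very small sketch: keep
a checksum manifest of a tree and diff the live tree against it from
cron. This is only an illustration, not a substitute for Tripwire or
mtree(8) (no hashes of permissions or ownership, and filenames
containing whitespace would break the xargs step); all paths are
examples:

```shell
#!/bin/sh
# make_manifest: record a cksum manifest of every file under a tree.
#   $1 = tree to scan, $2 = manifest file to write
make_manifest() {
    (cd "$1" && find . -type f | sort | xargs cksum) > "$2"
}

# check_manifest: re-scan the tree and diff against the saved manifest.
# Prints the differences and returns non-zero if anything changed.
#   $1 = tree to scan, $2 = manifest file to compare against
check_manifest() {
    (cd "$1" && find . -type f | sort | xargs cksum) | diff "$2" -
}
```

A nightly cron job could run check_manifest over /etc and pipe any
output to mail or logger, tying back into the reporting scheme from the
quoted message.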
It turned out this was because someone had made a manual change - such
as starting some daemon, perhaps with particular command-line flags, or
changing firewall rules - but when the box rebooted it did not come up
the same way at startup. Since the original change may have been made a
long time ago, by someone who has long since left, you can end up with
emergency situations which are difficult to fix quickly.

This problem seems to be more difficult to solve. Ideally there would
be a single interface through which you performed any sysadmin action,
such as configuring an interface or starting a daemon, which kept a
persistent record of it and performed the same action again at startup.
That would mean, for example, being forbidden to use 'ifconfig'
directly, but being allowed to change /etc/hostname.* and run an rc
script to apply the changes.

This is more difficult with rc.conf: you would need a supervisor script
which noticed (say) that run_foo="NO" had changed to run_foo="YES", or
vice versa, and performed the appropriate actions. It might actually be
easier with something like daemontools, which has separate control
files for each daemon. I've never seen a centralised management system
which works directly in this way, but I'd love to have one.

Finally, a similar problem occurs when deciding how to do configuration
management of, say, Cisco routers. However, your hand is forced a bit
more there: you generally can't just push a new config out to each box,
because to make the changes active you'd need to reboot it (a Cisco
doesn't have the ability to take a diff between its current active
state and a target state, and perform only the changes necessary to
bring it up to that state).

So often you end up having to make changes directly on the target
device, line by line, and then tftp'ing the updated configs back to a
central repository. That is, the central repository is not the place
where changes are made, but just a record of changes which were made.
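The rc.conf supervisor idea could look something like the following: diff
the new rc.conf against a snapshot of the last one that was applied, and
act on any run_foo variables that flipped. This is only a sketch of the
detection step (the paths and the run_* naming convention are
illustrative, it only echoes what it would do, and a real version would
have to hook into the actual rc machinery to start and stop daemons):

```shell
#!/bin/sh
# apply_rcconf: compare $1 (the new rc.conf) against the snapshot taken
# the last time we ran, report which run_foo variables changed, then
# save the new file as the snapshot for next time.
SNAP=/var/db/rc.conf.snap

apply_rcconf() {
    conf=$1
    # Lines that are new or changed since the last snapshot:
    diff "${SNAP}" "${conf}" 2>/dev/null | sed -n 's/^> //p' |
    while IFS='=' read -r var val; do
        case ${var} in
        run_*)
            name=${var#run_}
            case ${val} in
            '"YES"'|YES) echo "would start ${name}" ;;
            '"NO"'|NO)   echo "would stop ${name}" ;;
            esac
            ;;
        esac
    done
    cp "${conf}" "${SNAP}"
}
```

Run at boot with an empty snapshot, the same script would start
everything marked YES, which is exactly the "persistent record replayed
at startup" property described above.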
Again, you can get into problems with procedures not being followed and the repository coming out of sync with reality. Regards, Brian.