On 2014-11-18 at 11:50 -1000, Mathew Snyder wrote:
> This leads to my question to the list: those of you who have cloud
> environments based on VMware solutions, how do you keep time in sync? What
> issues have you encountered and how did you solve those problems? What can
> you recommend for a virtualized NTP solution?

My only exposure to VMware is via desktop products and vSphere on a tiny
dev cluster, so I can't speak to it in particular for production.

Everything I've seen says that whether NTP inside a VM can work depends
upon the kernel version, the CPU's support for the various timers, and
the driver versions in play.

I've also seen, if memory serves, that configuring a driftfile is a
bad idea inside VMs.  My recollection (possibly flawed) is that with a
driftfile configured, ntpd's back-off lets it poll for time sync far
less often.  That makes sense on bare-metal servers, where the drift is
fairly stable.  It's definitely not useful if your VM migrates from a
host drifting in one direction to a host drifting in the other, while
ntpd merrily keeps compensating based on what had been a stable drift.

Not directly VM-related, except insofar as VMs are used to create
clusters distinct from other resources: isolation handling.

What I do make sure of, in production, is that if there's a network
isolation event, then the machines within the cluster will stay
synchronized.  It's bad enough to have network outages, without also
having the internals of a cluster fall apart because they then started
disagreeing on the time.  It doesn't matter if the time agrees with the
outside world, as long as it's internally consistent.

To do this, make sure that the highest-level stratum boxes within your
cluster all peer with each other, and have a mechanism configured to
use local time sources (clocks + drift) as a stratum-12 reference.
This _used_ to be done with `server 127.127.1.0` and `fudge 127.127.1.0
stratum 12`, but there's a newer mechanism which is supposed to be
used these days, because of some deficiency in this approach.  I don't
remember the details, only that I couldn't get the new method working at
all, so went back to this.
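For the record, the newer mechanism is almost certainly orphan mode
(`tos orphan`), under which the peers elect one of themselves as a
pseudo-reference when all upstream sources are lost, instead of each
free-running on its own local clock.  A sketch of both, with placeholder
hostnames:

```
# ntp.conf on each top-tier box in the cluster (sketch; hostnames are
# placeholders).  Top-tier servers peer with each other:
peer ntp1.cluster.internal
peer ntp2.cluster.internal

# Old-style fallback: undisciplined local clock, fudged to stratum 12.
server 127.127.1.0
fudge 127.127.1.0 stratum 12

# Newer alternative (ntpd 4.2.2+): orphan mode.  When no source below
# stratum 12 is reachable, the peers elect one pseudo-server among
# themselves rather than each drifting on 127.127.1.0 independently.
#tos orphan 12
```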

So, in normal operation you have some set of five or so NTP servers
talking to the outside world, all at roughly stratum 3.  With five, you
can lose one, have a false-ticker, and still have three solid time
sources.  The rest of your machines are then at stratum 4.  When
network isolation occurs, those front-line boxes drop to stratum 13
(one beneath the local clock fudged to stratum 12), and the rest to
stratum 14.  Your monitoring can pick up on this change; just make sure
that 13/14 is still not "Critical", because at that point it's
capital-B Bad, but the rest of your monitoring will be screaming too,
and you don't need to be told that NTP is correctly falling back
defensively to try to hold your resources together.
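A monitoring check along those lines can be as simple as pulling the
stratum out of `ntpq -c rv` and treating the fallback strata as a
Warning rather than Critical.  A sketch -- the function names, the
thresholds, and the nagios-style exit codes are my own assumptions:

```python
import re

def current_stratum(rv_output: str) -> int:
    """Extract the stratum from the output of `ntpq -c rv`."""
    m = re.search(r"stratum=(\d+)", rv_output)
    if m is None:
        raise ValueError("no stratum found in ntpq output")
    return int(m.group(1))

def classify(stratum: int) -> tuple[int, str]:
    """Nagios-style status: the defensive fallback strata are only a
    Warning -- the rest of your monitoring is already screaming."""
    if stratum <= 5:
        return 0, "OK"        # normal operation, synced upstream
    if stratum <= 14:
        return 1, "WARNING"   # fallen back to internal-only sync
    return 2, "CRITICAL"     # stratum 15/16: unsynchronized
```

Feed it the captured output of `ntpq -c rv` and exit with the returned
status code; for example, a stratum-13 box classifies as
`(1, "WARNING")`.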

In my experience, this works well enough even when the "servers" are all
VMs.

Since doing this, I haven't (yet) seen any problems forcing me back to
ntpdate-from-cron.
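That fallback is nothing more than a crontab entry.  A sketch, assuming
ntpdate is installed and with a placeholder server name:

```
# /etc/cron.d/ntpdate-fallback -- sketch only.
# Step/slew the clock every 10 minutes; -u uses an unprivileged source
# port, which gets through some firewalls that block port 123 replies.
*/10 * * * * root ntpdate -u ntp1.cluster.internal >/dev/null 2>&1
```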

So:
 * avoid free-wheeling on hardware-dependent data
 * ensure you can maintain internal sync
 * think about a VM cluster as an isolatable unit and figure out what
   your fallback position needs to be when you lose external access

Be ready to switch approaches when it turns out that what works in one
environment doesn't work in another; ntpdate-from-cron (or
ntpd-from-cron) is a good fallback to have available, as "better than
no sync".  Jumping through hoops to satisfy vendors and prove problems
lie elsewhere is nothing new; just make sure you can jump back when the
time comes.

-Phil
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
