On 2014-11-18 at 11:50 -1000, Mathew Snyder wrote:
> This leads to my question to the list: those of you who have cloud
> environments based on VMware solutions, how do you keep time in sync? What
> issues have you encountered and how did you solve those problems? What can
> you recommend for a virtualized NTP solution?
My only exposure to VMware is via desktop products and vSphere on a tiny dev cluster, so I can't speak to that in particular for production. Everything I've seen says that whether NTP inside a VM can work depends upon the version of the kernel, the CPU support for various timers, and the versions of the drivers involved.

I've also seen, if memory serves, that having drift-files configured is a bad idea inside VMs. My recollection (possibly flawed) is that NTP's back-off for how often it checks for time sync can go a lot further with a drift-file configured. This makes sense with bare-metal servers, where the drift is fairly stable. It is definitely not useful if your VM moves from a box with drift in one direction to a box with drift in the other, while ntpd merrily keeps compensating based on what had been a stable drift.

Not directly VM-related, except insofar as VMs are used to create clusters distinct from other resources: isolation handling. What I do make sure of, in production, is that if there's a network isolation event, the machines within the cluster will stay synchronized. It's bad enough to have network outages without also having the internals of a cluster fall apart because they start disagreeing on the time. It doesn't matter whether the time agrees with the outside world, as long as it's internally consistent.

To do this, make sure that the top-level boxes (those with the lowest stratum number) within your cluster all peer with each other, and have a mechanism configured to use local time sources (clocks + drift) to set themselves at stratum 12. This _used_ to be done with `server 127.127.1.0` and `fudge 127.127.1.0 stratum 12`, but there's some newer mechanism which is supposed to be used these days, because of some deficiency of this approach. I don't remember the details, only that I couldn't get the new method working at all, so I went back to this.
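A minimal ntp.conf sketch of the above, for one front-line box. The hostnames are placeholders, and the commented-out `tos orphan` line is my guess at the "newer mechanism" (ntpd's orphan mode); the `server 127.127.1.0` / `fudge` pair is the classic approach described above:

```
# /etc/ntp.conf sketch for a front-line cluster time server.
# Hostnames below are hypothetical placeholders.

# Upstream sources, reachable only while the network is up:
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# Peer with the other front-line boxes so the cluster stays
# internally consistent during a network isolation event:
peer ntp2.internal.example.com
peer ntp3.internal.example.com

# Classic fallback: undisciplined local clock at stratum 12,
# used only when all real sources are unreachable.
server 127.127.1.0
fudge 127.127.1.0 stratum 12

# Newer alternative (likely the replacement mechanism referred
# to above): orphan mode at the same fallback stratum.
# tos orphan 12
```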
So, in normal operation you have some set of 5 or so NTP servers talking to the outside world, all at roughly stratum 3. With 5, you can lose one, have a false-ticker, and still have three solid time sources. The rest of your machines are then at stratum 4.

When network isolation occurs, those front-line boxes drop to stratum 12, and the rest to stratum 13. Your monitoring can pick up on this change; just make sure that 12/13 is still not "Critical", because at this point it's capital-B Bad, but the rest of your monitoring will be screaming too and you don't need to be told that NTP is correctly falling back defensively to try to hold your resources together.

In my experience, this works well enough even when the "servers" are all VMs. Since doing this, I haven't (yet) seen any problems forcing me back to ntpdate-from-cron.

So:

* avoid free-wheeling on hardware-dependent data
* ensure you can maintain internal sync
* think about a VM cluster as an isolateable unit and figure out what your fallback position needs to be when you lose external access

Be ready to switch approaches when it turns out that what works in one environment doesn't work in another; ntpdate-from-cron (or ntp-from-cron) is a good approach to have available to switch to, as "better than no sync". Jumping through hoops to satisfy vendors and prove problems lie elsewhere is nothing new; just make sure you can jump back when the time comes.

-Phil
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
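The monitoring rule above can be sketched as a small Python check. This is an illustration, not a standard: the threshold choices and the parsing of `ntpq -c rv` output (which reports fields like `stratum=3`) are my assumptions, and you'd tune them to your own fleet:

```python
# Sketch of the monitoring logic: treat the deliberate isolation
# fallback (stratum 12/13) as a warning rather than a critical page,
# since other alerts will already cover the outage itself.
# Thresholds and parsing are assumptions, not any monitoring standard.
import re


def parse_stratum(ntpq_rv_output: str) -> int:
    """Extract the stratum from `ntpq -c rv` style output."""
    m = re.search(r"stratum=(\d+)", ntpq_rv_output)
    if not m:
        raise ValueError("no stratum found in ntpq output")
    return int(m.group(1))


def classify(stratum: int) -> str:
    """Map a stratum to a monitoring severity."""
    if stratum <= 4:         # normal operation: 3 front-line, 4 behind
        return "OK"
    if stratum in (12, 13):  # intentional isolation fallback: Bad, but expected
        return "Warning"
    return "Critical"        # stratum 16 = unsynchronized, or something odd


sample = "associd=0 status=0615, leap_none, sync_ntp, stratum=3, precision=-23"
print(classify(parse_stratum(sample)))  # -> OK
```

A host stuck at stratum 16 (ntpd's "unsynchronized" value) still pages; the cluster's coordinated drop to 12/13 does not.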
