Hi all, sorry for the delay in posting this.
Introduction: At LPC 2010, we discussed (once more) that a key feature for Pacemaker in 2011 would be improved support for multi-site clusters. By "multi-site", we mean two (or more) sites, each running a local cluster, with some higher-level entity coordinating fail-over across them (as opposed to "stretched" clusters, where a single cluster might span a whole campus or city). Typically, such multi-site environments are also too far apart to support synchronous communication/replication.

There are several aspects to this that we discussed; Andrew and I first described and wrote this out a few years ago, so I hope he can remember the rest ;-)

"Tokens" are, essentially, cluster-wide attributes (similar to node attributes, just for the whole partition). Via dependencies (similar to rsc_location), one can specify that certain resources require a specific token to be set before being started (and, vice versa, need to be stopped if the token is cleared). You could also think of our current "quorum" as a special, cluster-wide token that is granted in case of node majority.

A token would thus be similar to a "site quorum", i.e., the permission to manage/own the resources associated with that site, which would be recorded in a resource dependency. (It would probably make a lot of sense if this supported resource sets, so one can easily list all the affected resources; also, some resources, such as master/slave ones, may tie their role to token ownership.)

These tokens can be granted/revoked either manually (which I actually expect will be the default for the classic enterprise clusters), or via an automated mechanism described further below.

Another aspect of site fail-over is recovery speed. A site can only activate the resources safely if it can be sure that the other site has deactivated them. Waiting for them to shut down "cleanly" could incur very high latency (think "cascaded stop delays"), so it would be desirable if this could be short-circuited.
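To make the intended token semantics concrete, here is a small Python sketch of the behaviour described above. All names here (SiteCluster, grant, revoke, and so on) are invented purely for illustration and are not actual Pacemaker API; the point is just the rule: a resource tied to a token may only run while the token is set, and must be stopped when it is cleared.

```python
# Illustrative model of cluster-wide "tokens" (hypothetical, not Pacemaker API):
# a token is a cluster-wide attribute; resources that depend on it may only
# run while the token is granted, and must be stopped when it is revoked.

class SiteCluster:
    def __init__(self, name):
        self.name = name
        self.tokens = set()    # tokens currently granted to this site
        self.deps = {}         # token -> set of resources requiring it
        self.running = set()   # resources currently active

    def require_token(self, resource, token):
        """Declare that 'resource' needs 'token' (an rsc_location-style dependency)."""
        self.deps.setdefault(token, set()).add(resource)

    def grant(self, token):
        """Granting the token allows (here: starts) all dependent resources."""
        self.tokens.add(token)
        self.running |= self.deps.get(token, set())

    def revoke(self, token):
        """Clearing the token forces all dependent resources to stop."""
        self.tokens.discard(token)
        self.running -= self.deps.get(token, set())


site_a = SiteCluster("site-a")
site_a.require_token("db", "web-stack")
site_a.require_token("webserver", "web-stack")

site_a.grant("web-stack")
print(sorted(site_a.running))   # ['db', 'webserver']

site_a.revoke("web-stack")
print(sorted(site_a.running))   # []
```

Note that "quorum" fits the same model: it is just one special token that the cluster grants itself on node majority.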
The idea Andrew and I came up with was to introduce the concept of a "dead man" dependency: if the origin goes away, nodes which host dependent resources are fenced, immensely speeding up recovery. It seems to make the most sense to expose this as an attribute on the various dependencies we already have, to make it generally available. (It may also be something admins want to temporarily disable; for a graceful switch-over, they may not always want to trigger the dead-man process.)

The next bit is what we called the "Cluster Token Registry", for those scenarios where the site switch is supposed to be automatic (instead of the admin revoking the token somewhere, waiting for everything to stop, and then granting it on the desired site). The participating clusters would run a daemon/service that would connect to the others, exchange information on their connectivity details (conceivably, not mere majority is relevant here, but also current ownership, admin weights, time of day, capacity ...), and vote on which site gets which token(s). A token would only be granted to a site once it can be sure that the token has been relinquished by the previous owner, which would need to be implemented via a timer in most scenarios (see the dead-man flag above). Further, sites which lose the vote (either explicitly, or implicitly by being disconnected from the voting body) would obviously need to perform said release after a sane time-out (to protect against brief connection issues).

A final component is an idea to ease administration and management of such environments. The dependencies allow an automated tool to identify which resources are affected by a given token, and their configuration could be automatically replicated (and possibly transformed) between sites, to ensure that all sites have an up-to-date configuration of the relevant resources.
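As a rough illustration of the grant/release timing just described (the function names and the fixed timeout values are invented for this sketch; a real registry would presumably negotiate them between sites): a site that loses the vote, or is cut off from the voting body, must release the token after a time-out, and the winner may only grant itself the token once that time-out has safely expired, since it cannot assume the loser stopped its resources cleanly (the dead-man fencing handles the forced stop on the losing side).

```python
# Hypothetical sketch of the Cluster Token Registry timing: the losing
# (or disconnected) site must release the token within RELEASE_TIMEOUT;
# the winning site may only take ownership after that timeout plus some
# slack, so the two ownership periods can never overlap.

RELEASE_TIMEOUT = 10   # seconds the old owner has to relinquish the token
SAFETY_MARGIN = 2      # extra slack, e.g. for clock skew between sites

def loser_release_deadline(vote_lost_at):
    """Latest time by which the losing site must have released the token."""
    return vote_lost_at + RELEASE_TIMEOUT

def winner_acquire_time(vote_won_at):
    """Earliest time at which the winner may safely grant itself the token."""
    return vote_won_at + RELEASE_TIMEOUT + SAFETY_MARGIN

# Both sites observe the vote result at (roughly) t=100; the winner's
# earliest acquire time always lies after the loser's release deadline.
t_vote = 100
assert winner_acquire_time(t_vote) > loser_release_deadline(t_vote)
print(loser_release_deadline(t_vote), winner_acquire_time(t_vote))  # 110 112
```

The same deadline also protects against brief connection issues: a site that reconnects before its release deadline has not yet given anything up, so nothing needs to fail over.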
This would be handled by yet another extension, a CIB replicator service (which would either run permanently, or only when the admin explicitly invokes it). Conceivably, the "inactive" resources may not even be present in the active CIB of sites which don't own the token (and be inserted once token ownership is established); this may be an interesting (optional) feature to keep CIB sizes under control.

Andrew, is that about what we discussed? Any comments from anyone else? Did I capture what we spoke about at LPC?

Regards,
    Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker