But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5.  As
long as you don't go with 1 you should be okay.

In a recent panel discussion, one member strongly advocated 5 as the
minimum number of MONs for a large Ceph deployment. Large in this case
meant PBs of storage.

For a Ceph cluster with 100s of OSDs and 100s of TB across multiple
racks (and therefore many paths involved), is 5 x MONs a good rule of
thumb, or are three sufficient?

Whoever stated that was probably right. I don't often like to speak about what works best for (really) large deployments as I don't often see them. In theory, 5 monitors will fare better than 3 for 100s of OSDs.

As far as the monitors are concerned, this is mostly because 5 monitors can serve more maps concurrently than 3 can. I don't think we have tests to back my reasoning here, but I don't think the cluster's workload or size has that much of an impact on the number of monitors you need. It is a technical detail, but every message an OSD sends to a monitor that would trigger an update to a map is *always* forwarded to the leader monitor. This means that regardless of how many monitors you have, the same monitor always ends up dealing with the map updates, which puts a cap on map update throughput -- this is usually not that big of a deal, and knobs may be adjusted if need be.

On the other hand, having 5 monitors instead of 3 means that you'll be able to spread OSD connections across more monitors. Even if updates are forwarded to the leader, the load is more spread out connection-wise -- the message is forwarded by the monitor the OSD connects to, and that monitor acts as a proxy in replying to the OSD, so there's less hammering on the leader directly.

But the point where this may actually make a real difference is in serving osdmap updates. The OSDs need those updates. Even considering that OSDs share maps amongst themselves, they still need to get them from somewhere -- and that somewhere is the monitor cluster. If you have 100s of OSDs connected to just 3 monitors, each monitor will end up serving a lot of reads (sending map updates to OSDs) while also dealing with messages that trigger map updates (which are in turn forwarded to the leader). Given that any client (OSDs included) connects to a monitor at random at start and maintains that connection for a while, a "rule of thumb" would tell us that the leader would be responsible for serving 1/3 of all map reads while still handling map updates. Having 5 monitors would reduce this load to 1/5.
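To put rough numbers on that rule of thumb, here is a minimal sketch in Python. It is not anything taken from Ceph itself: it simply assumes clients pick a monitor uniformly at random and stay connected, and the function names and the figure of 300 OSDs are made up for illustration.

```python
from fractions import Fraction


def expected_read_share(num_monitors: int) -> Fraction:
    """Expected fraction of map reads served by any single monitor
    (the leader included), assuming clients connect to a monitor
    chosen uniformly at random. This is a simplifying assumption."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return Fraction(1, num_monitors)


def expected_osds_per_monitor(num_osds: int, num_monitors: int) -> Fraction:
    """Expected number of OSD connections landing on one monitor."""
    return num_osds * expected_read_share(num_monitors)


if __name__ == "__main__":
    hypothetical_osds = 300  # illustrative cluster size, not a measurement
    for mons in (3, 5, 7):
        share = expected_read_share(mons)
        per_mon = float(expected_osds_per_monitor(hypothetical_osds, mons))
        print(f"{mons} monitors: leader serves about {share} of map reads, "
              f"roughly {per_mon:.0f} of {hypothetical_osds} OSDs")
```

This only models the connection spread. It says nothing about the update path, which still goes through the leader regardless of the monitor count.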

However, I don't know of a good indicator of whether a given cluster should go with 5 monitors instead of 3, or 7 instead of 5. I don't think there are many clusters running 7 monitors, but it may be that for even larger clusters, having 5 or 7 monitors serving updates makes up for the increased number of messages required to commit an update -- keep in mind that, due to the nature of Paxos, one always needs an ack for an update from at least (N+1)/2 monitors. Again, this cuts both ways: we may have more messages being passed around, but given that each monitor is under a lower load, we may even get to them faster.
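The (N+1)/2 figure can be tabulated with a small Python sketch. One assumption on my part: for even N, where (N+1)/2 is not a whole number, I use the smallest strict majority, floor(N/2) + 1, which matches (N+1)/2 for odd N. The function names are purely illustrative.

```python
def acks_needed(num_monitors: int) -> int:
    """Smallest strict majority of num_monitors.

    Equals (N+1)/2 for odd N; for even N this is floor(N/2) + 1,
    which is an assumption about how to read the formula."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def failures_tolerated(num_monitors: int) -> int:
    """How many monitors can be lost while a majority is still reachable."""
    return num_monitors - acks_needed(num_monitors)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} monitors: {acks_needed(n)} acks needed, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Under that assumption, 4 monitors tolerate no more failures than 3 do, and 2 tolerate none at all, which is the arithmetic behind preferring 3 or 5 over 2 or 4.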

I think I went a bit off track.

Let me know if this led to further confusion instead.


