> understood. So stretched pools need also a stretched ceph cluster.

The docs are a bit confusing: they refer to a stretched pool in a cluster
that is not explicitly in stretch mode. We should probably not use "stretch"
to describe anything that isn't a formal stretch mode cluster, as enabling
stretch mode affects behavior in certain ways.

> So a simple setup would be with replication size 3 for replicated pools and
> 3 or more ceph monitors, ...

We want at least 2x mons per site + a tiebreaker, so that not only can we
form quorum, but the cluster can keep operating if one mon crashes.

>>
>> The reason behind having 3 datacenters is because we are having alot of
>> k8s clusters which also need to have quorum, if i distribute the etcd nodes
>> across 3 datacenters, the outage of one datacenter will keep the k8s
>> cluster operational.

I think with K8s you could employ a strategy similar to Ceph's stretch mode:

* K8s workers and OSDs at *2* sites
* 2x K8s control nodes + 1x Ceph mon at a tiebreaker site, which could even
  be just cloud VMs.

That way the Ceph pools would only need R4 instead of R6.

>>
>> The latency between the datacenters is most likely very low (we can not
>> measure since i am in planning stages.

I know of one commercial Ceph support organization that dictates < 10 ms RTT
between OSD sites and < 100 ms RTT to a tiebreaker mon. These thresholds
might inform decisions and predictions. A quick web search asserts:

> A rule of thumb is that RTT increases by approximately 1 millisecond (ms) for
> every 60 miles of distance.

The nuance of formal stretch mode is the difference in how mon quorum is
managed (using connectivity/reachability scores) and the automatic management
of pools' min_size so that the cluster stays operable when an entire DC goes
down. With a conventional cluster, if you have, say, 2 mons in one DC and 3 in
the other, loss of the second DC results in an inoperable cluster unless you
take drastic manual action.

>> The connections between the datacenters are on dark fibres connected
>> through modules directly in the Top of the Rack switches, compared to the
>> local connectivity it will be almost the same.
>> We have an existing similar setup between 2 datacenters where the WAN
>> connection add below 1ms latency.

The same two DCs as would be in operation here?

>>
>> On "exceptionally large nodes", those are all identical servers, 3 per
>> datacenter with 16 x 3.84 TB nvme disks, 128 AMD Epyc cores (on 2 sockets)
>> and 1.5 TB memory, i would not count them as "exceptionally large".

Gotcha. Is this a converged cluster? That's an excess of cores and RAM just
for Ceph if not.

>>
>> I will read up a little more on asych replication.

RGW: multisite
RBD: rbd-mirror
CephFS: mirroring, which is fairly recent

Part of the equation is having the clients be able to access the data,
including whether you're solving for zero data *unavailability* vs zero data
*loss*. The latter is much easier than the former.
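
To put rough numbers on the mon quorum points above, here is a toy quorum
check in Python; the site names and monitor counts are illustrative
assumptions, not your actual layout. It compares a 2+2+1 layout (two mons per
data site plus a tiebreaker) with a conventional 2+3 layout when a whole site
drops out:

    # Toy quorum check: the mons need a strict majority of the full monmap,
    # regardless of which sites the survivors are in.
    # Site names and counts are assumptions for illustration only.
    def has_quorum(mons_per_site, failed_sites):
        total = sum(mons_per_site.values())
        alive = sum(n for site, n in mons_per_site.items()
                    if site not in failed_sites)
        return alive > total // 2

    stretch_like = {"dc1": 2, "dc2": 2, "tie": 1}   # 2 + 2 + 1 = 5 mons
    conventional = {"dc1": 2, "dc2": 3}             # 2 + 3     = 5 mons

    for name, layout in (("2+2+1", stretch_like), ("2+3", conventional)):
        for lost in ("dc1", "dc2"):
            ok = has_quorum(layout, {lost})
            print(f"{name} loses {lost}: {'quorum' if ok else 'NO quorum'}")

    # Expected:
    #   2+2+1 loses dc1: quorum     (3 of 5 mons left)
    #   2+2+1 loses dc2: quorum     (3 of 5 mons left)
    #   2+3 loses dc1: quorum       (3 of 5 mons left)
    #   2+3 loses dc2: NO quorum    (only 2 of 5 mons left)

That last case is the "drastic manual action": you would have to extract and
edit the monmap by hand before the two surviving mons could form a quorum.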
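Similarly, the R4 vs R6 difference is just replica-count overhead;
back-of-the-envelope, with a made-up 100 TB of user data:

    # Raw capacity needed for replicated pools, per the R4-vs-R6 comparison.
    # The 100 TB of user data is a made-up figure for illustration.
    usable_tb = 100

    for label, size in (("R4 (2 data sites x 2 copies)", 4),
                        ("R6 (3 data sites x 2 copies)", 6)):
        print(f"{label}: {usable_tb * size} TB raw, {size}x overhead")

    # R4 (2 data sites x 2 copies): 400 TB raw, 4x overhead
    # R6 (3 data sites x 2 copies): 600 TB raw, 6x overhead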
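And turning those RTT thresholds into distance with the quoted rule of thumb
(very rough, since a real fibre path is longer than the straight line):

    # ~1 ms of RTT per 60 miles, per the rule of thumb quoted above.
    # Treat the results as optimistic upper bounds on site separation.
    def miles_for_rtt(rtt_ms, ms_per_60_miles=1.0):
        return rtt_ms / ms_per_60_miles * 60

    print(f"< 10 ms between OSD sites  -> ~{miles_for_rtt(10):.0f} miles")
    print(f"< 100 ms to tiebreaker mon -> ~{miles_for_rtt(100):.0f} miles")
    # ~600 miles between OSD sites, ~6000 miles to the tiebreaker mon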
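Finally, the automatic min_size handling: as I read the stretch mode docs,
stretch pools run at size 4 / min_size 2, and min_size is dropped to 1 while
an entire DC is down. A toy availability check shows why that matters:

    # A PG is writeable only if at least min_size replicas are up.
    def pg_writeable(copies_up, min_size):
        return copies_up >= min_size

    # Stretch pool: size 4 = 2 copies per site, min_size 2.
    print(pg_writeable(copies_up=2, min_size=2))  # True: one DC down, 2 copies left
    print(pg_writeable(copies_up=1, min_size=2))  # False: lose one more OSD and the
                                                  # PG goes inactive at min_size 2
    print(pg_writeable(copies_up=1, min_size=1))  # True: the degraded-mode min_size
                                                  # drop keeps I/O flowing

Without that drop, a single additional OSD failure in the surviving site would
take PGs inactive even though a copy of the data is still there.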