> understood. So stretched pools need also a stretched ceph cluster.

The docs are a bit confusing: they refer to a "stretched pool" in a cluster 
that is not explicitly in stretch mode.  We should probably not use "stretch" 
to describe anything that isn't a formal stretch mode cluster, since enabling 
stretch mode changes behavior in specific ways (connectivity-based mon 
elections, cross-site peering requirements, and automatic pool sizing among 
them).

> So a simple setup would be with replication size 3 for replicated pools and
> 3 or more ceph monitors, ...

We want at least 2x mons per site + a tiebreaker, so that the cluster can not 
only form quorum but also keep operating if a single mon crashes.
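
A minimal sketch of the mon placement for that layout, assuming five mons 
named a..e and made-up datacenter bucket names:

    ceph mon set election_strategy connectivity   # required for stretch mode
    ceph mon set_location a datacenter=dc1
    ceph mon set_location b datacenter=dc1
    ceph mon set_location c datacenter=dc2
    ceph mon set_location d datacenter=dc2
    ceph mon set_location e datacenter=dc3        # tiebreaker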

>> 
>> The reason behind having 3 datacenters is because we are having alot of
>> k8s clusters which also need to have quorum, if i distribute the etcd nodes
>> across 3 datacenters, the outage of one datacenter will keep the k8s
>> cluster operational.

I think with K8s you could employ a strategy similar to Ceph’s stretch mode:

* K8s workers and OSDs at *2* sites
* 2x K8s control nodes + 1x Ceph mon at a tiebreaker site, which could even be 
just cloud VMs.

That way the Ceph pools would only need R4 instead of R6.
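
For reference, the CRUSH rule that stretch mode wants looks roughly like the 
example in the Ceph docs, placing two replicas per site (dc1/dc2 are 
placeholder bucket names):

    rule stretch_rule {
            id 1
            type replicated
            step take dc1
            step chooseleaf firstn 2 type host
            step emit
            step take dc2
            step chooseleaf firstn 2 type host
            step emit
    }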


>> 
>> The latency between the datacenters is most likely very low (we can not
>> measure since i am in planning stages.

I know of one commercial Ceph support organization that dictates < 10ms RTT 
between OSD sites and < 100ms RTT to the tiebreaker mon.  Those thresholds 
are a reasonable yardstick for your planning.
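
Once any two of the sites are reachable, even a plain ping gives a usable 
RTT figure (hostname is a placeholder):

    ping -c 100 -i 0.2 peer-dc-host   # check avg/max RTT over 100 probes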

A quick web search asserts:

>A rule of thumb is that RTT increases by approximately 1 millisecond (ms) for 
>every 60 miles of distance.
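
Taken at face value, the 10ms OSD-site budget works out to roughly 600 miles 
(~1000 km) of path, and the 100ms tiebreaker budget to ~6000 miles, though 
real routing and equipment add latency on top of raw distance.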

The nuance of formal stretch mode is the difference in how mon quorum is 
managed, using connectivity (reachability) scores, and the automatic 
management of pools' min_size to keep the cluster operable when an entire DC 
goes down.  With a conventional cluster, if you have, say, 2 mons in one DC 
and 3 in the other, loss of the second DC leaves the cluster inoperable 
unless you take drastic manual action.
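
Once mon locations and the rule are in place, enabling it is a one-liner 
(names follow the sketches above; "e" is the tiebreaker mon):

    ceph mon enable_stretch_mode e stretch_rule datacenter

Pools then run at size 4 / min_size 2, and if an entire DC drops out the 
surviving site enters degraded stretch mode and min_size is lowered to 1 
automatically.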

>> The connections between the datacenters are on dark fibres connected
>> through modules directly in the Top of the Rack switches, compared to the
>> local connectivity it will be almost the same.
>> We have an existing similar setup between 2 datacenters where the WAN
>> connection add below 1ms latency.

The same two DCs as would be in operation here?

>> 
>> On "exceptionally large nodes", those are all identical servers, 3 per
>> datacenter with 16 x 3.84 TB nvme disks, 128 AMD Epyc cores (on 2 sockets)
>> and 1.5 TB memory, i would not count them as "exceptionally large".

Gotcha.  Is this a converged cluster?  That's an excess of cores and RAM just 
for Ceph if not.

>> 
>> I will read up a little more on asych replication.

RGW: multisite
RBD: rbd-mirror
CephFS: snapshot mirroring, which is fairly recent
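
For the RBD case, snapshot-based mirroring is enabled per pool and image, 
roughly like this (pool/image names made up, peer bootstrap omitted):

    rbd mirror pool enable mypool image
    rbd mirror image enable mypool/myimage snapshot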

Part of the equation is keeping clients able to access the data, and whether 
you're solving for zero data *unavailability* vs zero data *loss*.  The 
latter is much easier than the former.
