Hi,

the distance between the datacenters does not exceed 25 km (15 miles). The current 2-DC setup is with a different datacenter provider, but the same dark fibre provider.
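As a rough sanity check on what that distance means for latency (assuming roughly 200,000 km/s signal propagation in fibre and ignoring switching and serialization overhead):

    25 km one way  ->  25 / 200,000 s  ≈ 0.125 ms
    round trip     ->                   ≈ 0.25 ms RTT

so pure propagation should stay well under 1 ms round trip, far inside the < 10 ms guidance mentioned below.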
And yes, the clusters are hyperconverged, they are running Proxmox, and the Kubernetes nodes will run as VMs on top of Proxmox/KVM. Our experience is that 2 Kubernetes control planes are not a good choice; you should have 3 or 5 to get a proper quorum. I read up on the quorum for the mon nodes in stretched mode.

In regards to unavailability vs. zero data loss, we definitely prefer zero data loss, but depending on the setup we are willing to make compromises and risk having to restore machines from backup.

I will come up with a test plan (including simulating load), try to test different scenarios, and share the outcome.

Cheers
Soeren

________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: Monday, April 28, 2025 3:09 PM
To: Joachim Kraftmayer <joachim.kraftma...@clyso.com>
Cc: Soeren Malchow <soeren.malc...@convotis.com>; ceph-users@ceph.io <ceph-users@ceph.io>
Subject: Re: [ceph-users] Stretched pool or not ?

> Understood. So stretched pools also need a stretched Ceph cluster. The docs
> are a bit confusing; they refer to a stretched pool in a cluster that is not
> explicitly in stretch mode.

We should probably not use "stretch" to describe anything that isn't in a formal stretch mode cluster, as setting stretch mode affects behavior in certain ways.

> So a simple setup would be with replication size 3 for replicated pools and
> 3 or more Ceph monitors, ...

We want at least 2x mons per site + tiebreaker, so that not only can we form quorum, but the cluster can keep operating if one of them crashes.

> The reason behind having 3 datacenters is that we have a lot of k8s clusters
> which also need to have quorum. If I distribute the etcd nodes across 3
> datacenters, the outage of one datacenter will keep the k8s cluster
> operational.

I think with K8s you could employ a strategy similar to Ceph's stretch mode:

* K8s workers and OSDs at *2* sites
* 2x K8s control nodes + 1x Ceph mon at a tiebreaker site, which could even be just cloud VMs

That way the Ceph pools would only need R4 instead of R6.

> The latency between the datacenters is most likely very low (we cannot
> measure it, since I am in the planning stages).

I know of one commercial Ceph support organization that dictates < 10 ms RTT between OSD sites and < 100 ms RTT to a tiebreaker mon. These thresholds might inform decisions and predictions. A quick web search asserts:

> A rule of thumb is that RTT increases by approximately 1 millisecond (ms)
> for every 60 miles of distance.

The nuance of the formal stretch mode is the difference in how mon quorum is managed using reachability scores, and the automatic management of pools' min_size in order to maintain an operable cluster in the face of an entire DC going down. With a conventional cluster, if you have, say, 2 mons in one DC and 3 in another, loss of the second DC will result in an inoperable cluster unless one takes drastic manual action.

> The connections between the datacenters are on dark fibres connected through
> modules directly in the top-of-rack switches; compared to the local
> connectivity it will be almost the same. We have an existing similar setup
> between 2 datacenters where the WAN connection adds below 1 ms latency.

The same two DCs as would be in operation here?

> On "exceptionally large nodes": those are all identical servers, 3 per
> datacenter with 16 x 3.84 TB NVMe drives, 128 AMD Epyc cores (on 2 sockets)
> and 1.5 TB memory. I would not count them as "exceptionally large".

Gotcha. Is this a converged cluster? That's an excess of cores and RAM just for Ceph if not.
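Going back to the formal stretch mode above, a minimal sketch of what enabling it looks like. This is a sketch only: the mon names a-e, the CRUSH bucket names site1/site2/site3 and the rule name stretch_rule are placeholders, the CRUSH rule has to exist in the CRUSH map beforehand, and the exact procedure should be checked against the stretch mode documentation for the release in use:

    # connectivity-based mon elections and mon locations (2 per site + tiebreaker)
    ceph mon set election_strategy connectivity
    ceph mon set_location a datacenter=site1
    ceph mon set_location b datacenter=site1
    ceph mon set_location c datacenter=site2
    ceph mon set_location d datacenter=site2
    ceph mon set_location e datacenter=site3

    # enable stretch mode with mon e as the tiebreaker
    ceph mon enable_stretch_mode e stretch_rule datacenter

With stretch mode enabled, replicated pools run at size 4 with two copies per data site (the R4 mentioned above), and min_size is adjusted automatically if one of the sites goes down.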
> I will read up a little more on async replication.

RGW: multisite
RBD: rbd-mirror
CephFS: mirroring is fairly recent

Part of the equation is having the clients be able to access the data, including whether you're solving for zero data *unavailability* vs zero data *loss*. The latter is much easier than the former.
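For the RBD side (which is what the Proxmox VM disks would use), a minimal sketch of snapshot-based mirroring between two clusters. The pool name vms, the site names site-a/site-b, the image name and the 15m interval are placeholders, and an rbd-mirror daemon has to be running on the receiving cluster(s):

    # on site-a: enable per-image mirroring and create a bootstrap token
    rbd mirror pool enable vms image
    rbd mirror pool peer bootstrap create --site-name site-a vms > token

    # on site-b: enable mirroring and import the token (copied over out of band)
    rbd mirror pool enable vms image
    rbd mirror pool peer bootstrap import --site-name site-b --direction rx-tx vms token

    # per image on the primary side, plus a schedule for mirror snapshots
    rbd mirror image enable vms/vm-100-disk-0 snapshot
    rbd mirror snapshot schedule add --pool vms 15m

Being asynchronous, this accepts a bounded window of potential data loss in exchange for keeping the two clusters independent of each other's latency, which is exactly the *unavailability* vs. *loss* trade-off above.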