On 10-05-2025 00:25, Anthony D'Atri wrote:
Maybe a case for a new-style MSR rule?

Interesting, I wasn't aware of that new feature. We are on Squid, but we do have some older clients we need to do something about first.
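If we do go that route, my understanding from the Squid docs is that the MSR rule can be generated from the erasure code profile rather than written by hand. A rough sketch, with option names as I recall them from the docs, so please double-check (and whether the generated rule also keeps shards on distinct hosts within each DC is something we would have to verify); the profile and pool names are just placeholders:

  # names are placeholders, options per my reading of the Squid docs
  ceph osd erasure-code-profile set ec_4_8_msr \
      k=4 m=8 \
      crush-device-class=nvmebulk \
      crush-failure-domain=datacenter \
      crush-num-failure-domains=2 \
      crush-osds-per-failure-domain=6
  ceph osd pool create ec48pool erasure ec_4_8_msr

The point of MSR being that it retries across the whole hierarchy, so it can enforce "at most 6 shards per DC" without the older choose/chooseleaf two-step.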

Thanks.

Best regards,

Torkil

On May 9, 2025, at 5:13 PM, Torkil Svensgaard <tor...@drcmr.dk> wrote:



On 08-05-2025 21:14, Peter Linder wrote:
There is also the issue that if you have a 4+8 EC pool, you ideally need at least 4+8 = 12 of whatever your failure domain is, in this case DCs. This is more than most people have.

Is this k=4, m=8? What is the benefit of this compared to an ordinary replicated pool with 3 copies?

My bad, I think I've misunderstood the definition of a failure domain; it would actually be host.

We are going to have 2 DCs, each with 7+ hosts, and a tiebreaker MON in a third DC. That should allow us to lose one DC and an additional host and still be online.
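The CRUSH side of that would just be a datacenter bucket per site with the hosts moved under them, plus a location for the tiebreaker MON. Roughly (bucket, host, and MON names made up):

  ceph osd crush add-bucket dc1 datacenter
  ceph osd crush add-bucket dc2 datacenter
  ceph osd crush move dc1 root=default
  ceph osd crush move dc2 root=default
  # one line per host; dc1/dc2 and the host names are placeholders
  ceph osd crush move ceph-a01 datacenter=dc1
  ceph osd crush move ceph-b01 datacenter=dc2
  # the tiebreaker MON in the third DC holds no OSDs, it only needs a mon location
  ceph mon set_location tiebreaker datacenter=dc3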

Even if you set the failure domain to, say, rack, there is no guarantee that no PG ends up with more than 8 parts in a single DC without some crushmap trickery.

We would use crush to ensure the placement we want, something like this:

rule EC_4_8 {
        id ZYX
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        # start at the default root, restricted to the nvmebulk device class
        step take default class nvmebulk
        # descend into the datacenter buckets (both of them)
        step choose indep 0 type datacenter
        # then 6 distinct hosts in each DC, one shard per host -> 2 x 6 = 12 shards
        step chooseleaf indep 6 type host
        step emit
}
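Before injecting anything we would sanity check the mappings offline with crushtool, something like:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # add the rule above to crushmap.txt, then recompile
  crushtool -c crushmap.txt -o crushmap.new
  # 12 shards per PG for 4+8; eyeball that no mapping puts more than 6 in one DC
  crushtool -i crushmap.new --test --rule <rule id> --num-rep 12 --show-mappings
  ceph osd setcrushmap -i crushmap.new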

If this is k=8, m=4, then only 4 failures can be handled, and there is no way to split the 12 parts so that both DCs hold 4 or fewer at the same time.
You really need 3 DCs and a fast, highly available network in between.
/Peter
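Right. Spelling out the arithmetic: with k=4, m=8 there are 12 shards and any 4 of them are enough to read, so 6 shards per DC means a whole DC can go dark and the surviving 6 >= 4 still serve the data. With k=8, m=4 you need 8 of the 12 shards, but spreading 12 over 2 DCs puts at least ceil(12/2) = 6 in one of them; lose that DC and at most 6 < 8 remain, so those PGs cannot be served.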
On 2025-05-08 at 17:45, Anthony D'Atri wrote:
To be pedantic … backfill usually means copying data in toto, so like normal 
write replication it necessarily has to traverse the WAN.

Recovery of just a lost shard/replica could in theory stay local with the LRC plugin, but as noted that doesn't seem like a good choice. With the default EC plugin there *may* be some read-locality preference, but it's not something I would bank on.

We looked at the LRC plugin and we don't think it would be worth the risk going 
with that since it seems somewhat abandoned and not really used by anyone.
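For reference, it would have been something along these lines, using the l/locality parameters from the LRC plugin docs (numbers purely illustrative, not what we would deploy):

  ceph osd erasure-code-profile set lrc_test \
      plugin=lrc k=4 m=2 l=3 \
      crush-locality=datacenter \
      crush-failure-domain=host

The extra local parity per group of l=3 chunks is what would let a single lost shard be rebuilt from within one datacenter.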

Stretch clusters are great when you need zero RPO, really need a single cluster, and can manage client endpoint use accordingly. But there are tradeoffs; in many cases two clusters with async replication can be a better solution, depending on needs and what you're solving for.

We did consider two clusters + replication but then we would need more hardware 
to get the same usable space, and money is scarce.

The WAN would probably be 2x10G and at a distance of less than 10km. The pools 
would mainly be bulk storage so I think that should work ok.

Thanks all.

Best regards,

Torkil

On May 7, 2025, at 5:06 AM, Janne Johansson <icepic...@gmail.com> wrote:

On Wed, 7 May 2025 at 10:59, Torkil Svensgaard <tor...@drcmr.dk> wrote:
We are looking at a cluster split between two DCs with the DCs as
failure domains.

Am I right in assuming that any recovery or backfill taking place should
largely happen inside each DC and not between them? Or can no such
assumptions be made?
Pools would be EC 4+8, if that matters.

Unless I am mistaken, the first/primary OSD of each PG is the one "doing" the backfills, so if the primaries are evenly distributed between the sites, the source of a backfill would be in the remote DC in 50% of the cases.

I do not think backfill is going to work out how to use only "local" pieces to rebuild a missing/degraded PG shard without going over the DC-DC link, even if that is theoretically possible.
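If you want to see how the acting primaries actually land once the pool exists, a rough per-OSD count can be had with something like this (the column layout of pgs_brief varies a bit between releases; the last column should be the acting primary):

  ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print "osd." $NF}' | sort | uniq -c | sort -rn

and then map the busiest OSDs to hosts/DCs with "ceph osd tree".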

--
May the most significant bit of your life be positive.
It's good to be 8-bit-clean; if you aren't, then Kermit can compensate.



--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
