Maybe a case for a new-style MSR rule?
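
By MSR I mean the newer multi-step retry CRUSH rules, which can retry across 
the datacenter step when a host is down, something a conventional rule like 
the one quoted below cannot do. Rough sketch from memory, assuming a Squid or 
later release; the profile option names here should be checked against the 
docs for your version before relying on them:

   # Hypothetical profile asking for 2 failure domains (DCs) with 6 OSDs
   # each; recent releases can generate an MSR rule from options like these.
   ceph osd erasure-code-profile set ec-4-8-2dc \
       k=4 m=8 \
       crush-failure-domain=datacenter \
       crush-num-failure-domains=2 \
       crush-osds-per-failure-domain=6 \
       crush-device-class=nvmebulk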

> On May 9, 2025, at 5:13 PM, Torkil Svensgaard <tor...@drcmr.dk> wrote:
> 
> 
> 
>> On 08-05-2025 21:14, Peter Linder wrote:
>> There is also the issue that if you have a 4+8 EC pool, you ideally need at 
>> least 12 (4+8) instances of whatever your failure domain is, in this case 
>> DCs. This is more than most people have.
>> Is this k=4, m=8? What is the benefit of this compared to an ordinary 
>> replicated pool with 3 copies?
> 
> My bad, I think I've misunderstood the definition of a failure domain, it 
> would actually be host.
> 
> We are going to have 2 DCs, each with 7+ hosts, and a tiebreaker MON in a 
> third DC. That should allow us to lose one DC plus an additional host and 
> still be online.
> 
>> Even if you set the failure domain to, say, rack, there is no guarantee 
>> that no PG ends up with more than 8 parts in a single DC without some 
>> crushmap trickery.
> 
> We would use crush to ensure the placement we want, something like this:
> 
> rule EC_4_8 {
>        id ZYX
>        type erasure
>        step set_chooseleaf_tries 5
>        step set_choose_tries 100
>        step take default class nvmebulk
>        step choose indep 0 type datacenter
>        step chooseleaf indep 6 type host
>        step emit
> }
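
(A sketch, if it helps: a rule like that can be sanity-checked offline before 
it is injected, substituting the real rule id and the pool's k+m for num-rep:

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt     # add/inspect the rule in the text dump
   crushtool -c crush.txt -o crush.new
   crushtool -i crush.new --test --rule <id> --num-rep 12 --show-mappings

--show-mappings lists the OSD set chosen for each input, and adding 
--show-bad-mappings flags any input that could not be fully mapped; checking 
the 6-per-DC split still means mapping the OSD ids back to hosts/DCs, e.g. 
with ceph osd tree.)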
> 
>> If this is k=8, m=4, then only 4 failures can be tolerated, and there is no 
>> way to split the 12 parts so that both DCs hold 4 or fewer at the same time 
>> (one DC will always hold at least 6).
>> You really need 3 DCs and a fast, highly available network in between.
>> /Peter
>>> On 2025-05-08 at 17:45, Anthony D'Atri wrote:
>>> To be pedantic … backfill usually means copying data in toto, so like 
>>> normal write replication it necessarily has to traverse the WAN.
>>> 
>>> Recovery of just a lost shard/replica is possible in theory with the LRC 
>>> plugin, but as noted that doesn’t seem like a good choice.  With the 
>>> default EC plugin, there *may* be some read locality preference, but it’s 
>>> not something I would bank on.
> 
> We looked at the LRC plugin and we don't think it would be worth the risk, 
> since it seems somewhat abandoned and not really used by anyone.
> 
>>> Stretch clusters are great when you need zero RPO, really need a single 
>>> cluster, and can manage client endpoint use accordingly.  But there are 
>>> tradeoffs; in many cases two clusters with async replication can be a 
>>> better solution, depending on needs and what you’re solving for.
> 
> We did consider two clusters + replication but then we would need more 
> hardware to get the same usable space, and money is scarce.
> 
> The WAN link would probably be 2x10G over a distance of less than 10 km. The 
> pools would mainly be bulk storage, so I think that should work OK.
> 
> Thanks all.
> 
> Best regards,
> 
> Torkil
> 
>>>> On May 7, 2025, at 5:06 AM, Janne Johansson <icepic...@gmail.com> wrote:
>>>> 
>>>> On Wed, 7 May 2025 at 10:59, Torkil Svensgaard <tor...@drcmr.dk> wrote:
>>>>> We are looking at a cluster split between two DCs with the DCs as
>>>>> failure domains.
>>>>> 
>>>>> Am I right in assuming that any recovery or backfill taking place should
>>>>> largely happen inside each DC and not between them? Or can no such
>>>>> assumptions be made?
>>>>> Pools would be EC 4+8, if that matters.
>>>> Unless I am mistaken, the first/primary of each PG is the one "doing"
>>>> the backfills, so if the primaries are evenly distributed between the
>>>> sites, the source of all backfills would be in the remote DC in 50% of
>>>> the cases.
>>>> I do not think backfill is going to work out how to use only "local" 
>>>> pieces to rebuild a missing/degraded PG shard without going over the 
>>>> DC-DC link, even if that is theoretically possible.
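
(If you want to see how the acting primaries end up distributed, something 
along these lines should give a quick count per OSD; the pool name is a 
placeholder and the JSON field names should be double-checked against your 
release:

   ceph pg ls-by-pool <poolname> -f json | \
       jq '[.pg_stats[].acting_primary] | group_by(.) | map({osd: .[0], pgs: length})'

The resulting OSD ids can then be mapped to DCs with ceph osd tree.)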
>>>> 
>>>> --
>>>> May the most significant bit of your life be positive.
>>> It’s good to be 8-bit clean; if you aren’t, then Kermit can compensate.
>>> 
> 
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
