Maybe a case for a new-style MSR rule?
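If you're on Squid or later, the new MSR (multi-step retry) rule type might fit this exactly: the erasure-code profile can ask for a fixed number of failure domains and a fixed number of OSDs per failure domain, and the CRUSH rule is generated for you. A rough sketch for the 4+8, two-DC case is below. Treat it as a sketch only: the profile and rule names are arbitrary, the option names (crush-num-failure-domains, crush-osds-per-failure-domain) are from memory of the Squid erasure-code profile docs and should be checked against your release, and I have not verified that the generated rule also spreads the six shards per DC across distinct hosts.

    # Sketch only, assuming Squid (19.x) or later. Profile/rule names are
    # arbitrary; "nvmebulk" is the device class from the hand-written rule
    # quoted below.
    ceph osd erasure-code-profile set ec_4_8_msr \
        k=4 m=8 \
        crush-device-class=nvmebulk \
        crush-failure-domain=datacenter \
        crush-num-failure-domains=2 \
        crush-osds-per-failure-domain=6

    # Materialise a rule from the profile and inspect what was generated
    ceph osd crush rule create-erasure ec_4_8_msr_rule ec_4_8_msr
    ceph osd crush rule dump ec_4_8_msr_rule

If the generated rule doesn't enforce the per-host spread inside each DC, the hand-written rule quoted below is still the way to go.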
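Whichever rule you end up with, it's worth running it through crushtool before creating the pool, to confirm the concern raised below (a PG ending up with too many shards in one DC) can't happen. Something like this, with placeholder file and rule ids:

    # Grab the compiled CRUSH map from the running cluster
    ceph osd getcrushmap -o crushmap.bin

    # Simulate placements for a 12-shard (4+8) pool with the rule's numeric id;
    # --show-mappings prints the OSD list chosen for each PG so you can check
    # (by eye or with a small script) how many land in each DC
    crushtool -i crushmap.bin --test --rule <rule-id> --num-rep 12 --show-mappings

    # --show-bad-mappings instead flags any mapping CRUSH could not fully fill
    crushtool -i crushmap.bin --test --rule <rule-id> --num-rep 12 --show-bad-mappings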
> On May 9, 2025, at 5:13 PM, Torkil Svensgaard <tor...@drcmr.dk> wrote:
>
>> On 08-05-2025 21:14, Peter Linder wrote:
>> There is also the issue that if you have a 4+8 EC pool, you ideally need
>> at least 4+8 of whatever your failure domain is, in this case DCs. This
>> is more than most people have.
>>
>> Is this k=4, m=8? What is the benefit of this compared to an ordinary
>> replicated pool with 3 copies?
>
> My bad, I think I've misunderstood the definition of a failure domain; it
> would actually be host.
>
> We are going to have 2 DCs, each with 7+ hosts, and a tiebreaker MON in a
> third DC. That should allow us to lose one DC and an additional host and
> still be online.
>
>> Even if you set the failure domain to, say, rack, there is no guarantee
>> that there is no PG with more than 8 parts in a single DC without some
>> crushmap trickery.
>
> We would use CRUSH to ensure the placement we want, something like this:
>
> rule EC_4_8 {
>     id ZYX
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class nvmebulk
>     step choose indep 0 type datacenter
>     step chooseleaf indep 6 type host
>     step emit
> }
>
>> If this is k=8, m=4, then only 4 failures can be handled and there is no
>> way to split 12 parts so that both DCs contain 4 or fewer at the same
>> time. You really need 3 DCs and a fast, highly available network in
>> between.
>>
>> /Peter
>>
>>> On 2025-05-08 at 17:45, Anthony D'Atri wrote:
>>> To be pedantic … backfill usually means copying data in toto, so like
>>> normal write replication it necessarily has to traverse the WAN.
>>>
>>> Recovery of just a lost shard/replica could in theory stay local with
>>> the LRC plugin, but as noted that doesn’t seem like a good choice. With
>>> the default EC plugin, there *may* be some read locality preference,
>>> but it’s not something I would bank on.
>
> We looked at the LRC plugin and we don't think it would be worth the risk
> to go with it, since it seems somewhat abandoned and not really used by
> anyone.
>
>>> Stretch clusters are great when you need zero RPO, really need a single
>>> cluster, and can manage client endpoint use accordingly. But there are
>>> tradeoffs; in many cases two clusters with async replication can be a
>>> better solution, depending on needs and what you’re solving for.
>
> We did consider two clusters + replication, but then we would need more
> hardware to get the same usable space, and money is scarce.
>
> The WAN would probably be 2x10G at a distance of less than 10 km. The
> pools would mainly be bulk storage, so I think that should work ok.
>
> Thanks all.
>
> Best regards,
>
> Torkil
>
>>>> On May 7, 2025, at 5:06 AM, Janne Johansson <icepic...@gmail.com> wrote:
>>>>
>>>> On Wed, 7 May 2025 at 10:59, Torkil Svensgaard <tor...@drcmr.dk> wrote:
>>>>> We are looking at a cluster split between two DCs with the DCs as
>>>>> failure domains.
>>>>>
>>>>> Am I right in assuming that any recovery or backfill taking place
>>>>> should largely happen inside each DC and not between them? Or can no
>>>>> such assumptions be made?
>>>>>
>>>>> Pools would be EC 4+8, if that matters.
>>>>
>>>> Unless I am mistaken, the first/primary of each PG is the one "doing"
>>>> the backfills, so if the primaries are evenly distributed between the
>>>> sites, the source of all backfills would be in the remote DC in 50% of
>>>> the cases.
>>>> I do not think the backfills are going to calculate how they can use
>>>> only "local" pieces to rebuild a missing/degraded PG piece without
>>>> going over the DC-DC link, even if it is theoretically possible.
>>>>
>>>> --
>>>> May the most significant bit of your life be positive.
>>>
>>> It’s good to be 8-bit-clean; if you aren’t, then Kermit can compensate.
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io