Hi,
On 03/07/2017 05:53 PM, Francois Blondel wrote:
> Hi all,
>
> We have (only) 2 separate "rooms" (crush buckets) and would like to
> build a cluster that can handle the complete loss of one room.
>
> *snipsnap*
>
> Our second idea would be to use Erasure Coding, as it fits our
> performance requirements and would use less raw space.
>
> Creating an EC profile like
> "ceph osd erasure-code-profile set eck2m2room k=2 m=2
> ruleset-failure-domain=room"
> and a pool using that EC profile, with "ceph osd pool create ecpool
> 128 128 erasure eck2m2room", of course leads to 128
> "creating+incomplete" PGs, as we only have 2 rooms.
>
> Is there somehow a way to store the "parity chunks" (m) in both
> rooms, so that the loss of one room could be tolerated?
>
> If I understood correctly, erasure coding with, for example, k=2 and
> m=2 would use the same raw space as replication with a size of 2, but
> be more reliable, as we could afford the loss of more OSDs at the
> same time.
>
> Would it be possible to instruct the crush rule to store the first k
> and m chunks in room 1, and the second k and m chunks in room 2?
As far as I understand erasure coding, there is no special handling for
parity versus data chunks. To assemble an EC object you just need any k
chunks, regardless of whether they are data or parity chunks.
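To illustrate: with k=2 and m=2 each object is split into 4 chunks, each
half the size of the object, so the raw overhead is (2+2)/2 = 2x, the same
as size-2 replication, but any 2 of the 4 chunks are enough to reconstruct
the object, i.e. any two of the OSDs holding them can be lost.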
You should be able to distribute the chunks across the two rooms by
creating a new crush rule along these lines (a sketch in crush map syntax
follows below):
- min_size 4
- max_size 4
- step take <first room>
- step chooseleaf firstn 2 type host
- step emit
- step take <second room>
- step chooseleaf firstn 2 type host
- step emit
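For illustration, such a rule might look roughly like this in the
decompiled crush map (the rule name, the rule id and the room names
room1/room2 are assumptions here; note that erasure-coded pools normally
use 'indep' rather than 'firstn' steps):

    rule eck2m2_tworooms {
            ruleset 1
            type erasure
            min_size 4
            max_size 4
            step set_chooseleaf_tries 5
            step take room1
            step chooseleaf indep 2 type host
            step emit
            step take room2
            step chooseleaf indep 2 type host
            step emit
    }

Something along these lines should get it into the cluster and attached to
the pool (again only a sketch, adjust names and ids to your setup):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # add the rule to crushmap.txt, then recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    # point the EC pool at the new rule ('crush_rule' on newer releases)
    ceph osd pool set ecpool crush_ruleset 1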
I'm not 100% sure whether chooseleaf is correct here or whether another
choose step is necessary to ensure that the two OSDs are picked from
different hosts. The important point is using two choose-emit cycles with
the correct start points; just insert the crush bucket names of your rooms.
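If chooseleaf should not do the right thing, a variant with explicit
choose steps (again just an untested sketch) would first pick two hosts in
the room and then one OSD from each:

    step take room1
    step choose indep 2 type host
    step choose indep 1 type osd
    step emit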
This approach should work, but it has two drawbacks:
- crash handling
If a host in a room fails, the PGs from that host will be recovered onto
other hosts in the same room. You have to ensure that there is enough
spare capacity in each room (as opposed to just enough capacity in the
cluster as a whole), which might be tricky.
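As a rough illustration with made-up numbers: with 3 hosts per room and 2
chunks placed per room, a failed host's chunks have to be re-created on
the 2 remaining hosts in that room, so those hosts should stay below
roughly 2/3 of their raw capacity for recovery to complete.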
- bandwidth / host utilization
Almost all ceph-based applications/libraries use the 'primary' OSD for
accessing data in a PG, and the primary OSD is the first one generated by
the crush rule. In the example above, the primary OSDs will all be located
in the first room, so all client traffic will be directed to the hosts in
that room. Depending on your setup this might not be desirable.
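You can check where the primaries actually end up with something like:

    ceph pg dump pgs_brief           # lists up/acting sets and the acting primary per PG
    ceph osd map ecpool someobject   # 'someobject' is just an example object name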
Unfortunately I'm not aware of a solution for that. It would require
replacing 'step take <first room>' with something like 'step take <one
room>' and 'step take <second room>' with 'step take <a different room>',
and that kind of iteration over buckets is not part of crush as far as I
know. Maybe someone else can give some more insight into this.
Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com