Hi Gregory,

Thanks for your answer.

I had to add an extra "step emit" (after the first chooseleaf pass) to your suggestion to make it work:

step take default
step chooseleaf indep 4 type host
step emit
step take default
step chooseleaf indep 4 type host
step emit
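
For completeness, the full rule in my test map now looks roughly like this
(reusing the ruleset id and min/max_size from my original erasure_ruleset;
the actual id differs per map):

rule erasure_ruleset {
  ruleset X
  type erasure
  min_size 8
  max_size 8
  step take default
  step chooseleaf indep 4 type host
  step emit
  step take default
  step chooseleaf indep 4 type host
  step emit
}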

However, now each OSD is chosen twice for every PG:

# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 \
    --num-rep 8
CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]

I'm wondering why something like the following won't work (the crushtool
test mapping comes back empty):

step take default
step chooseleaf indep 4 type host
step choose indep 2 type osd
step emit


# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 \
    --num-rep 8
CRUSH rule 1 x 1 []

Kind regards,
Caspar Smit

2018-02-02 19:09 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:

> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit <caspars...@supernas.eu>
> wrote:
> > Hi all,
> >
> > I'd like to set up a small cluster (5 nodes) using erasure coding. I would
> > like to use k=5 and m=3.
> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
> > this.
> >
> > Then I found this blog:
> > https://ceph.com/planet/erasure-code-on-small-clusters/
> >
> > This sounded ideal to me, so I started building a test setup using the 5+3
> > profile.
> >
> > Changed the erasure ruleset to:
> >
> > rule erasure_ruleset {
> >   ruleset X
> >   type erasure
> >   min_size 8
> >   max_size 8
> >   step take default
> >   step choose indep 4 type host
> >   step choose indep 2 type osd
> >   step emit
> > }
> >
> > Created a pool, and now every PG has 8 shards across 4 hosts with 2 shards
> > each. Perfect.
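> >
> > (Roughly how the profile and pool were created, for reference; the
> > profile/pool names and PG count here are just example placeholders:)
> >
> > ceph osd erasure-code-profile set ec53 k=5 m=3 crush-failure-domain=host
> > ceph osd pool create ecpool 256 256 erasure ec53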
> >
> > But then I tested a node failure. No problem at first: all PGs stayed active
> > (most undersized+degraded, but still active). Then, after 10 minutes, the
> > OSDs on the failed node were all marked out, as expected.
> >
> > I waited for the data to be recovered to the remaining (fifth) node, but that
> > never happened; there was no recovery whatsoever.
> >
> > Only when I completely remove the down+out OSDs from the cluster is the
> > data recovered.
> >
> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
> > beforehand to store data on.
>
> Hmm, basically, yes. The basic process is:
>
> >   step take default
>
> take the default root.
>
> >   step choose indep 4 type host
>
> Choose four hosts that exist under the root. *Note that at this layer,
> it has no idea what OSDs exist under the hosts.*
>
> >   step choose indep 2 type osd
>
> Within the host chosen above, choose two OSDs.
>
>
> Marking out an OSD does not change the weight of its host, because
> changing the host weight would cause massive data movement across the
> whole cluster on a single disk failure. The "chooseleaf" steps deal
> with this (because if
> they fail to pick an OSD within the host, they will back out and go
> for a different host), but that doesn't work when you're doing
> independent "choose" steps.
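>
> (If I'm remembering the crushtool flags right, you can simulate an OSD
> being out in a test run by zeroing its weight, e.g. osd.5 here, against
> your compiled map, and see whether the mapping changes:)
>
> crushtool --test -i <compiled-crushmap> --rule 1 --show-mappings \
>     --x 1 --num-rep 8 --weight 5 0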
>
> I don't remember the implementation details well enough to be sure,
> but you *might* be able to do something like
>
> step take default
> step chooseleaf indep 4 type host
> step take default
> step chooseleaf indep 4 type host
> step emit
>
> And that will make sure you get at least 4 OSDs involved?
> -Greg
>
> >
> > Would it be possible to do something like this:
> >
> > Create a 5+3 EC profile where every host holds a maximum of 2 shards (so 4
> > hosts are needed), and in case of a node failure, recover the data from the
> > failed node to the fifth node.
> >
> > Thank you in advance,
> > Caspar
> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
