Hi Gregory,

Thanks for your answer.
I had to add another step emit to your suggestion to make it work:

    step take default
    step chooseleaf indep 4 type host
    step emit
    step take default
    step chooseleaf indep 4 type host
    step emit

However, now the same OSD is chosen twice for every PG:

    # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 --num-rep 8
    CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]

I'm wondering why something like this won't work (the crushtool test ends up empty):

    step take default
    step chooseleaf indep 4 type host
    step choose indep 2 type osd
    step emit

    # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 --num-rep 8
    CRUSH rule 1 x 1 []

(For reference I've put the full rule I'm testing and my crushtool edit/test cycle below the quoted thread.)

Kind regards,
Caspar Smit

2018-02-02 19:09 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit <caspars...@supernas.eu> wrote:
> > Hi all,
> >
> > I'd like to set up a small cluster (5 nodes) using erasure coding. I would
> > like to use k=5 and m=3. Normally you would need a minimum of 8 nodes
> > (preferably 9 or more) for this.
> >
> > Then I found this blog:
> > https://ceph.com/planet/erasure-code-on-small-clusters/
> >
> > This sounded ideal to me, so I started building a test setup using the
> > 5+3 profile.
> >
> > I changed the erasure ruleset to:
> >
> > rule erasure_ruleset {
> >     ruleset X
> >     type erasure
> >     min_size 8
> >     max_size 8
> >     step take default
> >     step choose indep 4 type host
> >     step choose indep 2 type osd
> >     step emit
> > }
> >
> > I created a pool, and now every PG has 8 shards across 4 hosts with 2
> > shards each. Perfect.
> >
> > But then I tested a node failure. No problem there either: all PGs stay
> > active (most undersized+degraded, but still active), and after 10 minutes
> > the OSDs on the failed node were all marked out, as expected.
> >
> > I waited for the data to be recovered to the other (fifth) node, but that
> > doesn't happen; there is no recovery whatsoever.
> >
> > Only when I completely remove the down+out OSDs from the cluster is the
> > data recovered.
> >
> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
> > beforehand to store data on.
>
> Hmm, basically, yes. The basic process is:
>
> > step take default
>
> Take the default root.
>
> > step choose indep 4 type host
>
> Choose four hosts that exist under the root. *Note that at this layer,
> it has no idea what OSDs exist under the hosts.*
>
> > step choose indep 2 type osd
>
> Within each host chosen above, choose two OSDs.
>
> Marking out an OSD does not change the weight of its host, because that
> would cause massive data movement across the whole cluster on a single
> disk failure. The "chooseleaf" commands deal with this (because if they
> fail to pick an OSD within the host, they will back out and go for a
> different host), but that doesn't work when you're doing independent
> "choose" steps.
>
> I don't remember the implementation details well enough to be sure,
> but you *might* be able to do something like
>
>     step take default
>     step chooseleaf indep 4 type host
>     step take default
>     step chooseleaf indep 4 type host
>     step emit
>
> And that will make sure you get at least 4 OSDs involved?
> -Greg
>
> > Would it be possible to do something like this:
> >
> > Create a 5+3 EC profile where every host has a maximum of 2 shards (so 4
> > hosts are needed), and in case of node failure, recover the data from the
> > failed node to the fifth node.
> >
> > Thank you in advance,
> > Caspar
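P.S. For reference, the full rule I'm testing looks roughly like this. The rule
name and ruleset number are just what I used in my test map, and min_size /
max_size are copied from the original erasure ruleset; only the steps come from
your suggestion (plus the extra emit):

    rule erasure_ruleset_new {
        ruleset 1
        type erasure
        min_size 8
        max_size 8
        step take default
        step chooseleaf indep 4 type host
        step emit
        step take default
        step chooseleaf indep 4 type host
        step emit
    }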
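And this is roughly the edit/test cycle I'm using to get the mappings above
(file names are just placeholders):

    # grab and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit the rule in crushmap.txt, then recompile it
    crushtool -c crushmap.txt -o compiled-crushmap-new

    # check the mappings for a range of inputs before injecting anything
    crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 \
        --show-mappings --min-x 1 --max-x 1024

    # only list inputs that don't map to the full 8 OSDs
    crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 \
        --show-bad-mappings --min-x 1 --max-x 1024

    # inject the new map only once the mappings look sane
    ceph osd setcrushmap -i compiled-crushmap-new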