On Mon, May 21, 2018 at 11:19 AM Andras Pataki <apat...@flatironinstitute.org> wrote:
> Hi Greg,
>
> Thanks for the detailed explanation - the examples make a lot of sense.
>
> One followup question regarding a two-level crush rule like:
>
>     step take default
>     step choose 3 type=rack
>     step chooseleaf 3 type=host
>     step emit
>
> If the erasure code has 9 chunks, this lines up exactly without any
> problems. What if the erasure code isn't a nice product of the racks and
> hosts/rack, for example 6+2 with the above example? Will it just take 3
> chunks in the first two racks and 2 from the last without any issues?

Yes, assuming your ceph install is new enough. (At one point it crashed if
you did that :o)

> The other direction I presume can't work, i.e. on the above example I
> can't put any erasure code with more than 9 chunks.

Right.

> Andras
>
>
> On 05/18/2018 06:30 PM, Gregory Farnum wrote:
>
> On Thu, May 17, 2018 at 9:05 AM Andras Pataki
> <apat...@flatironinstitute.org> wrote:
>
>> I've been trying to wrap my head around crush rules, and I need some
>> help/advice. I'm thinking of using erasure coding instead of
>> replication, and trying to understand the possibilities for planning
>> for failure cases.
>>
>> For a simplified example, consider a two-level topology: OSDs live on
>> hosts, and hosts live in racks. I'd like to set up a rule for a 6+3
>> erasure code that would put at most 1 of the 9 chunks on a host, and no
>> more than 3 chunks in a rack (so in case a rack is lost, we still have
>> a way to recover). Some racks may not have 3 hosts in them, so they
>> could potentially accept only 1 or 2 chunks. How can something like
>> this be implemented as a crush rule? Or, if not exactly this, something
>> in this spirit? I don't want to say that all chunks need to live in a
>> separate rack, because that is too restrictive (some racks may be much
>> bigger than others, or there might not even be 9 racks).
>
> Unfortunately, what you describe here is a little too detailed in ways
> CRUSH can't easily specify. You should think of a CRUSH rule as a
> sequence of steps that starts out at a root (the "take" step) and
> incrementally specifies more detail about which pieces of the CRUSH
> hierarchy it runs on, but runs the *same* rule on every piece it selects.
>
> So the simplest thing that comes close to what you suggest is (forgive
> me if my syntax is slightly off, I'm doing this from memory):
>
>     step take default
>     step chooseleaf n type=rack
>     step emit
>
> That would start at the default root, select "n" racks (9, in your case)
> and then for each rack find an OSD within it. (chooseleaf is special and
> more flexible than most of the CRUSH language; it's nice because if it
> can't find an OSD in one of the selected racks, it will pick another
> rack.)
>
> But a rule that's more illustrative of how things work is:
>
>     step take default
>     step choose 3 type=rack
>     step chooseleaf 3 type=host
>     step emit
>
> That one selects three racks, then selects three OSDs within different
> hosts *in each rack*. (You'll note that it doesn't necessarily work out
> so well if you don't want 9 OSDs!) If one of the racks it selected
> doesn't have 3 separate hosts... well, tough, it tried to do what you
> told it. :/
>
> If you were dedicated, you could split up your racks into
> equivalently-sized units — let's say rows. Then you could do:
>
>     step take default
>     step choose 3 type=row
>     step chooseleaf 3 type=host
>     step emit
>
> Assuming you have 3+ rows of good size, that'll get you 9 OSDs which are
> all on different hosts.
> -Greg
>
>> Thanks,
>>
>> Andras
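For readers who want to try the two-level rule discussed above, here is a
rough sketch of what it might look like in full decompiled-crushmap syntax
(erasure-coded pools use "indep" placement), together with one way to
compile it in and point a 6+3 pool at it. This is an illustration only, not
a tested recipe: the rule name "ec63_rack_host", the rule id, the
profile/pool names, and the PG counts are placeholders, and the exact
fields accepted in a rule vary by Ceph release (a Luminous-era cluster is
assumed here).

    rule ec63_rack_host {
            id 2
            type erasure
            min_size 9
            max_size 9
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            # take the default root, pick 3 racks, then 3 distinct hosts
            # within each selected rack (3 x 3 = 9 chunk placements)
            step take default
            step choose indep 3 type rack
            step chooseleaf indep 3 type host
            step emit
    }

    # extract, edit, recompile and inject the crush map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # ... add the rule above to crushmap.txt ...
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # create a 6+3 profile and pool, then attach the custom rule
    ceph osd erasure-code-profile set ec63 k=6 m=3 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec63
    ceph osd pool set ecpool crush_rule ec63_rack_host

As discussed above, this 3-racks-by-3-hosts shape really wants k+m = 9; a
smaller profile such as 6+2 can still map onto it on a new enough release,
while anything larger than 9 chunks cannot.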
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com