I realize we're probably kind of pushing it. It was, however, the only
option I could think of that would satisfy the following:
- Have separate servers for HDD and NVMe storage, spread out over 3 data centers.
- Always select 1 NVMe and 2 HDDs, in separate data centers (and make sure the NVMe is primary).
- If one data center goes down, only lose 1/3 of the NVMes.
I tried making a CRUSH rule that first selects an NVMe based on device
class, and then selects 2 HDDs based on class. However, I couldn't make
it guarantee that they would be in separate data centers, probably
because of the two separate chooseleaf statements; sometimes one of the
HDDs would end up in the same data center as the NVMe. I played around
with this for quite some time.
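Roughly what I was attempting, for reference (rule name, id and bucket
names below are just illustrative, not our actual map):

    rule hybrid_nvme_primary {
        id 5
        type replicated
        min_size 3
        max_size 3
        # first pass: one NVMe (emitted first, so it becomes primary)
        step take default class nvme
        step chooseleaf firstn 1 type datacenter
        step emit
        # second pass: two HDDs in two distinct datacenters
        # (-1 = pool size minus one, i.e. 2 for size=3)
        step take default class hdd
        step chooseleaf firstn -1 type datacenter
        step emit
    }

As far as I understand, CRUSH only avoids collisions within each
take/emit pass, so nothing stops the HDD pass from landing in the same
datacenter that the NVMe pass already used.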
Just selecting 3 separate data centers instead sometimes resulted in 2
or 3 NVMes, or no NVMes at all. We do in fact have a separate pool with
3x NVMe for the high-performance requirements, but that one uses a
traditional "default" tree.
Rearranging the OSD map and reducing the rule to a single chooseleaf
seems to work though, and we will manually alter the weights outside of
the hosts to make life easier for CRUSH :).
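The rule itself then becomes almost trivial; a sketch of the idea
(bucket names made up):

    rule hybrid_single {
        id 7
        type replicated
        min_size 3
        max_size 3
        step take hybrid_root
        # pick one "virtual" datacenter, then its three hosts; each
        # virtual datacenter only contains 1 NVMe host and 2 HDD hosts
        # in three different physical locations, so the mix is
        # guaranteed by the map itself
        step choose firstn 1 type datacenter
        step chooseleaf firstn 0 type host
        step emit
    }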
If we want to add more servers, we will just add another layer in
between and make sure the weights there do not differ too much when we
plan it out.
/Peter
On 2018-01-29 at 17:52, Gregory Farnum wrote:
CRUSH is a pseudorandom, probabilistic algorithm. That can lead to
problems with extreme input.
In this case, you've given it a bucket in which one child contains
~3.3% of the total weight, and there are only three weights. So on
only 3% of "draws", as it tries to choose a child bucket to descend
into, will it choose that small one first.
And then you've forced it to select...each of the hosts in that data
center, for all inputs? How can that even work in terms of actual data
storage, if some of them are an order of magnitude larger than the others?
Anyway, leaving that bit aside since it looks like you're mapping each
host to multiple DCs, you're giving CRUSH a very difficult problem to
solve. You can probably "fix" it by turning up the choose_retries
value (or whatever it is) to a high enough level that trying to map a
PG eventually actually grabs the small host. But I wouldn't be very
confident in a solution like this; it seems very fragile and subject
to input error.
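(For anyone who wants to try that: I believe the per-rule steps are
set_choose_tries and set_chooseleaf_tries, roughly like this, with the
values picked arbitrarily high:

    rule example {
        id 10
        type replicated
        min_size 3
        max_size 3
        step set_choose_tries 200
        step set_chooseleaf_tries 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

But as said, I wouldn't rely on it.)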
-Greg
On Mon, Jan 29, 2018 at 6:45 AM Peter Linder
<peter.lin...@fiberdirekt.se> wrote:
We kind of turned the crushmap inside out a little bit.
Instead of the traditional "for 1 PG, select OSDs from 3 separate data
centers" we did "force selection from only one datacenter (out of 3)
and leave only enough options to make sure precisely 1 SSD and 2 HDDs
are selected".
We then organized these "virtual datacenters" in the hierarchy so that
any one of them in fact contains 3 options that lead to 3 physically
separate servers in different locations.
Every physical datacenter has both SSDs and HDDs. The idea is that if
one datacenter is lost, 2/3 of the SSDs still remain (and can be mapped
to by marking the missing ones "out") so performance is maintained.
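A rough sketch of what one of those virtual datacenters looks like in
the crushmap (host names and weights are made up):

    datacenter virtual-dc1 {
        id -10
        alg straw2
        hash 0  # rjenkins1
        item nvme-host-a weight 1.000    # NVMe host in physical DC A
        item hdd-host-b weight 10.000    # HDD host in physical DC B
        item hdd-host-c weight 10.000    # HDD host in physical DC C
    }

    root hybrid_root {
        id -1
        alg straw2
        hash 0
        item virtual-dc1 weight 21.000
        item virtual-dc2 weight 21.000
        item virtual-dc3 weight 21.000
    }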
On 2018-01-29 at 13:35, Niklas wrote:
> Yes.
> It is a hybrid solution where a placement group is always located on
> one NVMe drive and two HDD drives. The advantages are great read
> performance and cost savings; the disadvantage is lower write
> performance. Still, the write performance is good thanks to RocksDB on
> Intel Optane disks in the HDD servers.
>
> The real world looks more like what I described in a previous question
> (2018-01-23) here on the ceph-users list, "Ruleset for optimized Ceph
> hybrid storage". Nobody answered, so I am guessing it is not possible
> to create the rule I want. Now I am trying to solve it with virtual
> datacenters in the crush map, which works but is maybe not the most
> optimal solution.
>
>
> On 2018-01-29 13:21, Wido den Hollander wrote:
>>
>>
>> On 01/29/2018 01:14 PM, Niklas wrote:
>>> ...
>>>
>>
>> Is it your intention to put all copies of an object in only one DC?
>>
>> What is your exact idea behind this rule? What's the purpose?
>>
>> Wido
>>