Hi Konstantin,

I could only dream of reading this answer! Thank you so much!!!

Regards,
Cody


On Tue, Aug 21, 2018 at 8:50 AM Konstantin Shalygin <k0...@k0ste.ru> wrote:
>
> On 08/20/2018 08:15 PM, Cody wrote:
>
> Hi Konstantin,
>
> Thank you for looking into my question.
>
> I was trying to understand how to set up CRUSH hierarchies and set
> rules for different failure domains. I am particularly confused by the
> 'step take' and 'step choose|chooseleaf' settings, which I think are
> the keys to defining a failure domain in a CRUSH rule.
>
> My hypothetical cluster is made of 3 racks with 2 hosts in each. One
> host has 3 SSD-based OSDs and the other has 3 HDD-based OSDs. I want
> to create two rules: one that uses SSDs only and another that uses
> HDDs only. Both rules should have a rack-level failure domain.
>
> I have attached a diagram that may help to explain my setup. The
> following is my CRUSH map configuration (with all typos fixed) for
> review:
>
> device 0 osd.0 class ssd
> device 1 osd.1 class ssd
> device 2 osd.2 class ssd
> device 3 osd.3 class hdd
> device 4 osd.4 class hdd
> device 5 osd.5 class hdd
> device 6 osd.6 class ssd
> device 7 osd.7 class ssd
> device 8 osd.8 class ssd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class ssd
> device 13 osd.13 class ssd
> device 14 osd.14 class ssd
> device 15 osd.15 class hdd
> device 16 osd.16 class hdd
> device 17 osd.17 class hdd
>
>   host a1-1 {
>       id -1
>       alg straw
>       hash 0
>       item osd.0 weight 1.00
>       item osd.1 weight 1.00
>       item osd.2 weight 1.00
>   }
>
>   host a1-2 {
>       id -2
>       alg straw
>       hash 0
>       item osd.3 weight 1.00
>       item osd.4 weight 1.00
>       item osd.5 weight 1.00
>   }
>
>   host a2-1 {
>       id -3
>       alg straw
>       hash 0
>       item osd.6 weight 1.00
>       item osd.7 weight 1.00
>       item osd.8 weight 1.00
>   }
>
>   host a2-2 {
>       id -4
>       alg straw
>       hash 0
>       item osd.9 weight 1.00
>       item osd.10 weight 1.00
>       item osd.11 weight 1.00
>   }
>
>   host a3-1 {
>       id -5
>       alg straw
>       hash 0
>       item osd.12 weight 1.00
>       item osd.13 weight 1.00
>       item osd.14 weight 1.00
>   }
>
>   host a3-2 {
>       id -6
>       alg straw
>       hash 0
>       item osd.15 weight 1.00
>       item osd.16 weight 1.00
>       item osd.17 weight 1.00
>   }
>
>   rack a1 {
>       id -7
>       alg straw
>       hash 0
>       item a1-1 weight 3.0
>       item a1-2 weight 3.0
>   }
>
>   rack a2 {
>       id -8
>       alg straw
>       hash 0
>       item a2-1 weight 3.0
>       item a2-2 weight 3.0
>   }
>
>   rack a3 {
>       id -9
>       alg straw
>       hash 0
>       item a3-1 weight 3.0
>       item a3-2 weight 3.0
>   }
>
>   row a {
>       id -10
>       alg straw
>       hash 0
>       item a1 weight 6.0
>       item a2 weight 6.0
>       item a3 weight 6.0
>   }
>
>   rule ssd {
>       id 1
>       type replicated
>       min_size 2
>       max_size 11
>       step take a class ssd
>       step chooseleaf firstn 0 type rack
>       step emit
>   }
>
>   rule hdd {
>       id 2
>       type replicated
>       min_size 2
>       max_size 11
>       step take a class hdd
>       step chooseleaf firstn 0 type rack
>       step emit
>   }
>
>
> Are the two rules correct?
>
> The days when you needed to edit the CRUSH map by hand are gone. Manual
> editing, even in your case, has already led to errors.
>
> # create new datacenter and move it to default root
> ceph osd crush add-bucket new_datacenter datacenter
> ceph osd crush move new_datacenter root=default
> # create our racks
> ceph osd crush add-bucket rack_a1 rack
> ceph osd crush add-bucket rack_a2 rack
> ceph osd crush add-bucket rack_a3 rack
> # move our racks to our datacenter
> ceph osd crush move rack_a1 datacenter=new_datacenter
> ceph osd crush move rack_a2 datacenter=new_datacenter
> ceph osd crush move rack_a3 datacenter=new_datacenter
> # create our hosts
> ceph osd crush add-bucket host_a1-1 host
> ceph osd crush add-bucket host_a1-2 host
> ceph osd crush add-bucket host_a2-1 host
> ceph osd crush add-bucket host_a2-2 host
> ceph osd crush add-bucket host_a3-1 host
> ceph osd crush add-bucket host_a3-2 host
> # and move them to their racks
> ceph osd crush move host_a1-1 rack=rack_a1
> ceph osd crush move host_a1-2 rack=rack_a1
> ceph osd crush move host_a2-1 rack=rack_a2
> ceph osd crush move host_a2-2 rack=rack_a2
> ceph osd crush move host_a3-1 rack=rack_a3
> ceph osd crush move host_a3-2 rack=rack_a3
> # now it's time to deploy the OSDs. When the OSDs are 'up' and 'in' and the
> # proper class is assigned, we can move them to their hosts. If a class is
> # wrong, e.g. 'nvme' is detected as 'ssd', we can rewrite the device class
> # like this:
> ceph osd crush rm-device-class osd.5
> ceph osd crush set-device-class nvme osd.5
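> # a quick optional check, not strictly required: list the defined device
> # classes and the OSDs currently assigned to one of them
> ceph osd crush class ls
> ceph osd crush class ls-osd ssd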
> # okay, `ceph osd tree` shows our OSDs with their device classes; move them
> # to their hosts:
> ceph osd crush move osd.0 host=host_a1-1
> ceph osd crush move osd.1 host=host_a1-1
> ceph osd crush move osd.2 host=host_a1-1
> ceph osd crush move osd.3 host=host_a1-2
> ceph osd crush move osd.4 host=host_a1-2
> ceph osd crush move osd.5 host=host_a1-2
> # ...and so on for the remaining OSDs and hosts
> # when this is done we should reweight the OSDs in the CRUSH map
> # the SSD drives are 960 GB
> ceph osd crush reweight osd.0 0.960
> ceph osd crush reweight osd.1 0.960
> ceph osd crush reweight osd.2 0.960
> # the HDD drives are 6 TB
> ceph osd crush reweight osd.3 5.5
> ceph osd crush reweight osd.4 5.5
> ceph osd crush reweight osd.5 5.5
> # ...and so on for the remaining OSDs
> # the crush map is ready, now it's time for the crush rules
> ## New replication rules with device classes
> ceph osd crush rule create-replicated replicated_racks_hdd default rack hdd
> ceph osd crush rule create-replicated replicated_racks_ssd default rack ssd
> # create new pool with predefined crush rule
> ceph osd pool create replicated_rbd_hdd 128 128 replicated replicated_racks_hdd
> # our failure domain is rack
> ceph osd pool set replicated_rbd_hdd min_size 2
> ceph osd pool set replicated_rbd_hdd size 3
> ceph osd pool application enable replicated_rbd_hdd rbd
> # or assign crush rule to existing pool
> ceph osd pool set replicated_rbd_ssd crush_rule replicated_racks_ssd
>
> As you can see, when you use the cli your values are validated before they
> are applied, which helps avoid human mistakes. You can inspect every change
> online; if something looks wrong you have easy access to 'ceph osd tree',
> 'ceph osd pool ls detail' and 'ceph osd crush rule dump'. I hope this helps
> newcomers understand crush.
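>
> For example, a quick sanity check after the steps above could look like this
> (just a sketch, reusing the pool and rule names from the commands above):
>
> # hierarchy, weights and device classes
> ceph osd tree
> # size, min_size and crush_rule of every pool
> ceph osd pool ls detail
> # the 'take' step and the chooseleaf failure domain of a rule
> ceph osd crush rule dump replicated_racks_hdd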
>
> k
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
