Hi Konstantin,

I could only dream of reading this answer! Thank you so much!!!
Regards,
Cody

On Tue, Aug 21, 2018 at 8:50 AM Konstantin Shalygin <k0...@k0ste.ru> wrote:
>
> On 08/20/2018 08:15 PM, Cody wrote:
>
> Hi Konstantin,
>
> Thank you for looking into my question.
>
> I was trying to understand how to set up CRUSH hierarchies and set
> rules for different failure domains. I am particularly confused by the
> 'step take' and 'step choose|chooseleaf' settings, which I think are
> the keys to defining a failure domain in a CRUSH rule.
>
> My hypothetical cluster is made of 3 racks with 2 hosts in each. One
> host has 3 SSD-based OSDs and the other has 3 HDD-based OSDs. I wanted
> to create two rules: one that uses SSDs only and another that uses
> HDDs only. Both rules should have a rack-level failure domain.
>
> I have attached a diagram that may help to explain my setup. The
> following is my CRUSH map configuration (with all typos fixed) for
> review:
>
> device 0 osd.0 class ssd
> device 1 osd.1 class ssd
> device 2 osd.2 class ssd
> device 3 osd.3 class hdd
> device 4 osd.4 class hdd
> device 5 osd.5 class hdd
> device 6 osd.6 class ssd
> device 7 osd.7 class ssd
> device 8 osd.8 class ssd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class ssd
> device 13 osd.13 class ssd
> device 14 osd.14 class ssd
> device 15 osd.15 class hdd
> device 16 osd.17 class hdd
> device 17 osd.17 class hdd
>
> host a1-1 {
>     id -1
>     alg straw
>     hash 0
>     item osd.0 weight 1.00
>     item osd.1 weight 1.00
>     item osd.2 weight 1.00
> }
>
> host a1-2 {
>     id -2
>     alg straw
>     hash 0
>     item osd.3 weight 1.00
>     item osd.4 weight 1.00
>     item osd.5 weight 1.00
> }
>
> host a2-1 {
>     id -3
>     alg straw
>     hash 0
>     item osd.6 weight 1.00
>     item osd.7 weight 1.00
>     item osd.8 weight 1.00
> }
>
> host a2-2 {
>     id -4
>     alg straw
>     hash 0
>     item osd.9 weight 1.00
>     item osd.10 weight 1.00
>     item osd.11 weight 1.00
> }
>
> host a3-1 {
>     id -5
>     alg straw
>     hash 0
>     item osd.12 weight 1.00
>     item osd.13 weight 1.00
>     item osd.14 weight 1.00
> }
>
> host a3-2 {
>     id -6
>     alg straw
>     hash 0
>     item osd.15 weight 1.00
>     item osd.16 weight 1.00
>     item osd.17 weight 1.00
> }
>
> rack a1 {
>     id -7
>     alg straw
>     hash 0
>     item a1-1 weight 3.0
>     item a1-2 weight 3.0
> }
>
> rack a2 {
>     id -5
>     alg straw
>     hash 0
>     item a2-1 weight 3.0
>     item a2-2 weight 3.0
> }
>
> rack a3 {
>     id -6
>     alg straw
>     hash 0
>     item a3-1 weight 3.0
>     item a3-2 weight 3.0
> }
>
> row a {
>     id -7
>     alg straw
>     hash 0
>     item a1 6.0
>     item a2 6.0
>     item a3 6.0
> }
>
> rule ssd {
>     id 1
>     type replicated
>     min_size 2
>     max_size 11
>     step take a class ssd
>     step chooseleaf firstn 0 type rack
>     step emit
> }
>
> rule hdd {
>     id 2
>     type replicated
>     min_size 2
>     max_size 11
>     step take a class hdd
>     step chooseleaf firstn 0 type rack
>     step emit
> }
>
>
> Are the two rules correct?
>
>
> The times when you needed to edit the CRUSH map manually are gone.
> Manual editing, even in your case, has already led to errors.
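If the map is edited by hand anyway, crushtool can check it before it ever
reaches the cluster. A minimal sketch, assuming the edited text map is kept
under /tmp (the paths are only for illustration, not part of the thread):

  # /tmp paths below are just an example
  # pull the live map and decompile it to text
  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
  # edit /tmp/crushmap.txt, then recompile; compile errors surface here
  # instead of in the cluster
  crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
  # inject the map only once it compiles cleanly
  ceph osd setcrushmap -i /tmp/crushmap.new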
>
> # create a new datacenter and move it to the default root
> ceph osd crush add-bucket new_datacenter datacenter
> ceph osd crush move new_datacenter root=default
> # create our racks
> ceph osd crush add-bucket rack_a1 rack
> ceph osd crush add-bucket rack_a2 rack
> ceph osd crush add-bucket rack_a3 rack
> # move our racks into our datacenter
> ceph osd crush move rack_a1 datacenter=new_datacenter
> ceph osd crush move rack_a2 datacenter=new_datacenter
> ceph osd crush move rack_a3 datacenter=new_datacenter
> # create our hosts
> ceph osd crush add-bucket host_a1-1 host
> ceph osd crush add-bucket host_a1-2 host
> ceph osd crush add-bucket host_a2-1 host
> ceph osd crush add-bucket host_a2-2 host
> ceph osd crush add-bucket host_a3-1 host
> ceph osd crush add-bucket host_a3-2 host
> # and move them into the racks
> ceph osd crush move host_a1-1 rack=rack_a1
> ceph osd crush move host_a1-2 rack=rack_a1
> ceph osd crush move host_a2-1 rack=rack_a2
> ceph osd crush move host_a2-2 rack=rack_a2
> ceph osd crush move host_a3-1 rack=rack_a3
> ceph osd crush move host_a3-2 rack=rack_a3
> # now it's time to deploy the osds. When the osds are 'up' and 'in' and the
> # proper class is assigned, we can move them to their hosts. If the class
> # is wrong, e.g. an 'nvme' device is detected as 'ssd', we can rewrite the
> # device class like this:
> ceph osd crush rm-device-class osd.5
> ceph osd crush set-device-class nvme osd.5
> # okay, `ceph osd tree` shows our osds with device classes; move them to
> # their hosts:
> ceph osd crush move osd.0 host=host_a1-1
> ceph osd crush move osd.1 host=host_a1-1
> ceph osd crush move osd.2 host=host_a1-1
> ceph osd crush move osd.3 host=host_a1-2
> ceph osd crush move osd.4 host=host_a1-2
> ceph osd crush move osd.5 host=host_a1-2
> <etc>...
> # when this is done we should reweight the osds in the crush map
> # ssd drives are 960 GB
> ceph osd crush reweight osd.0 0.960
> ceph osd crush reweight osd.1 0.960
> ceph osd crush reweight osd.2 0.960
> # hdd drives are 6 TB
> ceph osd crush reweight osd.3 5.5
> ceph osd crush reweight osd.4 5.5
> ceph osd crush reweight osd.5 5.5
> <etc>...
> # the crush map is ready, now it's time for the crush rules
> ## new replicated rules with device classes
> ceph osd crush rule create-replicated replicated_racks_hdd default rack hdd
> ceph osd crush rule create-replicated replicated_racks_ssd default rack ssd
> # create a new pool with a predefined crush rule
> ceph osd pool create replicated_rbd_hdd 128 128 replicated replicated_racks_hdd
> # our failure domain is rack
> ceph osd pool set replicated_rbd_hdd min_size 2
> ceph osd pool set replicated_rbd_hdd size 3
> ceph osd pool application enable replicated_rbd_hdd rbd
> # or assign a crush rule to an existing pool
> ceph osd pool set replicated_rbd_ssd crush_rule replicated_racks_ssd
>
>
> As you can see, when you use the CLI your values are validated before they
> are applied, which helps avoid human mistakes. You can watch every change
> as it happens, and if something goes wrong you have easy access to
> 'ceph osd tree', 'ceph osd pool ls detail' and 'ceph osd crush rule dump'.
> I hope this helps newcomers understand CRUSH.
>
>
> k

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
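A quick way to double-check the result of the CLI workflow above is to dry-run
the new rules against the compiled map. A minimal sketch, assuming a
Luminous-or-later cluster (device classes require it anyway) and the rule
names from Konstantin's commands; the rule id and /tmp path are illustrative:

  # show the per-class shadow hierarchies that back the two rules
  ceph osd crush tree --show-shadow
  # confirm the rules exist and note their ids
  ceph osd crush rule dump replicated_racks_ssd
  ceph osd crush rule dump replicated_racks_hdd
  # export the live map and simulate placements; rule id 1 is only an
  # example, use the id reported by the dump above
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -i /tmp/crushmap --test --rule 1 --num-rep 3 --show-mappings
  # with a rack failure domain, no line of output should place two
  # replicas of the same PG in the same rack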