> 
> No, we use the same mons (which are also backed by UPS/diesel). The idea 
> behind this was to allow sharing the OSDs between the pools that use this 
> "critical area" (hence OSDs located only in this row) and the other, normal 
> pools, to avoid dedicating a potentially large storage volume to this 
> critical area, which doesn't require much.

Groovy.  Since you were clearly targeting higher availability for this subset 
of data, I wanted to be sure that your efforts weren’t confounded by the 
potential for the mons to not reach quorum, which would make the CRUSH hoops 
moot.
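
If you ever want to double-check that side, a couple of read-only commands 
show the mon quorum at a glance (nothing here is specific to your setup, just 
the stock CLI):

    ceph mon stat                        # one-line summary of mons and quorum
    ceph quorum_status -f json-pretty    # detailed quorum view, incl. the leader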

> Hence also the choice to reweight the OSDs in this row so that the normal 
> pools use them less than the other OSDs, to avoid exploding the number of 
> PGs on these OSDs.
> 
> I am not sure that we can use a custom device class to achieve what we had 
> in mind, as it would not allow sharing an OSD between critical and 
> non-critical pools.

The above two statements seem a bit at odds with each other.  In the first 
you're discouraging sharing, and those OSDs may fill up as your critical 
dataset grows; in the second you want to share.


> But it may in fact be a better way: dedicating only a fraction of the OSDs 
> on each server in the "critical row" to these pools and using the other 
> OSDs on these servers for normal pools, without any reweighting. Thanks for 
> the idea.

You bet.  It seems like a cleaner approach.  You might consider a reclassify 
operation

https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes
to update the CRUSH map and rules at the same time.
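
As a rough sketch of the device-class route (the class name "critical", the 
rule name "critical-replicated", and the OSD IDs are just placeholders to 
adapt, not a recommendation):

    # Re-tag the subset of OSDs you want to dedicate on each row-01 server
    ceph osd crush rm-device-class osd.1 osd.2 osd.14
    ceph osd crush set-device-class critical osd.1 osd.2 osd.14

    # Replicated rule constrained to that class, failure domain = host
    ceph osd crush rule create-replicated critical-replicated default host critical

    # Point the critical pools at the new rule
    ceph osd pool set <critical-pool> crush_rule critical-replicated

The crushtool --reclassify path in the link is mostly interesting when you 
need to convert existing buckets and rules in one pass while minimizing data 
movement; for a handful of freshly tagged OSDs the plain CLI above may be 
simpler.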

> 
> We may also try to bump the number of retries to see if it has an effect.
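
For reference, if you do try that, the knob lives in the decompiled CRUSH map; 
the generic round-trip looks like this (rule id 2 below matches your 
ha-replicated_ruleset):

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # in crush.txt, raise e.g.  "tunable choose_total_tries 50"  to 100,
    # or add a per-rule step such as "step set_chooseleaf_tries 10"
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --rule 2 --num-rep 3 --show-bad-mappings
    ceph osd setcrushmap -i crush.new

The --test run is a cheap way to see whether the bad mappings disappear before 
you inject the new map.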
> 
> Best regards,
> 
> Michel
> 
> On 16/04/2025 at 13:16, Anthony D'Atri wrote:
>> First time I recall anyone trying this.  Thoughts:
>> 
>> * Manually edit the crush map and bump retries from 50 to 100
>> * Better yet, give those OSDs a custom device class and change the CRUSH 
>> rule to use that and the default root.
>> 
>> Do you also constrain mons to those systems?
>> 
>>> On Apr 16, 2025, at 6:41 AM, Michel Jouvin <michel.jou...@ijclab.in2p3.fr> 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> We have a use case where we would like to restrict some pools to a subset 
>>> of the OSDs located in a particular section of the CRUSH map hierarchy 
>>> (OSDs backed by UPS/diesel). We tried to define for these (replica 3) 
>>> pools a specific CRUSH rule with the root parameter set to a specific row 
>>> (which contains 3 OSD servers with ~10 OSDs each). At the beginning it 
>>> worked, but after some time (probably after doing a reweight on the OSDs 
>>> in this row to reduce the number of PGs from other pools), a few PGs are 
>>> active+clean+remapped and 1 is undersized.
>>> 
>>> 'ceph osd pg dump|grep remapped' gives output similar to the following 
>>> for each remapped PG:
>>> 
>>> 20.1ae       648                   0         0        648 0   374416846     
>>>        0           0    438      1404 438       active+clean+remapped  
>>> 2025-04-16T07:19:40.507778+0000 43117'1018433    48443:1131738       
>>> [70,58]          70 [70,58,45]              70   43117'1018433 
>>> 2025-04-16T07:19:40.507443+0000    43117'1018433 
>>> 2025-04-16T07:19:40.507443+0000              0 15  periodic scrub scheduled 
>>> @ 2025-04-17T19:18:23.470846+0000 648                0
>>> 
>>> We can see that we currently have 3 replicas but that Ceph would like to 
>>> move to 2... (the undersized PG currently has only 2 replicas, for an 
>>> unknown reason, probably the same one).
>>> 
>>> Is it wrong to do what we did, i.e. use a row as the CRUSH rule root 
>>> parameter? If not, where could we find more information about the cause?
>>> 
>>> Thanks in advance for any help. Best regards,
>>> 
>>> Michel
>>> 
>>> --------------------- Crush rule used -----------------
>>> 
>>> {
>>>     "rule_id": 2,
>>>     "rule_name": "ha-replicated_ruleset",
>>>     "type": 1,
>>>     "steps": [
>>>         {
>>>             "op": "take",
>>>             "item": -22,
>>>             "item_name": "row-01~hdd"
>>>         },
>>>         {
>>>             "op": "chooseleaf_firstn",
>>>             "num": 0,
>>>             "type": "host"
>>>         },
>>>         {
>>>             "op": "emit"
>>>         }
>>>     ]
>>> }
>>> 
>>> 
>>> ------------------- Beginning of the CRUSH tree -------------------
>>> 
>>> ID   CLASS  WEIGHT     TYPE NAME                         STATUS  REWEIGHT  PRI-AFF
>>>  -1         843.57141  root default
>>> -19         843.57141      datacenter bat.206
>>> -21         283.81818          row row-01
>>> -15          87.32867              host cephdevel-76079
>>>   1    hdd    7.27739                  osd.1                 up   0.50000  1.00000
>>>   2    hdd    7.27739                  osd.2                 up   0.50000  1.00000
>>>  14    hdd    7.27739                  osd.14                up   0.50000  1.00000
>>>  39    hdd    7.27739                  osd.39                up   0.50000  1.00000
>>>  40    hdd    7.27739                  osd.40                up   0.50000  1.00000
>>>  41    hdd    7.27739                  osd.41                up   0.50000  1.00000
>>>  42    hdd    7.27739                  osd.42                up   0.50000  1.00000
>>>  43    hdd    7.27739                  osd.43                up   0.50000  1.00000
>>>  44    hdd    7.27739                  osd.44                up   0.50000  1.00000
>>>  45    hdd    7.27739                  osd.45                up   0.50000  1.00000
>>>  46    hdd    7.27739                  osd.46                up   0.50000  1.00000
>>>  47    hdd    7.27739                  osd.47                up   0.50000  1.00000
>>>  -3          94.60606              host cephdevel-76154
>>>  49    hdd    7.27739                  osd.49                up   0.50000  1.00000
>>>  50    hdd    7.27739                  osd.50                up   0.50000  1.00000
>>>  51    hdd    7.27739                  osd.51                up   0.50000  1.00000
>>>  66    hdd    7.27739                  osd.66                up   0.50000  1.00000
>>>  67    hdd    7.27739                  osd.67                up   0.50000  1.00000
>>>  68    hdd    7.27739                  osd.68                up   0.50000  1.00000
>>>  69    hdd    7.27739                  osd.69                up   0.50000  1.00000
>>>  70    hdd    7.27739                  osd.70                up   0.50000  1.00000
>>>  71    hdd    7.27739                  osd.71                up   0.50000  1.00000
>>>  72    hdd    7.27739                  osd.72                up   0.50000  1.00000
>>>  73    hdd    7.27739                  osd.73                up   0.50000  1.00000
>>>  74    hdd    7.27739                  osd.74                up   0.50000  1.00000
>>>  75    hdd    7.27739                  osd.75                up   0.50000  1.00000
>>>  -4         101.88345              host cephdevel-76204
>>>  48    hdd    7.27739                  osd.48                up   0.50000  1.00000
>>>  52    hdd    7.27739                  osd.52                up   0.50000  1.00000
>>>  53    hdd    7.27739                  osd.53                up   0.50000  1.00000
>>>  54    hdd    7.27739                  osd.54                up   0.50000  1.00000
>>>  56    hdd    7.27739                  osd.56                up   0.50000  1.00000
>>>  57    hdd    7.27739                  osd.57                up   0.50000  1.00000
>>>  58    hdd    7.27739                  osd.58                up   0.50000  1.00000
>>>  59    hdd    7.27739                  osd.59                up   0.50000  1.00000
>>>  60    hdd    7.27739                  osd.60                up   0.50000  1.00000
>>>  61    hdd    7.27739                  osd.61                up   0.50000  1.00000
>>>  62    hdd    7.27739                  osd.62                up   0.50000  1.00000
>>>  63    hdd    7.27739                  osd.63                up   0.50000  1.00000
>>>  64    hdd    7.27739                  osd.64                up   0.50000  1.00000
>>>  65    hdd    7.27739                  osd.65                up   0.50000  1.00000
>>> -23         203.16110          row row-02
>>> -13          87.32867              host cephdevel-76213
>>>  27    hdd    7.27739                  osd.27                up   1.00000  1.00000
>>>  28    hdd    7.27739                  osd.28                up   1.00000  1.00000
>>>  29    hdd    7.27739                  osd.29                up   1.00000  1.00000
>>>  30    hdd    7.27739                  osd.30                up   1.00000  1.00000
>>>  31    hdd    7.27739                  osd.31                up   1.00000  1.00000
>>>  32    hdd    7.27739                  osd.32                up   1.00000  1.00000
>>>  33    hdd    7.27739                  osd.33                up   1.00000  1.00000
>>>  34    hdd    7.27739                  osd.34                up   1.00000  1.00000
>>>  35    hdd    7.27739                  osd.35                up   1.00000  1.00000
>>>  36    hdd    7.27739                  osd.36                up   1.00000  1.00000
>>>  37    hdd    7.27739                  osd.37                up   1.00000  1.00000
>>>  38    hdd    7.27739                  osd.38                up   1.00000  1.00000
>>> ......

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
