Hi Michel,

From your description, I understand you have a critical-data pool that should be hosted on the critical-power row, but you may want to use these same OSDs to store PGs from other pools as well. As you have noticed, this will lead to having more PGs on the critical-power OSDs. However, weight is an OSD-level parameter: I am not sure lowering it will preferentially evacuate the PGs of the non-critical pools...

We use device classes and entry points (e.g., room, row, ...) in the crush tree to define crush rules and allocate pools to specific OSDs. I find device classes more practical (personal opinion) for reserving some OSDs for specific pools/applications if you have OSDs spread all over the crush tree.
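
For example, something along these lines (the class, rule, and pool names below are just placeholders, and the OSD ids are arbitrary):

# tag the OSDs reserved for the critical pool with a custom device class
# (the existing class, e.g. hdd, has to be removed before setting a new one)
ceph osd crush rm-device-class osd.1 osd.2
ceph osd crush set-device-class critical osd.1 osd.2

# replicated rule that only picks OSDs of that class, one per host,
# starting from whatever entry point you choose (here the default root)
ceph osd crush rule create-replicated critical-rule default host critical

# assign the critical pool to the new rule
ceph osd pool set critical-pool crush_rule critical-rule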

Is the undersized PG from the critical or non-critical pool?

Cheers,
Enrico


On 4/16/25 14:01, Michel Jouvin wrote:
Hi Anthony,

No, we use the same mons (which are also backed by UPS/Diesel). The idea behind doing this was to allow sharing the OSDs between the pool using this "critical area" (hence OSDs located only in this row) and the other, normal pools, to avoid dedicating a potentially large storage volume to this critical area, which doesn't require much. Hence also the choice to reweight the OSDs in this row so that they are used less than the other OSDs by the normal pools, to avoid exploding the number of PGs on these OSDs.

I am not sure that we can use a custom device class to achieve exactly what we had in mind, as it will not allow sharing an OSD between critical and non-critical pools. But it may in fact be a better approach: dedicate only a fraction of the OSDs on each server in the "critical row" to these pools and use the other OSDs on these servers for the normal pools, without any reweighting. Thanks for the idea.
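
If I understand the idea correctly, it would look something like this (the OSD ids are just a couple picked from each host of row-01 as an example, the class and rule names are made up, and I assume the rule root can be the row bucket, as in our current rule):

# give a dedicated class to a fraction of the OSDs on each host of row-01
# (the existing hdd class has to be removed before setting the new one)
ceph osd crush rm-device-class osd.1 osd.2 osd.49 osd.50 osd.48 osd.52
ceph osd crush set-device-class critical osd.1 osd.2 osd.49 osd.50 osd.48 osd.52

# rule for the critical pools: rooted at row-01, restricted to the new class;
# the remaining hdd OSDs in row-01 keep their full weight and stay available
# to the normal pools through the default rule
ceph osd crush rule create-replicated ha-critical row-01 host critical
ceph osd pool set <critical pool> crush_rule ha-critical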

We may also try to bump the number of retries to see if it has an effect.
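
If I understand correctly, that would be something like the following (assuming the tunable in question is choose_total_tries, which I believe defaults to 50):

# extract and decompile the current crush map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# in crushmap.txt, change:
#   tunable choose_total_tries 50
# to:
#   tunable choose_total_tries 100

# recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new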

Best regards,

Michel

On 16/04/2025 at 13:16, Anthony D'Atri wrote:
First time I recall anyone trying this. Thoughts:

* Manually edit the crush map and bump retries from 50 to 100
* Better yet, give those OSDs a custom device class and change the CRUSH rule to use that and the default root.

Do you also constrain mons to those systems?

On Apr 16, 2025, at 6:41 AM, Michel Jouvin <michel.jou...@ijclab.in2p3.fr> wrote:

Hi,

We have a use case where we would like to restrict some pools to a subset of the OSDs located in a particular section of the crush map hierarchy (OSDs backed by UPS/Diesel). We tried to define, for these (replica 3) pools, a specific crush rule with the root parameter set to a specific row (which contains 3 OSD servers with about 10 OSDs each). At the beginning it worked, but after some time (probably after reweighting the OSDs in this row to reduce the number of PGs from other pools), a few PGs are active+clean+remapped and 1 is undersized.

'ceph osd pg dump | grep remapped' gives output similar to the following for each remapped PG:

20.1ae       648                   0         0        648 0 374416846            0           0    438      1404 438 active+clean+remapped  2025-04-16T07:19:40.507778+0000 43117'1018433    48443:1131738       [70,58]          70 [70,58,45]              70   43117'1018433 2025-04-16T07:19:40.507443+0000    43117'1018433 2025-04-16T07:19:40.507443+0000              0 15  periodic scrub scheduled @ 2025-04-17T19:18:23.470846+0000 648                0

We can see that we currently have 3 replicas but that Ceph would like to move to 2... (the undersized PG currently has only 2 replicas, for an unknown reason, probably the same one).

Is it wrong to do what we did, i.e., use a row as the crush rule root parameter? If not, where could we find more information about the cause?

Thanks in advance for any help. Best regards,

Michel

--------------------- Crush rule used -----------------

{
     "rule_id": 2,
     "rule_name": "ha-replicated_ruleset",
     "type": 1,
     "steps": [
         {
             "op": "take",
             "item": -22,
             "item_name": "row-01~hdd"
         },
         {
             "op": "chooseleaf_firstn",
             "num": 0,
             "type": "host"
         },
         {
             "op": "emit"
         }
     ]
}


------------------- Beginning of the CRUSH tree -------------------

ID   CLASS  WEIGHT     TYPE NAME STATUS REWEIGHT  PRI-AFF
  -1         843.57141  root default
-19         843.57141      datacenter bat.206
-21         283.81818          row row-01
-15          87.32867              host cephdevel-76079
   1    hdd    7.27739                  osd.1 up 0.50000  1.00000
   2    hdd    7.27739                  osd.2 up 0.50000  1.00000
  14    hdd    7.27739                  osd.14 up 0.50000  1.00000
  39    hdd    7.27739                  osd.39 up 0.50000  1.00000
  40    hdd    7.27739                  osd.40 up 0.50000  1.00000
  41    hdd    7.27739                  osd.41 up 0.50000  1.00000
  42    hdd    7.27739                  osd.42 up 0.50000  1.00000
  43    hdd    7.27739                  osd.43 up 0.50000  1.00000
  44    hdd    7.27739                  osd.44 up 0.50000  1.00000
  45    hdd    7.27739                  osd.45 up 0.50000  1.00000
  46    hdd    7.27739                  osd.46 up 0.50000  1.00000
  47    hdd    7.27739                  osd.47 up 0.50000  1.00000
  -3          94.60606              host cephdevel-76154
  49    hdd    7.27739                  osd.49 up 0.50000  1.00000
  50    hdd    7.27739                  osd.50 up 0.50000  1.00000
  51    hdd    7.27739                  osd.51 up 0.50000  1.00000
  66    hdd    7.27739                  osd.66 up 0.50000  1.00000
  67    hdd    7.27739                  osd.67 up 0.50000  1.00000
  68    hdd    7.27739                  osd.68 up 0.50000  1.00000
  69    hdd    7.27739                  osd.69 up 0.50000  1.00000
  70    hdd    7.27739                  osd.70 up 0.50000  1.00000
  71    hdd    7.27739                  osd.71 up 0.50000  1.00000
  72    hdd    7.27739                  osd.72 up 0.50000  1.00000
  73    hdd    7.27739                  osd.73 up 0.50000  1.00000
  74    hdd    7.27739                  osd.74 up 0.50000  1.00000
  75    hdd    7.27739                  osd.75 up 0.50000  1.00000
  -4         101.88345              host cephdevel-76204
  48    hdd    7.27739                  osd.48 up 0.50000  1.00000
  52    hdd    7.27739                  osd.52 up 0.50000  1.00000
  53    hdd    7.27739                  osd.53 up 0.50000  1.00000
  54    hdd    7.27739                  osd.54 up 0.50000  1.00000
  56    hdd    7.27739                  osd.56 up 0.50000  1.00000
  57    hdd    7.27739                  osd.57 up 0.50000  1.00000
  58    hdd    7.27739                  osd.58 up 0.50000  1.00000
  59    hdd    7.27739                  osd.59 up 0.50000  1.00000
  60    hdd    7.27739                  osd.60 up 0.50000  1.00000
  61    hdd    7.27739                  osd.61 up 0.50000  1.00000
  62    hdd    7.27739                  osd.62 up 0.50000  1.00000
  63    hdd    7.27739                  osd.63 up 0.50000  1.00000
  64    hdd    7.27739                  osd.64 up 0.50000  1.00000
  65    hdd    7.27739                  osd.65 up 0.50000  1.00000
-23         203.16110          row row-02
-13          87.32867              host cephdevel-76213
  27    hdd    7.27739                  osd.27 up 1.00000  1.00000
  28    hdd    7.27739                  osd.28 up 1.00000  1.00000
  29    hdd    7.27739                  osd.29 up 1.00000  1.00000
  30    hdd    7.27739                  osd.30 up 1.00000  1.00000
  31    hdd    7.27739                  osd.31 up 1.00000  1.00000
  32    hdd    7.27739                  osd.32 up 1.00000  1.00000
  33    hdd    7.27739                  osd.33 up 1.00000  1.00000
  34    hdd    7.27739                  osd.34 up 1.00000  1.00000
  35    hdd    7.27739                  osd.35 up 1.00000  1.00000
  36    hdd    7.27739                  osd.36 up 1.00000  1.00000
  37    hdd    7.27739                  osd.37 up 1.00000  1.00000
  38    hdd    7.27739                  osd.38 up 1.00000  1.00000
......

--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management  - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
