[ceph-users] Re: crush rule: is it valid to use a non root element for the root parameter?

Michel Jouvin Tue, 29 Apr 2025 04:58:28 -0700

Hi Enrico,

Thanks for confirming Anthony's suggestion. It is what we adopted, and I agree it is much easier and more efficient than our original idea.


Best regards,

Michel
On 29 April 2025 13:25:22, Enrico Bocchi <enrico.boc...@cern.ch> wrote:

Hi Michel,

From your description, I understand you have a critical-data pool that
should be hosted on the critical-power row, but you may want to use
these same OSDs to store PGs from other pools as well.
As you have noticed this will lead to having more PGs on the
critical-power OSDs. However, the weight is an OSD parameter: I am not
sure lowering the weight will lead to evacuating PGs of non-critical
pools...

We use device classes and entry points (e.g., room, row, ...) in the
crush tree to define crush rules and allocate pools to specific OSDs. I
find device classes more practical (personal opinion) to reserve some
OSDs for specific pools/applications if you have OSDs all over the crush
tree.

Is the undersized PG from the critical or non-critical pool?

Cheers,
Enrico


On 4/16/25 14:01, Michel Jouvin wrote:

Hi Anthony,

No, we use the same mons (which are also backed up by UPS/Diesel). The
idea behind this was to share the OSDs between the pool using this
"critical area" (thus OSDs located only in this row) and the other
normal pools, to avoid dedicating a potentially large storage volume to
a critical area that doesn't require much. Hence also the choice to
reweight the OSDs in this row so that normal pools use them less than
other OSDs, to avoid exploding the number of PGs on these OSDs.
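
To put a rough number on the reweighting described above, the sketch below estimates the share of data that row-01 receives from pools rooted at the whole tree, before and after a reweight of 0.5. The weights are copied from the CRUSH tree quoted in this thread. It assumes that every OSD outside row-01 keeps a reweight of 1.0 (the quoted tree is truncated, so this is not confirmed) and it ignores host-level placement constraints, so treat the result as an approximation only.

```python
# Figures copied from the CRUSH tree quoted in this thread.
ROOT_WEIGHT = 843.57141    # root default
ROW01_WEIGHT = 283.81818   # row row-01
ROW01_REWEIGHT = 0.5       # REWEIGHT column of every row-01 OSD


def row_share(row_weight, total_weight, reweight=1.0, others_reweight=1.0):
    """Approximate fraction of data a row gets from pools rooted at the tree.

    Assumption: the effective weight is CRUSH weight times reweight, and
    data spreads proportionally to effective weight.
    """
    row_eff = row_weight * reweight
    rest_eff = (total_weight - row_weight) * others_reweight
    return row_eff / (row_eff + rest_eff)


before = row_share(ROW01_WEIGHT, ROOT_WEIGHT)
after = row_share(ROW01_WEIGHT, ROOT_WEIGHT, reweight=ROW01_REWEIGHT)
print(f"row-01 share without reweight: {before:.1%}")
print(f"row-01 share with reweight 0.5: {after:.1%}")
```

Under these assumptions the row drops from about a third of the data to about a fifth. The reweight value is a property of the OSD, not of a pool, so it also applies when CRUSH places PGs of the pool restricted to this row.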

I am not sure that we can use a custom device class to achieve what we
had in mind, as this will not allow sharing an OSD between critical
and non-critical pools. But it may in fact be a better way: dedicating
only a fraction of the OSDs on each server in the "critical row" to
these pools and using the other OSDs on these servers for normal pools
without any reweighting. Thanks for the idea.

We may also try to bump the number of retries to see if it has an effect.
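
A minimal sketch of the retry bump, assuming the retry count discussed in this thread is the `choose_total_tries` tunable of the decompiled CRUSH map (its usual default of 50 matches the figure quoted in the thread, but the thread does not name the tunable). The helper only rewrites the text of a decompiled map. The file names in the comments and the sample fragment are placeholders, not taken from the cluster discussed here.

```python
import re

# Assumed workflow around this helper (file names are placeholders):
#   ceph osd getcrushmap -o crush.bin
#   crushtool -d crush.bin -o crush.txt
#   ... edit crush.txt, e.g. with bump_total_tries() ...
#   crushtool -c crush.txt -o crush.new
#   crushtool -i crush.new --test --rule 2 --num-rep 3 --show-bad-mappings
#   ceph osd setcrushmap -i crush.new

TUNABLE = re.compile(
    r"^(tunable[ \t]+choose_total_tries[ \t]+)\d+[ \t]*$", re.MULTILINE
)


def bump_total_tries(crushmap_text, tries=100):
    """Return the decompiled map text with choose_total_tries set to `tries`."""
    new_text, count = TUNABLE.subn(
        lambda m: f"{m.group(1)}{tries}", crushmap_text
    )
    if count == 0:
        raise ValueError("no 'tunable choose_total_tries' line found")
    return new_text


# Illustrative fragment of a decompiled map, not the actual cluster map.
sample = """# begin crush map
tunable choose_local_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
"""
print(bump_total_tries(sample))
```

Running the `crushtool --test ... --show-bad-mappings` step on the recompiled map before injecting it shows whether rule 2 still returns fewer than 3 OSDs for some inputs.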

Best regards,

Michel

On 16/04/2025 at 13:16, Anthony D'Atri wrote:

First time I recall anyone trying this. Thoughts:

* Manually edit the crush map and bump retries from 50 to 100
* Better yet, give those OSDs a custom device class and change the
CRUSH rule to use that and the default root.
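
The device-class suggestion above can be sketched as a short command plan. The snippet only builds and prints the `ceph` commands; it does not run them. The class name `critical`, the rule name `ha-critical`, the pool name `critical-data` and the choice of two OSDs per host are all hypothetical and would need to be adapted.

```python
import shlex


def device_class_plan(osd_ids, device_class, rule_name, pool,
                      root="default", failure_domain="host"):
    """Build the ceph commands that move OSDs to a custom device class
    and point a pool at a replicated rule restricted to that class."""
    osds = [f"osd.{i}" for i in osd_ids]
    return [
        # An OSD has a single class, so the current one (hdd) is removed first.
        ["ceph", "osd", "crush", "rm-device-class", *osds],
        ["ceph", "osd", "crush", "set-device-class", device_class, *osds],
        # The rule starts from the default root but only selects OSDs of the class.
        ["ceph", "osd", "crush", "rule", "create-replicated",
         rule_name, root, failure_domain, device_class],
        ["ceph", "osd", "pool", "set", pool, "crush_rule", rule_name],
    ]


# Hypothetical selection: two OSDs on each of the three row-01 hosts.
plan = device_class_plan([1, 2, 49, 50, 48, 52],
                         "critical", "ha-critical", "critical-data")
for argv in plan:
    print(shlex.join(argv))
```

Because an OSD carries only one device class, the OSDs moved to the custom class stop serving pools whose rules select the `hdd` class, so no reweighting is needed to keep other pools off them.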

Do you also constrain mons to those systems?

On Apr 16, 2025, at 6:41 AM, Michel Jouvin
<michel.jou...@ijclab.in2p3.fr> wrote:

Hi,

We have a use case where we would like to restrict some pools to a
subset of the OSDs located in a particular section of the crush map
hierarchy (OSDs backed up by UPS/Diesel). For these (replica 3) pools
we tried to define a specific crush rule with the root parameter set
to a specific row (which contains 3 OSD servers with about 10 OSDs
each). At the beginning it worked, but after some time (probably
after doing a reweight on the OSDs in this row to reduce the number
of PGs from other pools), a few PGs became active+clean+remapped and
1 is undersized.

'ceph pg dump | grep remapped' gives output similar to the following
for each remapped PG:

20.1ae  648  0  0  648  0  374416846  0  0  438  1404  438  active+clean+remapped  2025-04-16T07:19:40.507778+0000  43117'1018433  48443:1131738  [70,58]  70  [70,58,45]  70  43117'1018433  2025-04-16T07:19:40.507443+0000  43117'1018433  2025-04-16T07:19:40.507443+0000  0  15  periodic scrub scheduled @ 2025-04-17T19:18:23.470846+0000  648  0

We can see that we currently have 3 replicas (acting set [70,58,45])
but that Ceph would like to move to 2 (up set [70,58])... (the
undersized PG currently has only 2 replicas for an unknown reason,
probably the same one).
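
As an illustration of the observation above, the sketch below extracts the up and acting sets from the quoted row and maps each OSD to its host using the CRUSH tree quoted in this thread. It assumes the usual `ceph pg dump` column order, in which the first bracketed list is the up set and the second is the acting set. The row is truncated after the acting primary for brevity.

```python
import re

# Row quoted in this thread, truncated after the acting primary column.
ROW = ("20.1ae 648 0 0 648 0 374416846 0 0 438 1404 438 "
       "active+clean+remapped 2025-04-16T07:19:40.507778+0000 "
       "43117'1018433 48443:1131738 [70,58] 70 [70,58,45] 70")

# Host of each OSD involved, read from the CRUSH tree quoted in this thread.
HOST_OF = {
    45: "cephdevel-76079",
    58: "cephdevel-76204",
    70: "cephdevel-76154",
}


def up_and_acting(row):
    """Return (up, acting) OSD lists from one line of pg dump output.

    Assumption: the first bracketed list is UP and the second is ACTING.
    """
    sets = re.findall(r"\[([\d,]*)\]", row)
    if len(sets) < 2:
        raise ValueError("expected two bracketed OSD lists")

    def parse(text):
        return [int(x) for x in text.split(",") if x]

    return parse(sets[0]), parse(sets[1])


up, acting = up_and_acting(ROW)
missing = sorted(set(acting) - set(up))
print("up:", up, [HOST_OF[o] for o in up])
print("acting:", acting, [HOST_OF[o] for o in acting])
print("in acting but not in up:", missing, [HOST_OF[o] for o in missing])
```

For this PG the up set has no OSD on cephdevel-76079, which suggests that CRUSH returned a result for only two of the three hosts in the row while the data still sits on osd.45.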

Is it wrong to do what we did, i.e. to use a row as the root parameter
of the crush rule? If not, where could we find more information
about the cause?

Thanks in advance for any help. Best regards,

Michel

--------------------- Crush rule used -----------------

{
    "rule_id": 2,
    "rule_name": "ha-replicated_ruleset",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -22,
            "item_name": "row-01~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}


------------------- Beginning of the CRUSH tree -------------------

ID   CLASS  WEIGHT     TYPE NAME STATUS REWEIGHT  PRI-AFF
 -1         843.57141  root default
-19         843.57141      datacenter bat.206
-21         283.81818          row row-01
-15          87.32867              host cephdevel-76079
  1    hdd    7.27739                  osd.1 up 0.50000  1.00000
  2    hdd    7.27739                  osd.2 up 0.50000  1.00000
 14    hdd    7.27739                  osd.14 up 0.50000  1.00000
 39    hdd    7.27739                  osd.39 up 0.50000  1.00000
 40    hdd    7.27739                  osd.40 up 0.50000  1.00000
 41    hdd    7.27739                  osd.41 up 0.50000  1.00000
 42    hdd    7.27739                  osd.42 up 0.50000  1.00000
 43    hdd    7.27739                  osd.43 up 0.50000  1.00000
 44    hdd    7.27739                  osd.44 up 0.50000  1.00000
 45    hdd    7.27739                  osd.45 up 0.50000  1.00000
 46    hdd    7.27739                  osd.46 up 0.50000  1.00000
 47    hdd    7.27739                  osd.47 up 0.50000  1.00000
 -3          94.60606              host cephdevel-76154
 49    hdd    7.27739                  osd.49 up 0.50000  1.00000
 50    hdd    7.27739                  osd.50 up 0.50000  1.00000
 51    hdd    7.27739                  osd.51 up 0.50000  1.00000
 66    hdd    7.27739                  osd.66 up 0.50000  1.00000
 67    hdd    7.27739                  osd.67 up 0.50000  1.00000
 68    hdd    7.27739                  osd.68 up 0.50000  1.00000
 69    hdd    7.27739                  osd.69 up 0.50000  1.00000
 70    hdd    7.27739                  osd.70 up 0.50000  1.00000
 71    hdd    7.27739                  osd.71 up 0.50000  1.00000
 72    hdd    7.27739                  osd.72 up 0.50000  1.00000
 73    hdd    7.27739                  osd.73 up 0.50000  1.00000
 74    hdd    7.27739                  osd.74 up 0.50000  1.00000
 75    hdd    7.27739                  osd.75 up 0.50000  1.00000
 -4         101.88345              host cephdevel-76204
 48    hdd    7.27739                  osd.48 up 0.50000  1.00000
 52    hdd    7.27739                  osd.52 up 0.50000  1.00000
 53    hdd    7.27739                  osd.53 up 0.50000  1.00000
 54    hdd    7.27739                  osd.54 up 0.50000  1.00000
 56    hdd    7.27739                  osd.56 up 0.50000  1.00000
 57    hdd    7.27739                  osd.57 up 0.50000  1.00000
 58    hdd    7.27739                  osd.58 up 0.50000  1.00000
 59    hdd    7.27739                  osd.59 up 0.50000  1.00000
 60    hdd    7.27739                  osd.60 up 0.50000  1.00000
 61    hdd    7.27739                  osd.61 up 0.50000  1.00000
 62    hdd    7.27739                  osd.62 up 0.50000  1.00000
 63    hdd    7.27739                  osd.63 up 0.50000  1.00000
 64    hdd    7.27739                  osd.64 up 0.50000  1.00000
 65    hdd    7.27739                  osd.65 up 0.50000  1.00000
-23         203.16110          row row-02
-13          87.32867              host cephdevel-76213
 27    hdd    7.27739                  osd.27 up 1.00000  1.00000
 28    hdd    7.27739                  osd.28 up 1.00000  1.00000
 29    hdd    7.27739                  osd.29 up 1.00000  1.00000
 30    hdd    7.27739                  osd.30 up 1.00000  1.00000
 31    hdd    7.27739                  osd.31 up 1.00000  1.00000
 32    hdd    7.27739                  osd.32 up 1.00000  1.00000
 33    hdd    7.27739                  osd.33 up 1.00000  1.00000
 34    hdd    7.27739                  osd.34 up 1.00000  1.00000
 35    hdd    7.27739                  osd.35 up 1.00000  1.00000
 36    hdd    7.27739                  osd.36 up 1.00000  1.00000
 37    hdd    7.27739                  osd.37 up 1.00000  1.00000
 38    hdd    7.27739                  osd.38 up 1.00000  1.00000
......
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management  - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland

