Hi,

Still trying to understand what is really happening under the hood. I ran
more tests and collected the data.
I changed `osd max pg per osd hard ratio` to 16384, but this didn't
change anything.

Scenario: 4 nodes, 4 disks per node, Ceph 12.2.7
1. Create 4 OSDs with device class <pool1>.
2. Create a pool with a crush rule mapping to device class <pool1>.
3. Everything is OK.
4. Remove (delete) the 4 OSDs -> 0 OSDs present in the system; all PGs
are "stale".
5. Recreate 4 new OSDs (using the same disks or other disks, it doesn't
matter) with device class <pool2>.
6. Try to create a pool, pool2, mapping to device class <pool2>.

Pool2 creation fails with: Error ERANGE: pg_num 256 size 2 would mean 1024
total pgs, which exceeds max 800 (mon_max_pg_per_osd 200 * num_in_osds 4)

How did the pool check come up with 1024 total PGs when we only allocated
256 PGs with size 2 for the pool (256 * 2 = 512)?
It looks like the pool check algorithm is summing the total number of PGs
across all pools, i.e. pool1 (256*2) + pool2 (256*2) = 1024, rather than
counting only the PGs of the pool being created and its associated OSDs.
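To make the math concrete, here is a rough Python sketch of what the check
seems to be doing. This is my own reconstruction of the behaviour, not the
actual Ceph source; the function name and signature are mine:

```python
# Rough model of the monitor's pool-creation check (my reconstruction,
# not the actual Ceph code): it appears to sum pg_num * size over *every*
# existing pool in the cluster, regardless of which crush rule / OSDs
# each pool maps to.
def check_pool_create(existing_pools, new_pg_num, new_size,
                      mon_max_pg_per_osd=200, num_in_osds=4):
    projected = sum(pg_num * size for pg_num, size in existing_pools)
    projected += new_pg_num * new_size
    limit = mon_max_pg_per_osd * num_in_osds
    if projected > limit:
        return ("ERANGE: pg_num %d size %d would mean %d total pgs, "
                "which exceeds max %d"
                % (new_pg_num, new_size, projected, limit))
    return "OK"

# pool1 still counts 256 * 2 = 512 PGs even though its OSDs were deleted,
# so creating pool2 projects 512 + 512 = 1024 > 800:
print(check_pool_create([(256, 2)], 256, 2))
```

This reproduces exactly the 1024 vs. 800 numbers from the error above.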


PG dump : https://paste.ee/p/sP8xZ
Ceph status :
  cluster:
    id:     ea0df043-7b25-4447-a43d-e9b2af8fe069
    health: HEALTH_WARN
            Reduced data availability: 37 pgs inactive, 37 pgs peering, 256
pgs stale

  services:
    mon: 3 daemons, quorum
stratonode0.node.strato,stratonode1.node.strato,stratonode3.node.strato
    mgr: stratonode0(active), standbys: stratonode1, stratonode3
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 bytes
    usage:   4117 MB used, 9311 GB / 9315 GB avail
    pgs:     14.453% pgs not active
             219 stale+active+clean
             37  stale+peering


Is there a simple way to work around this limit in the meantime? I could
max out the max_pg value while retaining the 200-per-OSD limit in my own
PG allocation calculation.
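For my own bookkeeping, this is the kind of calculation I have in mind
(hypothetical helper, assuming the check really does sum PGs globally
across all pools; the function name is mine):

```python
# Hypothetical helper: smallest mon_max_pg_per_osd value that would let
# the global check pass, assuming the check sums pg_num * size over all
# pools and divides by the number of "in" OSDs.
def required_mon_max_pg_per_osd(pools, num_in_osds):
    total = sum(pg_num * size for pg_num, size in pools)
    return -(-total // num_in_osds)  # ceiling division

# pool1 (256 PGs, size 2) plus the pool2 I want (256 PGs, size 2),
# spread over the 4 "in" OSDs:
print(required_mon_max_pg_per_osd([(256, 2), (256, 2)], 4))  # -> 256
```

So raising mon_max_pg_per_osd to at least 256 should let pool2 be created
in this scenario, while I keep enforcing 200 per OSD myself.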

Note that allocating 2 pools, 256 PGs each with replica 2, each with its
own dedicated device class mapping to its own 4 OSDs, works without any
issue (basically the scenario above without the deletion step).

It seems that it is only when the OSDs are deleted that the whole
calculation gets screwed up.
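What I would expect instead is per-rule accounting, something like the
sketch below. This is purely illustrative of the behaviour I'm describing,
not existing Ceph code:

```python
# Sketch of the per-rule accounting I would expect (illustrative only):
# count only the PGs of the pool being created, against only the OSDs
# that its crush rule can actually map to.
def per_rule_check(pool_pg_num, pool_size, osds_in_rule,
                   mon_max_pg_per_osd=200):
    projected = pool_pg_num * pool_size
    limit = mon_max_pg_per_osd * len(osds_in_rule)
    return projected <= limit

# pool2: 256 PGs, size 2; its crush rule maps only to its own 4 OSDs,
# so 512 <= 800 and creation would be allowed:
print(per_rule_check(256, 2, [0, 1, 2, 3]))
```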

On Thu, 26 Jul 2018 at 20:52, Benoit Hudzia <ben...@stratoscale.com> wrote:

> Sorry missing the pg dump :
>
> 2.1           0                  0        0         0       0     0   0
>     0 stale+peering 2018-07-26 19:38:13.381673     0'0    125:9 [3]
>   3    [3]              3        0'0 2018-07-26 15:20:08.965357
>  0'0 2018-07-26 15:20:08.965357             0
> 2.0           0                  0        0         0       0     0   0
>     0 stale+peering 2018-07-26 19:38:13.345341     0'0   125:13 [3]
>   3    [3]              3        0'0 2018-07-26 15:20:08.965357
>  0'0 2018-07-26 15:20:08.965357             0
>
> 2 0 0 0 0 0 0 0 0
>
> sum 0 0 0 0 0 0 0 0
> OSD_STAT USED  AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
> 3        1051M 1861G 1863G  [0,1,2]    256            256
> 2        1051M 1861G 1863G  [0,1,3]      0              0
> 1        1051M 3724G 3726G  [0,2,3]      0              0
> 0        1051M 1861G 1863G  [1,2,3]      0              0
> sum      4205M 9310G 9315G
>
> For some reason it seems that some PGs are allocated to OSD 3 (but stale +
> peering).
>
> This is kind of odd.
>
> On Thu, 26 Jul 2018 at 20:50, Benoit Hudzia <ben...@stratoscale.com>
> wrote:
>
>> You are correct, the PGs are stale (not allocated).
>>
>> [root@stratonode1 /]# ceph status
>>   cluster:
>>     id:     ea0df043-7b25-4447-a43d-e9b2af8fe069
>>     health: HEALTH_WARN
>>             Reduced data availability: 256 pgs inactive, 256 pgs peering,
>> 256 pgs stale
>>
>>   services:
>>     mon: 3 daemons, quorum
>> stratonode1.node.strato,stratonode2.node.strato,stratonode0.node.strato
>>     mgr: stratonode1(active), standbys: stratonode2, stratonode3
>>     osd: 4 osds: 4 up, 4 in
>>
>>   data:
>>     pools:   1 pools, 256 pgs
>>     objects: 0 objects, 0 bytes
>>     usage:   4192 MB used, 9310 GB / 9315 GB avail
>>     pgs:     100.000% pgs not active
>>              256 stale+peering
>>
>> The PG dump shows all PGs in stale+peering.
>>
>> However, it's kind of strange that it shows some PGs associated with OSD 3.
>>
>>
>> So it seems that the PG calculation is not taking the ruleset into
>> account .....
>>
>> Do you think that changing "osd max pg per osd hard ratio" to a huge
>> number (1M) would be a valid temporary workaround?
>>
>> We always allocate pools with dedicated OSDs using the device class rule
>> set, so we never have pools sharing OSDs.
>>
>> I'll open a bug with Ceph regarding the pg creation check ignoring the
>> crush ruleset.
>>
>>
>> On Thu, 26 Jul 2018 at 17:11, John Spray <jsp...@redhat.com> wrote:
>>
>>> On Thu, Jul 26, 2018 at 4:57 PM Benoit Hudzia <ben...@stratoscale.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We currently segregate Ceph pool PG allocation using the crush device
>>>> class ruleset as described here:
>>>> https://ceph.com/community/new-luminous-crush-device-classes/
>>>> simply using the following command to define the rule: ceph osd crush
>>>> rule create-replicated <RULE> default host <DEVICE CLASS>
>>>>
>>>> However, we noticed that the rule is not strict in certain scenarios.
>>>> By that, I mean that if there is no OSD of the specified device class,
>>>> Ceph will allocate PGs for the pool to any other available OSD
>>>> (creating an issue with the PG calculation when we want to add a new
>>>> pool).
>>>>
>>>> Simple scenario:
>>>> 1. Create 1 pool, <pool1>, replication 2, with 4 nodes, 1 OSD each,
>>>> belonging to class <pool1>.
>>>> 2. Remove all OSDs (delete them).
>>>> 3. Create 4 new OSDs (using the same disks but different IDs), but
>>>> this time tag them with class <pool2>.
>>>> 4. Try to create pool <pool2> -> the pool creation fails with the
>>>> output: Error ERANGE: pg_num 256 size 2 would mean 1024 total pgs,
>>>> which exceeds max 800 (mon_max_pg_per_osd 200 * num_in_osds 4)
>>>>
>>>> Pool1 simply started allocating PGs to OSDs that don't belong to the
>>>> ruleset.
>>>>
>>>
>>> Are you sure pool 1's PGs are actually being placed on the wrong OSDs?
>>> Have you looked at the output of "ceph pg dump" to check that?
>>>
>>> It sounds more like the pool creation check is simply failing to
>>> consider the crush rules and applying a cruder global check.
>>>
>>> John
>>>
>>>
>>>>
>>>> Which leads me to the following question: is there a way to make the
>>>> crush rule a hard requirement? E.g., if we do not have any OSD matching
>>>> the device class, it won't start trying to allocate PGs to OSDs that
>>>> don't match it.
>>>>
>>>> Is there any way to prevent pool1 from using those OSDs?
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Benoit Hudzia
>>>>
>>>> Mobile (UK): +44 (0) 75 346 78673
>>>> Mobile (IE):  +353 (0) 89 219 3675
>>>> Email: ben...@stratoscale.com
>>>>
>>>>
>>>>
>>>> Web <http://www.stratoscale.com/> | Blog
>>>> <http://www.stratoscale.com/blog/> | Twitter
>>>> <https://twitter.com/Stratoscale> | Google+
>>>> <https://plus.google.com/u/1/b/108421603458396133912/108421603458396133912/posts>
>>>>  | Linkedin <https://www.linkedin.com/company/stratoscale>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>
