Hi Michel,

A common reason for PGs being stuck during activation is reaching the hard 
limit of PGs per OSD. You might want to compare the number of PGs osd.17 has 
(ceph osd df tree | grep -E 'osd.17 |PGS') to the hard limit set in your 
cluster (echo "`ceph config get osd.0 mon_max_pg_per_osd`*`ceph config get 
osd.0 osd_max_pg_per_osd_hard_ratio`" | bc). If both values are close to each 
other, increasing osd_max_pg_per_osd_hard_ratio should help. Re-peering PG 
32.7ef (ceph pg repeer 32.7ef) may also help. Also, considering the situation, 
make sure to disable the PG autoscaler if in use.
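
For reference, a rough sketch of the whole check (untested; osd.0 is only
used to read the cluster-wide config values, and the value 5 passed to
'ceph config set' below is just an example):

  # number of PGs currently mapped to osd.17 (PGS column)
  ceph osd df tree | grep -E 'osd.17 |PGS'

  # effective hard limit = mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio
  max_pg=$(ceph config get osd.0 mon_max_pg_per_osd)
  ratio=$(ceph config get osd.0 osd_max_pg_per_osd_hard_ratio)
  echo "$max_pg * $ratio" | bc

  # if the PG count is close to that limit, raise the ratio and re-peer
  ceph config set osd osd_max_pg_per_osd_hard_ratio 5
  ceph pg repeer 32.7ef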

Regarding the overall cluster state, it is difficult to make any assessment
without analyzing events and logs from the MONs and OSDs. The fact that the
'rados ls' command hangs for more than half of the pools suggests that one
(or more) OSD is misbehaving. Have you checked all the OSD and MON logs?
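
Since 'ceph health' mentions a failed cephadm daemon, your cluster looks
cephadm-managed, so something like the following should get you the daemon
logs (untested sketch; osd.17 is just an example, adjust the daemon names):

  # on the host running the suspect daemon
  cephadm ls                    # list the daemons and their exact names
  cephadm logs --name osd.17    # wraps journalctl for that daemon's unit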

If you're hesitant about restarting the OSDs, you could try re-peering all
the PGs of a single pool and see whether it helps 'rados ls'.
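
Something along these lines should do it (untested, and 'testpool' is just
a placeholder for one of the affected pools; the first column of
'ceph pg ls-by-pool' is the PG id):

  pool=testpool
  for pg in $(ceph pg ls-by-pool "$pool" | awk '{print $1}' | grep -E '^[0-9]+\.'); do
      ceph pg repeer "$pg"
  done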

Regards,
Frédéric.

----- On 27 Mar 25, at 10:42, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Hi,
> 
> I have not seen an answer yet; help would be very much appreciated, as
> our production cluster seems to be in worse shape than initially described...
> 
> After a deeper analysis, we found that more than half of the pools,
> despite being reported as OK, are not accessible: the 'rados ls' command
> hangs when we try to access them. This is not correlated with the EC
> versus 3-replica pool configuration (pools of both types are affected,
> and pools of both types can be fine). We don't have a clear view of
> whether we have a corruption problem (though it is unclear why it would
> have happened) or a communication problem between some cluster
> components that would explain why pools reported as good are not
> accessible and why one PG remains in the activating+remapped state,
> something that seems unusual (I could not find any reference to this
> with my Google searches).
> 
> We are now hesitating between forcing a deep scrub of everything (we
> have not seen any errors reported by the scrubs/deep scrubs run in the
> last few days) and restarting the whole cluster in case there is some
> kind of deadlock in the communication between some MONs and/or OSDs. If
> the second approach (a cluster restart) is suggested, should we do it
> daemon by daemon, or shut everything down and do a cold restart of the
> cluster?
> 
> I don't want to clutter this already long thread with too many details,
> but one of my colleagues gave me the 'ceph -s' and 'ceph osd status'
> output from before he started to reboot the servers. If it is useful, I
> can share it.
> 
> Again, thanks in advance for any help/hint.
> 
> Best regards,
> 
> Michel
> 
> On 26/03/2025 at 21:54, Michel Jouvin wrote:
>> And sorry for all these emails; I forgot to mention that we are running
>> 18.2.2.
>>
>> Michel
>>
>> On 26/03/2025 at 21:51, Michel Jouvin wrote:
>>> Hi again,
>>>
>>> Looking for more info on the degraded filesystem, I managed to
>>> connect to the dashboard, where I see an error that is not reported
>>> explicitly by 'ceph health':
>>>
>>> One or more metadata daemons (MDS ranks) are failed or in a damaged
>>> state. At best the filesystem is partially available, at worst the
>>> filesystem is completely unusable.
>>>
>>> But I can't figure out what can be done from this point... and I
>>> really don't understand how we ended up in such a state...
>>>
>>> Cheers,
>>>
>>> Michel
>>>
>>> On 26/03/2025 at 21:27, Michel Jouvin wrote:
>>>> Hi,
>>>>
>>>> We have a production cluster made of 3 mon+mgr nodes, 18 OSD servers
>>>> and ~500 OSDs, configured with ~50 pools, half EC (9+6) and half
>>>> replica 3. It also has 2 CephFS filesystems with 1 MDS each.
>>>>
>>>> 2 days ago, in a period spanning 16 hours, 13 OSDs crashed with an
>>>> OOM. The OSDs were first restarted, but it was then decided to reboot
>>>> the server with a crashed OSD and, "by mistake" (it was at least
>>>> useless), the OSDs of the rebooted server were set noout,norebalance
>>>> before the reboot. The flags were removed after the reboot.
>>>>
>>>> After all of this, 'ceph -s' started to report a lot of misplaced PGs
>>>> and recovery started. All the PGs but one were successfully
>>>> reactivated; one stayed in the activating+remapped state (located in
>>>> a pool used for tests). 'ceph health' (I don't put the details here
>>>> to avoid an overly long mail, but I can share them) says:
>>>>
>>>> HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2
>>>> MDSs report slow metadata IOs; Reduced data availability: 1 pg
>>>> inactive; 13 daemons have recently crashed
>>>>
>>>> and reports one of the filesystems as degraded even though the only
>>>> PG reported inactive is not part of a pool related to that FS.
>>>>
>>>> The recovery was slow until we realized we should change the mclock
>>>> profile to high_recovery_ops; then it completed in a few hours.
>>>> Unfortunately, the degraded filesystem remains degraded without an
>>>> obvious reason... and the inactive PG is still in the
>>>> activating+remapped state. We have not been able to identify a
>>>> relevant error in the logs so far (but we may have missed
>>>> something...).
>>>>
>>>> So far we have avoided restarting too many things until we have a
>>>> better understanding of what happened and of the current state. We
>>>> only restarted the mgr, which was using a lot of CPU, and the MDS for
>>>> the degraded FS, without any improvement.
>>>>
>>>> We are looking for advice on where to start... It seems we have (at
>>>> least) 2 independent problems:
>>>>
>>>> - A PG that cannot be reactivated, with a remap operation that does
>>>> not proceed: would stopping osd.17 help (so that osd.460 is reused)?
>>>>
>>>> [root@ijc-mon1 ~]# ceph pg dump_stuck
>>>> PG_STAT  STATE                UP            UP_PRIMARY  ACTING         ACTING_PRIMARY
>>>> 32.7ef   activating+remapped  [100,154,17]  100         [100,154,460]  100
>>>>
>>>> - 1 degraded filesystem: where to look for a reason?
>>>>
>>>> Thanks in advance for any help!
>>>>
>>>> Cheers,
>>>>
>>>> Michel