[ceph-users] Flapping OSDs on pacific 16.2.10

J-P Methot Wed, 18 Jan 2023 04:43:32 -0800

Hi,

We have a full SSD production cluster running on Pacific 16.2.10 and deployed with cephadm that is experiencing OSD flapping issues. Essentially, random OSDs will get kicked out of the cluster and then automatically brought back in a few times a day. As an example, let's take the case of OSD.184:

- It flapped 9 times between January 15th and 17th, with the following log message each time: 2023-01-15T16:33:19.903+0000 prepare_failure osd.184 from osd.49 is reporting failure:1

- On January 17th, it complains that there are slow ops and spams its logs with the following line: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s
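
For anyone wanting to check the same pattern on their own cluster, here is a minimal sketch of how the failure reports could be tallied per target OSD, reporting OSD and day. It assumes each log line starts with the timestamp and otherwise looks like the line quoted above; real monitor logs may carry extra fields between the tokens, and the function name count_failure_reports is just a placeholder of mine, not anything from Ceph.

```python
import re
from collections import Counter

# Pattern modelled on the quoted log line. Assumption: the line begins with
# the timestamp; extra fields between tokens are tolerated via lazy ".*?".
FAILURE_RE = re.compile(
    r"^(?P<ts>\S+)\s+.*?prepare_failure\s+(?P<target>osd\.\d+)\b.*?"
    r"from\s+(?P<reporter>osd\.\d+)\s+is reporting failure:1"
)


def count_failure_reports(lines):
    """Count failure reports keyed by (target OSD, reporting OSD, day)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            day = match.group("ts")[:10]  # YYYY-MM-DD part of the timestamp
            key = (match.group("target"), match.group("reporter"), day)
            counts[key] += 1
    return counts
```

Feeding it the lines of a saved log file (for example `count_failure_reports(open(path))`) would show whether the reports always come from the same peer or from many different OSDs, which may help tell a problem on one host or link apart from a problem with the flapping OSD itself.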

The storage node itself has over 30 GB of RAM still available in cache, and the drives themselves only seldom peak at 100% usage, which never lasts more than a few seconds. CPU usage is also constantly around 5%. Considering there are no other error messages in any of the regular logs, including the systemd logs, why would this OSD not reply to heartbeats?


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
