We experienced a Ceph failure in which the system became unresponsive, with no
IOPS or throughput, due to a problematic OSD process on one node. This resulted
in slow ops and no IOPS for all other OSDs in the cluster. The incident
timeline is as follows:
Alert triggered for OSD problem.
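A quick way to see which OSD the cluster is blaming for the slow ops is to
scrape "ceph health detail". The sketch below is a minimal Python example,
assuming the standard ceph CLI is available on an admin host and that the
health text names the offending daemons as osd.N; the exact wording of the
health messages varies between releases.

# Minimal sketch: list the OSD ids mentioned in "ceph health detail".
# This is not an official Ceph tool, just a convenience wrapper.
import re
import subprocess

def osds_named_in_health():
    detail = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Slow-op lines usually look like "osd.12 has slow ops" or
    # "42 slow ops, oldest one blocked for 600 sec, osd.12 has slow ops".
    return sorted(set(re.findall(r"osd\.(\d+)", detail)), key=int)

if __name__ == "__main__":
    osds = osds_named_in_health()
    if osds:
        print("OSDs named in health detail: " + ", ".join("osd." + o for o in osds))
    else:
        print("No OSDs named in ceph health detail.")

Once the stuck daemon is identified, restarting it, or marking it down with
"ceph osd down <id>" so the cluster re-peers, is the usual next step.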
We encountered a Ceph failure where the system became unresponsive, with no
IOPS or throughput, after a node failed. Upon investigation, it appears that
the OSD process on one of the Ceph storage nodes is stuck, although the node
still responds to ping. However, during the failure, Ceph was unabl
Hi Mark,
Thanks for your response, it is very helpful!
Our Ceph cluster uses Samsung SSD 870 EVO drives, all backed by NVMe for the
DB: 12 SSDs to 2 NVMe drives per storage node, with each 4TB SSD backed by a
283G NVMe LVM partition as its DB.
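For a layout like this, it is worth double-checking that every OSD's block.db
really landed on the NVMe LVM partitions. Below is a minimal sketch, assuming
ceph-volume is run as root on the storage node itself and that its JSON output
maps OSD ids to a list of logical volumes with "type" and "devices" fields
(field names can differ slightly between releases).

# Minimal sketch: show, per OSD, which physical devices back the data
# LV and the DB LV, using "ceph-volume lvm list --format json".
import json
import subprocess

def osd_device_layout():
    out = subprocess.run(
        ["ceph-volume", "lvm", "list", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    layout = {}
    for osd_id, lvs in json.loads(out).items():
        # Each LV entry is expected to carry its role ("block", "db",
        # "wal") and the physical devices backing it.
        layout[osd_id] = {lv.get("type", "?"): lv.get("devices", []) for lv in lvs}
    return layout

if __name__ == "__main__":
    for osd_id, devs in sorted(osd_device_layout().items(), key=lambda kv: int(kv[0])):
        print("osd.%s: data=%s  db=%s" % (osd_id, devs.get("block"), devs.get("db")))

Every db= entry should point at one of the two NVMe drives; an OSD whose DB
ended up on the SSD itself would behave noticeably worse than its neighbours.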
Right now the cluster does only about 300 MB/s of write throughput and around
5K IOPS. I could see NVMe d
My Ceph IOPS are very low, with over 48 SSDs backed by NVMe for DB/WAL across
four physical servers. The whole cluster does only about 20K IOPS in total. It
looks like the IOs are being held back by a bottleneck somewhere. dstat shows a
lot of context switches and interrupts, over 150K, while I am running an fio
4K, 128-queue-depth benchmark.
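For reference, this is roughly the kind of fio run described above, driven
from Python. It is only a sketch: the original test target and read/write mix
are not stated, so the /dev/rbd0 device path and the randwrite workload below
are assumptions, and a write test will destroy whatever is on that device.

# Minimal sketch of a 4K, queue-depth-128 fio run via subprocess.
# Assumes fio is installed and /dev/rbd0 is a mapped RBD test device
# (hypothetical; adjust to the real target before running).
import subprocess

FIO_CMD = [
    "fio",
    "--name=ceph-4k-qd128",
    "--filename=/dev/rbd0",
    "--ioengine=libaio",
    "--direct=1",
    "--rw=randwrite",
    "--bs=4k",
    "--iodepth=128",
    "--numjobs=1",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
]

if __name__ == "__main__":
    subprocess.run(FIO_CMD, check=True)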
I c