[ceph-users] OSD down cause all OSD slow ops

2023-03-30 Thread petersun
We experienced a Ceph failure in which the cluster became unresponsive, with no IOPS or throughput, because of a problematic OSD process on one node. This resulted in slow ops and no IOPS on all other OSDs in the cluster. The incident timeline is as follows: an alert was triggered for the OSD problem.
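
For reference, a minimal sketch of pulling the per-OSD slow-ops lines out of the health output; it assumes the ceph CLI is usable from the node it runs on, and the exact wording of the "slow ops" health messages can vary between releases:

import subprocess

# Ask the cluster for detailed health and keep only the slow-ops lines,
# e.g. "N slow ops, oldest one blocked for X sec, osd.Y has slow ops".
detail = subprocess.run(
    ["ceph", "health", "detail"],
    capture_output=True, text=True, check=True,
).stdout

for line in detail.splitlines():
    if "slow ops" in line:
        print(line.strip())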

[ceph-users] Ceph Failure and OSD Node Stuck Incident

2023-03-30 Thread petersun
We encountered a Ceph failure where the cluster became unresponsive with no IOPS or throughput after a node failed. Upon investigation, it appears that the OSD process on one of the Ceph storage nodes is stuck, although the node still responds to ping. However, during the failure, Ceph was unable ...
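
As a rough reference, a sketch of listing which OSDs the monitors consider down, plus the manual knock-out for an OSD process that is hung but still marked "up"; it assumes the ceph CLI is reachable and that the JSON field names ("nodes", "type", "status") match the release in use:

import json
import subprocess

# Dump the OSD tree as JSON and report any OSDs that are not "up".
tree = json.loads(subprocess.run(
    ["ceph", "osd", "tree", "-f", "json"],
    capture_output=True, text=True, check=True,
).stdout)

for node in tree.get("nodes", []):
    if node.get("type") == "osd" and node.get("status") != "up":
        print(f"{node['name']} is {node.get('status')}")

# If an OSD process is hung but still heartbeating enough to stay "up",
# it can be marked down by hand so peering can move on (osd id is a placeholder):
#   ceph osd down osd.12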

[ceph-users] Re: ceph cluster iops low

2023-01-24 Thread petersun
Hi Mark, Thanks for your response, it helps! Our Ceph cluster uses Samsung 870 EVO SSDs, all backed by NVMe drives: 12 SSDs to 2 NVMe drives per storage node, with each 4 TB SSD backed by a 283 GB NVMe LVM partition as its DB. Right now cluster throughput is only about 300 MB/s of writes and around 5K IOPS. I could see the NVMe d...
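
For what it's worth, a quick back-of-the-envelope check of the DB layout described above, using only the numbers from the message (12 SSDs sharing 2 NVMe drives per node, 283 GB of DB per 4 TB SSD):

# Rough sizing check for the DB layout described above.
ssd_capacity_gb = 4000        # one 4 TB data SSD
db_partition_gb = 283         # its NVMe LVM DB partition
ssds_per_node = 12
nvmes_per_node = 2

db_ratio = db_partition_gb / ssd_capacity_gb
db_per_nvme_gb = db_partition_gb * ssds_per_node / nvmes_per_node

print(f"DB partition is {db_ratio:.1%} of the data device")      # ~7.1%
print(f"each NVMe carries about {db_per_nvme_gb:.0f} GB of DB")  # ~1698 GB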

[ceph-users] ceph cluster iops low

2023-01-23 Thread petersun
My Ceph IOPS are very low: over 48 SSDs backed by NVMe drives for DB/WAL across four physical servers, yet the whole cluster does only about 20K IOPS total. It looks like the I/O is being held back by a bottleneck somewhere. Dstat shows a lot of context switches and over 150K interrupts while I am running an fio 4K, 128-queue-depth benchmark. I c...
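
For comparison, a minimal sketch of the kind of 4K / QD128 random-write fio run described above; the target path and job sizes are placeholders, so point it at a scratch file or test device rather than production data:

import subprocess

# 4 KiB random writes at queue depth 128, similar to the test described above.
# TARGET is a placeholder -- use a scratch file or a dedicated test device.
TARGET = "/mnt/cephtest/fio.bin"

subprocess.run([
    "fio",
    "--name=randwrite-4k",
    "--filename=" + TARGET,
    "--size=10G",
    "--ioengine=libaio",
    "--direct=1",
    "--rw=randwrite",
    "--bs=4k",
    "--iodepth=128",
    "--numjobs=4",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
], check=True)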