[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-04 Thread Alexander Schreiber
On Tue, Mar 04, 2025 at 06:46:20PM +, Eugen Block wrote:
> > It's almost always the network ;-)
>
> I know, I have memorized your famous tweet about Ceph being the best network
> monitor 😄

It seems to be ;-) When I spun up my small cluster, I used a noname 10G switch. Ceph complained bitterl

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-04 Thread Eugen Block
> It's almost always the network ;-)

I know, I have memorized your famous tweet about Ceph being the best network monitor 😄 and there hasn't been a single week that I haven't thought about that. 🙂

Quoting Dan van der Ster:

> > it's not Ceph but the network
>
> It's almost always the network ;-)

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-04 Thread Dan van der Ster
> it's not Ceph but the network

It's almost always the network ;-)

Ramin: This reminds me of an outage we had at CERN caused by routing / ECMP / a faulty line card. One of the main symptoms of that is high TCP retransmits on the Ceph nodes. Basically, OSDs keep many connections open with each othe
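The retransmit symptom described above can be checked on a Linux node by reading the cumulative TCP counters in /proc/net/snmp. The following is only a minimal sketch, not a tool from this thread: the function names are hypothetical, and it assumes the usual layout of that file (a `Tcp:` header line followed by a `Tcp:` value line containing `OutSegs` and `RetransSegs`).

```python
from pathlib import Path


def parse_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Return the Tcp counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    # First line holds the counter names, second line the matching values.
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(counters: dict[str, int]) -> float:
    """Fraction of sent TCP segments that were retransmissions."""
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


def local_retransmit_ratio(path: str = "/proc/net/snmp") -> float:
    """Convenience wrapper for the local node (Linux only)."""
    return retransmit_ratio(parse_tcp_counters(Path(path).read_text()))
```

The counters are cumulative since boot, so sampling twice a few seconds apart and comparing the deltas is likely more telling than a single reading. What counts as "high" depends on the environment; the thread does not give a threshold.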

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-04 Thread Eugen Block
A few years ago, one of our customers complained about latency issues. We investigated, and the only real evidence we found was, again, high retransmit values. So we recommended that their network team look into it. For months they refused to do anything, until they hired another company to

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-03 Thread Brett Niver
I think using the dashboard you can check for incorrect MTU settings; that is sometimes an issue.

Brett

On Mon, Mar 3, 2025 at 12:42 PM Ramin Najjarbashi <ramin.najarba...@gmail.com> wrote:
> The Ceph version is 17.2.7.
>
> • OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OS
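Outside the dashboard, an MTU mismatch can also be spotted from the command line. This sketch is not the dashboard check mentioned above but a hypothetical alternative: it assumes Linux nodes, where the MTU of an interface is exposed under /sys/class/net/<interface>/mtu, and that the per-host values have already been collected somehow (for example over SSH). All names are illustrative.

```python
from collections import Counter
from pathlib import Path


def read_mtu(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the MTU of a local interface from sysfs (Linux only)."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())


def find_mtu_outliers(mtu_by_host: dict[str, int]) -> dict[str, int]:
    """Return the hosts whose MTU differs from the most common value."""
    if not mtu_by_host:
        return {}
    expected, _ = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != expected}
```

For example, `find_mtu_outliers({"osd1": 9000, "osd2": 9000, "osd3": 1500})` flags `osd3`. Matching host settings are not the whole story, since switch ports along the path need to carry the same frame size.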

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-03 Thread Ramin Najjarbashi
The Ceph version is 17.2.7.

• OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OSDs.
• SSDs are used for metadata and index pools with replication 3.
• HDDs store the data pool using EC 4+2.

Interestingly, the same issue has appeared on another cluster where DB/WAL is placed o

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-03 Thread Stefan Kooman
On 01-03-2025 15:10, Ramin Najjarbashi wrote:
> Hi
>
> We are currently facing severe latency issues in our Ceph cluster, particularly affecting read and write operations. At times, write operations completely stall, leading to significant service degradation. Below is a detailed breakdown of the issue

[ceph-users] Re: Severe Latency Issues in Ceph Cluster

2025-03-01 Thread Anthony D'Atri
> Network Issues: Packet drops and a high number of TCP retransmits were
> identified.

Are you overflowing nf_conntrack?

Look for layer 1 issues: cables, seating, RAM errors.

Update NIC firmware.

Do you have deep C-states disabled?
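For the nf_conntrack question, one way to see how close a node is to the table limit is to compare the current entry count with the configured maximum. This is a rough sketch with hypothetical function names; it assumes the nf_conntrack module is loaded, in which case both values appear under /proc/sys/net/netfilter.

```python
from pathlib import Path

CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack(directory: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Read (count, max) for the local node; needs nf_conntrack loaded."""
    count = int((directory / "nf_conntrack_count").read_text())
    maximum = int((directory / "nf_conntrack_max").read_text())
    return count, maximum
```

A usage value close to 1.0 suggests the table is at risk of overflowing. When it actually overflows, the kernel log normally shows "nf_conntrack: table full, dropping packet", so checking dmesg for that message is a useful complement.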