On Tue, Mar 04, 2025 at 06:46:20PM +, Eugen Block wrote:
> > It's almost always the network ;-)
>
> I know, I have memorized your famous tweet about Ceph being the best network
> monitor 😄
It seems to be ;-)
When I spun up my small cluster, I used a no-name 10G switch. Ceph
complained bitterl
It's almost always the network ;-)
I know, I have memorized your famous tweet about Ceph being the best
network monitor 😄
and there hasn’t been a single week that I haven’t thought about that. 🙂
Quoting Dan van der Ster:
it's not Ceph but the network
It's almost always the network ;-)
Ramin: This reminds me of an outage we had at CERN caused by routing /
ECMP / faulty line card.
One of the main symptoms of that is a high rate of TCP retransmits on the Ceph nodes.
Basically, OSDs keep many connections open with each othe
A few years ago, one of our customers complained about latency issues.
We investigated, and the only real evidence we found was also high
retransmit values. So we recommended that their network team look
into it. For months they refused to do anything, until they hired
another company to
I think you can use the dashboard to check for incorrect MTU settings, which
is sometimes an issue.
Brett
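If the dashboard is not handy, the configured MTUs can also be compared by hand on each node. The following is only a rough Python sketch, assuming a Linux host that exposes /sys/class/net/<iface>/mtu; the function names and the expected value of 9000 are illustrative assumptions, not something from this thread or from Ceph itself.

```python
from pathlib import Path


def read_local_mtus(sys_net="/sys/class/net"):
    """Map each local interface name to its configured MTU (Linux sysfs)."""
    mtus = {}
    for iface in sorted(Path(sys_net).iterdir()):
        mtu_file = iface / "mtu"
        # Skip entries that are not interfaces (e.g. bonding_masters).
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def mtu_mismatches(mtus, expected, ignore=("lo",)):
    """Return the interfaces whose MTU differs from the expected value."""
    return {
        name: mtu
        for name, mtu in mtus.items()
        if name not in ignore and mtu != expected
    }
```

Calling mtu_mismatches(read_local_mtus(), 9000) on every node would list interfaces that are not set to jumbo frames. This only sees the host side; a mismatched switch port in the path would not show up here.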
On Mon, Mar 3, 2025 at 12:42 PM Ramin Najjarbashi <
ramin.najarba...@gmail.com> wrote:
> The Ceph version is 17.2.7.
>
>
> • OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OS
The Ceph version is 17.2.7.
• OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OSDs.
• SSDs are used for metadata and index pools with replication 3.
• HDDs store the data pool using EC 4+2.
Interestingly, the same issue has appeared on another cluster where DB/WAL
is placed o
On 01-03-2025 15:10, Ramin Najjarbashi wrote:
Hi
We are currently facing severe latency issues in our Ceph cluster,
particularly affecting read and write operations. At times, write
operations completely stall, leading to significant service degradation.
Below is a detailed breakdown of the issue
> Network Issues: Packet drops and a high number of TCP retransmits were
> identified.
Are you overflowing nf_conntrack?
Look for layer 1 issues: cables, seating, RAM errors.
Update NIC firmware.
Do you have deep C-states disabled?
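To put numbers on the retransmit and nf_conntrack questions, here is a minimal Python sketch. It assumes a Linux host where /proc/net/snmp is readable and, for the conntrack check, that the nf_conntrack module is loaded so that nf_conntrack_count and nf_conntrack_max exist under /proc/sys/net/netfilter. The function names are made up for illustration and are not part of Ceph or any existing tool.

```python
from pathlib import Path


def parse_tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp text."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(counters):
    """Fraction of sent TCP segments that were retransmissions."""
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


def conntrack_usage(count_text, max_text):
    """Fraction of the connection tracking table currently in use."""
    return int(count_text) / int(max_text)


def report():
    """Print both figures for the local host (counters are since boot)."""
    counters = parse_tcp_counters(Path("/proc/net/snmp").read_text())
    print(f"TCP retransmit ratio: {retransmit_ratio(counters):.4%}")
    base = Path("/proc/sys/net/netfilter")
    count_file = base / "nf_conntrack_count"
    max_file = base / "nf_conntrack_max"
    if count_file.is_file() and max_file.is_file():
        usage = conntrack_usage(count_file.read_text(), max_file.read_text())
        print(f"nf_conntrack table usage: {usage:.1%}")
    else:
        print("nf_conntrack not loaded on this host")
```

Running report() on each OSD node gives a quick comparison between hosts. The counters are cumulative since boot, so sampling twice and comparing the difference says more about what is happening right now than a single reading does.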