I really question the use of RDMA with Ceph... at least currently. Sure, pure RDMA latency (~0.005 ms?) is much better than TCP (0.03-0.05 ms), so it helps when you are serving i/o from a single raw NVMe drive. But in an SDS system like Ceph, where OSD latency is around 0.3 ms for reads and 1 ms for writes, we are already an order of magnitude above the TCP overhead, so using RDMA rather than TCP will not have a noticeable impact and may not be worth the hassle. This of course may change in the future if OSD latencies drop, but that will not happen any time soon.
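A quick back-of-envelope with the ballpark figures above makes the point (these are my rough estimates, not benchmarks):

```python
# Rough estimate: how much does swapping TCP for RDMA change end-to-end
# read latency when the OSD itself dominates? All figures in milliseconds
# and taken from the ballpark numbers above, not from measurements.
osd_read = 0.3    # typical Ceph OSD read latency estimate
tcp_net = 0.04    # mid-range TCP latency estimate (0.03-0.05 ms)
rdma_net = 0.005  # pure RDMA latency estimate

tcp_total = osd_read + tcp_net      # 0.34 ms end-to-end over TCP
rdma_total = osd_read + rdma_net    # 0.305 ms end-to-end over RDMA
saving = (tcp_total - rdma_total) / tcp_total
print(f"end-to-end read latency saving: {saving:.1%}")  # about 10%
```

So even with optimistic numbers the whole-path improvement is on the order of 10%, which is hard to notice in practice.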


On 03/06/2025 12:07, Jan Marek wrote:
Hello,

we sometimes have a problem with an OSD in our all-flash NVMe Ceph
cluster.

Our cluster was deployed with cephadm, using Debian bookworm as the
base system and podman as the container runtime.

We are using dual-port Mellanox 100 Gbps adapters:

# lspci | grep Mellanox
43:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
43:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

We are using TCP/IP for front-end communication and
RDMA for internal communication.

Sometimes (after roughly 2-3 months of normal operation) one OSD
container on a machine has a problem connecting to the other OSD
processes, but it is still connected to the MON process. This leads
to a "half-dead" situation: it cannot connect to other OSDs, and
other OSDs cannot connect to it, but it keeps reporting to the MON
process that it is still alive.

In our experience, this situation can be resolved by restarting the
container of this OSD.

We have a script that runs every 5 minutes and searches the logs for
this pattern:

start_waiting_for_healthy

If this pattern is found, we restart the container of the
corresponding OSD daemon.

After this container is restarted, the Ceph cluster quickly returns
to a HEALTHY state.
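For reference, the watchdog logic described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: it assumes logs are read from journalctl, that relevant lines mention the daemon as "osd.<N>", and that the OSD containers are restarted through their cephadm-created systemd units (ceph-<fsid>@osd.<N>).

```python
#!/usr/bin/env python3
# Hypothetical sketch of a 5-minute OSD watchdog (not the original script).
# Assumptions: journalctl as the log source; cephadm-deployed OSDs managed
# by systemd units named ceph-<fsid>@osd.<id>, so restarting the unit
# restarts the podman container.
import re
import subprocess

PATTERN = "start_waiting_for_healthy"

def osds_to_restart(log_text: str) -> set:
    """Return the set of OSD ids whose log lines contain the pattern.

    Assumes each relevant line mentions the daemon as 'osd.<N>'.
    """
    osds = set()
    for line in log_text.splitlines():
        if PATTERN in line:
            m = re.search(r"osd\.(\d+)", line)
            if m:
                osds.add(m.group(1))
    return osds

def restart_osd(osd_id: str, fsid: str) -> None:
    # Restart the systemd unit wrapping the OSD container.
    subprocess.run(
        ["systemctl", "restart", f"ceph-{fsid}@osd.{osd_id}"],
        check=True,
    )

def main(fsid: str) -> None:
    # Intended to run from cron every 5 minutes: scan the last 5 minutes
    # of the journal and restart any OSD that logged the pattern.
    out = subprocess.run(
        ["journalctl", "--since", "-5min", "-o", "cat"],
        capture_output=True, text=True, check=True,
    ).stdout
    for osd_id in osds_to_restart(out):
        restart_osd(osd_id, fsid)
```

Invoked from cron as e.g. `main("<cluster-fsid>")`; the fsid placeholder and the exact journal filtering are assumptions about the deployment.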

I'm attaching a log of this situation, up to the restart of the
container. It was the osd.12 container with PID 3920171.

In my opinion there is some problem with opening the InfiniBand
connection - some problem in the InfiniBand stack... :-(

I'm open to your questions, although I cannot perform any special
operations, because this is our production cluster...

Many thanks for this software :-)

Sincerely
Jan Marek

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
