I really question the use of RDMA with Ceph... at least currently. Sure, pure RDMA latency (~0.005 ms?) is much better than TCP (0.03-0.05 ms), so it helps when, say, you are serving i/o from a single raw NVMe drive. But in an SDS system like Ceph, where OSD latency is around 0.3 ms for reads and 1 ms for writes, we are already an order of magnitude above the TCP latency, so using RDMA rather than TCP will not have a noticeable impact and may not be worth the hassle. This may of course change in the future if OSD latencies drop, but that will not happen any time soon.
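To make the order-of-magnitude argument concrete, here is a rough back-of-the-envelope calculation using the approximate numbers above (the exact latencies will of course vary per cluster; these are illustrative, not measurements):

```python
# Rough illustration: network round-trip is a small slice of total OSD
# latency, so swapping TCP for RDMA moves the end-to-end number very little.
# All values in milliseconds, taken from the approximate figures above.
tcp_rtt = 0.04    # assumed midpoint of the 0.03-0.05 ms TCP range
rdma_rtt = 0.005  # pure RDMA latency
osd_read = 0.3    # typical Ceph OSD read latency

total_tcp = osd_read + tcp_rtt
total_rdma = osd_read + rdma_rtt
saving = (total_tcp - total_rdma) / total_tcp  # fraction of total saved

print(f"read with TCP:  {total_tcp:.3f} ms")
print(f"read with RDMA: {total_rdma:.3f} ms")
print(f"relative saving: {saving:.1%}")
```

So even with a best-case RDMA stack, the read path improves by only about a tenth, and the write path (around 1 ms) by even less.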
On 03/06/2025 12:07, Jan Marek wrote:
Hello,

we have an occasional problem with OSDs in our all-flash NVMe Ceph cluster. The cluster is deployed with cephadm, using Debian bookworm as the base system and podman as the container runtime. We are using dual-port Mellanox 100 Gbps adapters:

# lspci | grep Mellanox
43:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
43:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

We use TCP/IP for front-end communication and RDMA for internal communication.

Sometimes (after roughly 2-3 months of normal operation) one OSD container on a machine can no longer connect to the other OSD processes, while it stays connected to the MON process. This leaves it "half-dead": it cannot connect to other OSDs and other OSDs cannot connect to it, yet it keeps reporting to the MON that it is alive.

In our experience this situation can be resolved by restarting that OSD's container. We have a script which runs every 5 minutes and searches the logs for this pattern:

start_waiting_for_healthy

If the pattern is found, we restart the container of the affected OSD daemon. After the container restarts, the Ceph cluster quickly returns to the HEALTHY state.

I'm attaching a log of this situation occurring, up to the container restart. It was the osd.12 container with PID 3920171. In my opinion there is some problem opening the InfiniBand connection - some problem in the InfiniBand stack... :-(

I'm open to your questions, although I cannot do anything too invasive, because this is our production cluster...

Many thanks for this software :-)

Sincerely
Jan Marek

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
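For reference, the kind of watchdog Jan describes could be sketched as below. The original script is not shown in the post, so this is only a minimal illustration: the log source, the container naming scheme, and the restart command are assumptions that would need to be adapted to a real cephadm deployment.

```python
#!/usr/bin/env python3
"""Sketch of a watchdog that scans recent OSD logs for the
"start_waiting_for_healthy" marker and restarts the matching container.
Log retrieval and container names are assumptions, not from the post."""
import re
import subprocess

PATTERN = "start_waiting_for_healthy"

def stuck_osds(log_text: str) -> set:
    """Return the set of OSD ids whose log lines contain the marker."""
    ids = set()
    for line in log_text.splitlines():
        if PATTERN in line:
            m = re.search(r"osd\.(\d+)", line)
            if m:
                ids.add(m.group(1))
    return ids

def restart(osd_id: str) -> None:
    # Hypothetical container name scheme; adjust to your cephadm setup.
    subprocess.run(["podman", "restart", f"ceph-osd-{osd_id}"], check=True)

if __name__ == "__main__":
    # Pull the last 5 minutes of ceph unit logs (assumed journald setup).
    logs = subprocess.run(
        ["journalctl", "--since", "-5min", "-u", "ceph*"],
        capture_output=True, text=True).stdout
    for osd in stuck_osds(logs):
        restart(osd)
```

Run from cron every 5 minutes, this mirrors the behaviour described above; a cleaner long-term fix would of course be for the OSD to notice the broken RDMA connection itself.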