Hello, we have occasionally had a problem with an OSD in our all-flash NVMe Ceph cluster.
Our cluster was deployed with cephadm, using Debian Bookworm as the base system and Podman as the container runtime. We are using dual-port Mellanox 100 Gbps adapters:

# lspci | grep Mellanox
43:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
43:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

We use TCP/IP for the front-end network and RDMA for internal communication.

Occasionally (after roughly 2-3 months of normal operation), one OSD container on a machine loses the ability to connect to the other OSD processes while remaining connected to the MON. This leaves it "half-dead": it cannot connect to the other OSDs, the other OSDs cannot connect to it, yet it keeps reporting to the MON that it is alive.

In our experience, this situation can be resolved by restarting the container of the affected OSD. We have a script, run every 5 minutes, which searches the logs for this pattern:

start_waiting_for_healthy

If the pattern is found, we restart the container of the corresponding OSD daemon. After this container restart, the Ceph cluster quickly returns to a healthy state.

I'm attaching a log covering the situation up to the container restart. It was the osd.12 container with PID 3920171. In my opinion there is some problem with opening the InfiniBand connection - some problem in the InfiniBand stack... :-(

I'm open to your questions, although I cannot perform any special operations, because it is our production cluster...

Many thanks for this software :-)

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
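For what it's worth, a watchdog along these lines could be sketched roughly as follows. This is a hypothetical reconstruction, not the author's actual script: the fsid is a placeholder, and the systemd unit naming assumes a standard cephadm deployment where each OSD container runs under a ceph-<fsid>@osd.<id> unit whose output lands in the journal.

```shell
#!/bin/sh
# Hypothetical watchdog sketch for a cephadm cluster (names/paths are assumptions).
# Every 5 minutes (e.g. from cron), scan each OSD unit's recent journal for the
# "half-dead" pattern and restart the matching container via its systemd unit.

PATTERN='start_waiting_for_healthy'
FSID='00000000-0000-0000-0000-000000000000'   # placeholder: your cluster fsid

# --plain/--no-legend give one bare unit name per line in the first column
for unit in $(systemctl list-units --plain --no-legend "ceph-${FSID}@osd.*" \
              | awk '{print $1}'); do
    # Under cephadm, the container's stdout/stderr is captured by journald
    if journalctl -u "$unit" --since "-5 minutes" | grep -q "$PATTERN"; then
        echo "restarting $unit (found $PATTERN)"
        systemctl restart "$unit"
    fi
done
```

Restarting through systemd rather than `podman restart` lets cephadm's unit wrapper recreate the container cleanly.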
Attachment: problem.log.bz2 (binary data)
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io