Hello, we have a ceph cluster which consists of 12 hosts; on every host we have 12 NVMe "disks".
On most of these hosts (9 of 12) we have errors in the logs, see the attached file. We tried to investigate this problem and came to these points:

1) On every host, only one OSD is affected. So it's probably not a problem with version 18.2.2 in general, otherwise it would show up on other OSDs as well, not just on one per host, right?

2) Sometimes one of these OSDs crashes :-( It seems that the crashed OSDs are from the set of OSDs which have this problem.

3) The ceph cluster runs fine and "doesn't know" about any problem with these OSDs. It seems that the new instance of the ceph-osd daemon tries to start either podman or conmon itself. We've checked the PID files for conmon, but they seem to be OK.

4) We checked the 'ceph orch' command, but it does not try to start these containers, because it knows that they already exist and are running ('ceph orch ps' lists these containers as running).

5) I've tried to pause the orchestrator, but I still find these entries in syslog... :-(

Please, is there any way to find out where the problem is and to stop it? (A rough sketch of the commands I can run is in the P.S. below.)

All of the ceph hosts were prepared by ansible, so the environment is the same everywhere. On every machine we have podman version 4.3.1+ds1-8+deb12u1 and conmon version 2.1.6+ds1-1. The OS is Debian bookworm.

The attached log was prepared with:

grep exec_died /var/log/syslog

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
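P.S. To make point 3) and my question more concrete, here is a rough sketch of what I can run on one of the affected hosts to trace where these podman/conmon processes come from. <PID> is just a placeholder for a PID taken from one of the exec_died lines in the attached log, and the ceph commands assume our cephadm-managed cluster:

    # collect the exec_died entries (this is how the attached log was made)
    grep exec_died /var/log/syslog

    # for a PID from one of those lines: which systemd unit owns it,
    # and what does its parent process chain look like?
    systemctl status <PID>
    pstree -p -s <PID>

    # what the orchestrator and the crash module report
    ceph orch ps --daemon-type osd
    ceph crash ls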
Attachment: podman.log.gz (application/gzip)