Hello,

we have a Ceph cluster which consists of 12 hosts; every host has
12 NVMe "disks".

On most of these hosts (9 of 12) we have errors in the logs, see the
attached file.

We tried to investigate this problem, and we have these points:

1) On every host only one OSD is affected. Thus it's probably not a
problem of version 18.2.2 in general, because then it would appear
on other OSDs as well, not only on one OSD per host?

2) Sometimes one of these OSDs crashes :-( It seems that the crashed
OSDs are from the set of OSDs which have this problem.

3) The Ceph cluster runs OK and "doesn't know" about any problem
with these OSDs. It seems that either podman or conmon itself tries
to start this new instance of the ceph-osd daemon. We've tried to
check the PID files for conmon, but they seem to be OK (see the
sketch below).

4) We checked the 'ceph orch' commands, but the orchestrator does
not try to start these containers, because it knows that they exist
and are running ('ceph orch ps' lists these containers as running).

5) I've tried to pause the orchestrator (commands below), but I
still find these entries in syslog... :-(
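
For completeness, this is roughly how we checked the conmon PID
files (a sketch only; the exact commands and the container name are
illustrative):

# where podman expects the conmon PID file for this container
podman inspect --format '{{.ConmonPidFile}}' ceph-<fsid>-osd-0

# the conmon PID podman has recorded for the running container
podman inspect --format '{{.State.ConmonPid}}' ceph-<fsid>-osd-0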
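
The orchestrator commands mentioned in points 4) and 5) were along
these lines:

# list daemons as the orchestrator sees them
ceph orch ps

# pause all orchestrator background activity
ceph orch pause

# and resume it again afterwards
ceph orch resume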

Please, is there any way to find out where the problem is and how
to stop it?

All of our Ceph hosts were prepared by Ansible, so they have the
same environment.

On every machine we have podman version 4.3.1+ds1-8+deb12u1 and
conmon version 2.1.6+ds1-1. The OS is Debian bookworm.

The attached logs were prepared with:

grep exec_died /var/log/syslog

Sincerely
Jan Marek
--  
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
