Hi,
The network used by my 4-node cluster broke down a week ago because of a
driver issue. After the network resumed normal operation, my Ceph 19 cluster
first reported HEALTH_WARN and announced a lengthy recovery process. About
an hour later, however, all I got was this error message:
mixtile@blade3n1:~$ sudo ceph -s
[sudo] password for mixtile:
2026-05-25T13:35:51.685+0200 ffff9701f180 0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)
Restarting all nodes did *not* help. Even worse: the Docker containers
running the various daemons (mon, mgr, crash, …) started disappearing one by
one! Here is what remains:
mixtile@blade3n1:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
2039b18ba392   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   47 minutes ago   Up 47 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n1
16cb4a6822f2   af0c5903e901                              "/usr/bin/ceph-crash…"   47 minutes ago   Up 47 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n1
mixtile@blade3n2:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS   NAMES
42361a694abf   quay.io/prometheus/prometheus:v2.51.0     "/bin/prometheus --c…"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-prometheus-blade3n2
88b085d000a8   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n2
7bb17808bdd8   quay.io/prometheus/alertmanager:v0.25.0   "/bin/alertmanager -…"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-alertmanager-blade3n2
a43cfe36da29   quay.io/ceph/grafana:10.4.0               "/run.sh"                2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-grafana-blade3n2
a95140b707ee   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n2
38d4ca4035b0   af0c5903e901                              "/usr/bin/ceph-mon -…"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-mon-blade3n2
mixtile@blade3n3:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS   NAMES
d664dfe30bd8   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n3
a64ac00dc28b   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n3
f7de98403a10   netdata/netdata                           "/usr/sbin/run.sh"
mixtile@blade3n4:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
d437cec7d6bf   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   54 minutes ago   Up 53 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n4
c6d5ac595857   af0c5903e901                              "/usr/bin/ceph-crash…"   54 minutes ago   Up 53 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n4
As you can see, there are far fewer Ceph-related processes than expected.
The rest haven't merely crashed: the corresponding images have disappeared
as well! Pulling the missing images didn't work:
mixtile@blade3n1:~$ docker run ceph-mon
Unable to find image 'ceph-mon:latest' locally
docker: Error response from daemon: pull access denied for ceph-mon, repository does not exist or may require 'docker login'
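
To make the per-host listings above easier to compare, here is a minimal sketch of a shell helper that reduces a list of container names to the daemon types still present. It assumes the names follow the ceph-<fsid>-<type>-<id> pattern visible in the listings and that the trailing host or daemon id contains no dash; the function name daemon_types is hypothetical and not part of Ceph or Docker.

```shell
# daemon_types: read container names on stdin, print the distinct daemon types.
# Assumption: names look like ceph-<36-character fsid>-<type>-<id>, and <id>
# (a hostname or daemon id) contains no dash.
# Names that do not match (e.g. non-Ceph containers) are silently skipped.
daemon_types() {
  sed -n 's/^ceph-[0-9a-f-]\{36\}-\(.*\)-[^-]*$/\1/p' | sort -u
}
```

On each host it could be fed from Docker, for example: docker ps --format '{{.Names}}' | daemon_types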
This is my exact system and Ceph version BTW:
mixtile@blade3n1:~$ ceph -v
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
mixtile@blade3n1:~$ uname -a
Linux blade3n1 6.1.0-1027-rockchip #27 SMP Sun Apr 27 01:54:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
The drives my data is stored on still seem to be there (according to lsblk),
but as almost all OSD processes are gone, I can no longer access them. I've
got four hosts, of which #1 is the admin node. #2 also hosts Ganesha NFS
for external clients.
So: What can I do to bring my cluster back to life without endangering my
data? Thank you.
Kind regards
Jacek Rużyczka
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]