[ceph-users] Re: Cluster Dead After Network Failure. Connection Timeout.

Eugen Block via ceph-users Wed, 27 May 2026 10:31:47 -0700

Please don't drop the list from your responses; you'd benefit from more people reading it.

The cephadm ls output isn't really helpful; you need to figure out why docker doesn't start. Syslog, journald, dmesg or whatever else is available should give some clue. To me it sounds like more has been going on besides the network outage: maybe some leftovers from previous deployments or tests, or something else that "confuses" docker? Maybe you have both podman and docker installed and, since cephadm prefers podman, the containers fail to start?
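
A quick way to check the "both podman and docker installed" theory is to see which engine binaries are on PATH. This is only a minimal sketch: the function name is made up for illustration, and finding a binary says nothing about whether its daemon actually runs.

```python
import shutil

# Engine binaries to look for. The order follows the statement in this
# thread that cephadm prefers podman over docker.
ENGINES = ("podman", "docker")


def find_container_engines(which=shutil.which):
    """Return {engine: path} for every engine binary found on PATH.

    `which` is injectable so the lookup can be exercised without
    touching the real system (hypothetical helper, not part of cephadm).
    """
    found = {}
    for name in ENGINES:
        path = which(name)
        if path:
            found[name] = path
    return found


if __name__ == "__main__":
    engines = find_container_engines()
    for name, path in engines.items():
        print(f"{name}: {path}")
    if len(engines) > 1:
        print("Both engines are installed; check which one cephadm uses.")
```

If both show up, that would at least be worth ruling out before digging into the core dump itself.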


Quoting Jacek Rużyczka <[email protected]>:

I can't run the Ceph container on the master node (blade3n1) anymore. It no
longer starts, and I get no error message. Here is what cephadm ls says:

mixtile@blade3n1:~$ sudo cephadm ls
[
   {
       "style": "cephadm:v1",
       "name": "mon.blade3n1",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "[email protected]",
       "enabled": true,
       "state": "error",
       "service_name": "mon",
       "memory_request": null,
       "memory_limit": null,
       "ports": [],
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph:v19",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-16T14:35:47.634066Z",
       "deployed": "2026-04-16T14:35:45.414037Z",
       "configured": "2026-04-20T17:16:32.722329Z"
   },
   {
       "style": "cephadm:v1",
       "name": "node-exporter.blade3n1",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b@node-exporter.blade3n1",
       "enabled": true,
       "state": "error",
       "service_name": "node-exporter",
       "ports": [
           9100
       ],
       "ip": null,
       "deployed_by": [
           "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee"
       ],
       "rank": null,
       "rank_generation": null,
       "extra_container_args": null,
       "extra_entrypoint_args": null,
       "memory_request": null,
       "memory_limit": null,
       "container_id": null,
       "container_image_name": "quay.io/prometheus/node-exporter:v1.7.0",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-21T12:55:39.731035Z",
       "deployed": "2026-04-21T12:55:38.217675Z",
       "configured": "2026-04-21T12:55:39.734369Z"
   },
   {
       "style": "cephadm:v1",
       "name": "ceph-exporter.blade3n1",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b@ceph-exporter.blade3n1",
       "enabled": true,
       "state": "error",
       "service_name": "ceph-exporter",
       "ports": [],
       "ip": null,
       "deployed_by": [
           "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee"
       ],
       "rank": null,
       "rank_generation": null,
       "extra_container_args": null,
       "extra_entrypoint_args": null,
       "memory_request": null,
       "memory_limit": null,
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-16T14:37:32.218782Z",
       "deployed": "2026-04-16T14:37:30.612094Z",
       "configured": "2026-04-20T17:16:36.139048Z"
   },
   {
       "style": "cephadm:v1",
       "name": "mgr.blade3n1.rrlwwv",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "[email protected]",
       "enabled": true,
       "state": "error",
       "service_name": "mgr",
       "memory_request": null,
       "memory_limit": null,
       "ports": [
           9283,
           8765,
           8443
       ],
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph:v19",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-16T14:35:54.054151Z",
       "deployed": "2026-04-16T14:35:52.430796Z",
       "configured": "2026-04-20T17:16:37.612403Z"
   },
   {
       "style": "cephadm:v1",
       "name": "crash.blade3n1",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "[email protected]",
       "enabled": true,
       "state": "error",
       "service_name": "crash",
       "ports": [],
       "ip": null,
       "deployed_by": [
           "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee"
       ],
       "rank": null,
       "rank_generation": null,
       "extra_container_args": null,
       "extra_entrypoint_args": null,
       "memory_request": null,
       "memory_limit": null,
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-16T14:37:36.855510Z",
       "deployed": "2026-04-16T14:37:35.268822Z",
       "configured": "2026-04-20T17:16:39.025758Z"
   },
   {
       "style": "cephadm:v1",
       "name": "osd.3",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "[email protected]",
       "enabled": true,
       "state": "error",
       "service_name": "osd",
       "ports": [],
       "ip": null,
       "deployed_by": [
           "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee"
       ],
       "rank": null,
       "rank_generation": null,
       "extra_container_args": null,
       "extra_entrypoint_args": null,
       "memory_request": null,
       "memory_limit": null,
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-23T15:05:00.686688Z",
       "deployed": "2026-04-23T15:04:59.176667Z",
       "configured": "2026-04-23T15:05:00.693355Z"
   },
   {
       "style": "cephadm:v1",
       "name": "mds.data.blade3n1.eczeqc",
       "fsid": "8aad3073-39a1-11f1-bf6e-f2704a1efa9b",
       "systemd_unit": "[email protected]",
       "enabled": true,
       "state": "error",
       "service_name": "mds.data",
       "ports": [],
       "ip": null,
       "deployed_by": [
           "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee"
       ],
       "rank": null,
       "rank_generation": null,
       "extra_container_args": null,
       "extra_entrypoint_args": null,
       "memory_request": null,
       "memory_limit": null,
       "container_id": null,
       "container_image_name": "quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee",
       "container_image_id": null,
       "container_image_digests": null,
       "version": null,
       "started": null,
       "created": "2026-04-16T15:54:13.264224Z",
       "deployed": "2026-04-16T15:54:10.870858Z",
       "configured": "2026-04-20T17:16:40.499113Z"
   }
]
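
For anyone who wants the short version of a listing like this one, here is a rough sketch that reduces the JSON to the daemons that are in state "error" and have no container. The field names ("name", "state", "container_id", "container_image_name") are taken from the output above; the function name and the script name in the usage note are made up for illustration.

```python
import json
import sys


def failed_daemons(cephadm_ls_json):
    """Return (name, image) pairs for daemons in state "error" that have
    no container_id, given the JSON text printed by `cephadm ls`."""
    daemons = json.loads(cephadm_ls_json)
    return [
        (d["name"], d.get("container_image_name"))
        for d in daemons
        if d.get("state") == "error" and d.get("container_id") is None
    ]


if __name__ == "__main__":
    # Reads the JSON from stdin so the output can be piped in directly.
    for name, image in failed_daemons(sys.stdin.read()):
        print(f"{name}\t{image}")
```

Assuming it is saved as failed_daemons.py (a hypothetical name), it could be fed with: sudo cephadm ls | python3 failed_daemons.py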


On Wed, 27 May 2026 at 15:07, Jacek Rużyczka <[email protected]> wrote:

Hi Eugen,

> You might need to run 'systemctl reset-failed...' to let systemd start the
> containers.

I've already done that. No use. Even worse: On node #1, Docker no longer
starts. When trying to restart the daemon, I get errors like this:

docker.service: Failed with result 'core-dump'.

> But before you do that, do you have MON logs with an explanation why they
> refuse to start?

Unfortunately no, not even in the syslog. In the meantime, I was able to
start another MON via Cephadm (because the Docker instance had even deleted
the image), but now I've got the problem with the one node, where Docker
refuses to start.

> Regarding Ceph images, your cluster uses af0c5903e901 for the Ceph
> services, what does 'docker images | grep af0c5903e901' show?

On the affected node, nothing, because the Docker daemon won't even start.

> I have the impression that this is a "regular" cephadm cluster

True

BTW, when running the test script supplied by the Docker guys
(https://docs.docker.com/engine/daemon/troubleshoot/), I get some warnings:

- Network Drivers:
 - "bridge":
   - sysctl net.ipv4.ip_forward: disabled
   - sysctl net.ipv6.conf.all.forwarding: disabled
   - sysctl net.ipv6.conf.default.forwarding: disabled

On nodes #2 through #4, net.ipv4.ip_forward is enabled.
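
To compare these three settings across the nodes without running the whole check script, a small sketch like the following could be run on each node. It assumes the usual Linux layout where a sysctl key such as net.ipv4.ip_forward is exposed as a file under /proc/sys; the function name is made up for illustration.

```python
from pathlib import Path

# The three sysctl keys named in the warnings above.
FORWARDING_KEYS = (
    "net.ipv4.ip_forward",
    "net.ipv6.conf.all.forwarding",
    "net.ipv6.conf.default.forwarding",
)


def read_forwarding(proc_sys="/proc/sys"):
    """Return {sysctl key: raw value string, or None if unreadable}.

    Assumes each key maps to a file under proc_sys with the dots
    replaced by path separators.
    """
    result = {}
    for key in FORWARDING_KEYS:
        path = Path(proc_sys, *key.split("."))
        try:
            result[key] = path.read_text().strip()
        except OSError:
            result[key] = None  # missing file or no permission
    return result


if __name__ == "__main__":
    for key, value in read_forwarding().items():
        print(f"{key} = {value}")
```

A value of 0 for net.ipv4.ip_forward would match the "disabled" warning reported for node #1.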

Regards
Jacek Rużyczka



_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
