OK. I couldn't find a quick way to shovel a largish file from an internal server into pastebin, but my own servers can suffice.

The URL is:

https://www.mousetech.com/share/cephadm.log

And I don't see a deployment either.


On 3/26/25 14:26, Eugen Block wrote:
That would be the correct log file, but I don't see an attempt to deploy a prometheus instance there. You can use any pastebin you like, e.g. https://pastebin.com/, to upload your logs. Mask any sensitive data before you do that.


Quoting Tim Holloway <t...@mousetech.com>:

Well, here's an excerpt from /var/log/ceph/cephadm.log. I don't know if that's the mechanism or file you mean, though.


2025-03-26 13:11:09,382 7fb2abc38740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:12:10,219 7fc4fd405740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:13:11,502 7f2ef3c76740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:14:12,372 7f3566bef740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:13,301 7f660e204740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:20,880 7f93b227e740 DEBUG --------------------------------------------------------------------------------
cephadm ['ls']
2025-03-26 13:15:20,904 7f93b227e740 DEBUG /usr/bin/podman: 5.2.2
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce,11.51MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 65529d6ad1ac,17.69MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 51b1d190dfb9,99.79MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5,6.791MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: dd3203f6f3bb,410.2MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 34177c4e5761,1.764GB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: bfe17e83b288,534.2MB / 33.24GB
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce,0.00%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 65529d6ad1ac,0.26%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 51b1d190dfb9,0.22%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5,0.02%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: dd3203f6f3bb,0.86%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 34177c4e5761,1.67%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: bfe17e83b288,0.25%
2025-03-26 13:15:20,985 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:20,993 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,024 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66,quay.io/prometheus/node-exporter:v1.5.0,0da6a335fe1356545476b749c68f022c897de3a2139e8f0054f6937349ee2b83,2025-03-25 16:52:31.644234532 -0400 EDT,
2025-03-26 13:15:21,057 7f93b227e740 DEBUG /usr/bin/podman: [quay.io/prometheus/node-exporter@sha256:39c642b2b337e38c18e80266fb14383754178202f40103646337722a594d984c quay.io/prometheus/node-exporter@sha256:fa8e5700b7762fffe0674e944762f44bb787a7e44d97569fe55348260453bf80]
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   build user:       root@6e7732a7b81b
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   build date:       20221129-18:59:09
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   go version:       go1.19.3
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   platform:         linux/amd64
2025-03-26 13:15:21,187 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,196 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,228 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5e86f6caed8278aec0cfed608bf89ff4953dfb48b762138955925,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.731849052 -0400 EDT,
2025-03-26 13:15:21,260 7f93b227e740 DEBUG /usr/bin/podman: [quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906 quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77]
2025-03-26 13:15:21,385 7f93b227e740 DEBUG /usr/bin/podman: ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
2025-03-26 13:15:21,412 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,421 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,451 7f93b227e740 DEBUG /usr/bin/podman: bfe17e83b28821be0ec399cde79965ade3bc3377c5acf05ef047395ddde4d804,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-26 06:53:07.022104802 -0400 EDT,
2025-03-26 13:15:21,464 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,472 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,504 7f93b227e740 DEBUG /usr/bin/podman: 51b1d190dfb9a1db73b8efda020c54df4c339abce8973b8e0d6de2a2b780aa09,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.726614643 -0400 EDT,
2025-03-26 13:15:21,516 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,524 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,557 7f93b227e740 DEBUG /usr/bin/podman: dd3203f6f3bb3876ea35d8732c01211bb9cc79bff2258a7d63f425eb00e0221d,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.898369305 -0400 EDT,
2025-03-26 13:15:21,570 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,579 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,611 7f93b227e740 DEBUG /usr/bin/podman: 34177c4e5761c9b1e232a7f4a854fa1c8fe187253503265998c9cadd2cb7625c,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:33.635739799 -0400 EDT,
2025-03-26 13:15:21,623 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,632 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,662 7f93b227e740 DEBUG /usr/bin/podman: 65529d6ad1ac3c639ef699c2eed01b6a440e27925d5bccd5fb0eef50b283dab3,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.726574789 -0400 EDT,
2025-03-26 13:16:14,190 7fa738df4740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:17:15,057 7f906b406740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:18:15,951 7f3141a37740 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906', '--no-container-init', '--timeout', '895', 'ls']
2025-03-26 13:18:17,047 7feb94c28740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:19:18,797 7f23b641f740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:20:19,681 7f270b666740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:21:20,566 7fcd77be8740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'check-host']
2025-03-26 13:21:20,595 7fcd77be8740 INFO podman (/usr/bin/podman) version 5.2.2 is present
2025-03-26 13:21:20,595 7fcd77be8740 INFO systemctl is present
2025-03-26 13:21:20,596 7fcd77be8740 INFO lvcreate is present
2025-03-26 13:21:20,635 7fcd77be8740 INFO Unit chronyd.service is enabled and running
2025-03-26 13:21:20,635 7fcd77be8740 INFO Host looks OK
2025-03-26 13:21:21,016 7f670722d740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:22:21,860 7f29c27cb740 DEBUG --------------------------------------------------------------------------------
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:23:23,116 7f41cdc5d740 DEBUG --------------------------------------------------------------------------------
:
plus more of the same.

The mgr log for dell02 isn't very exciting except for frequent exceptions where the dashboard cannot contact prometheus.

Is there a place I could post complete files without filling up the mailing list?

On 3/26/25 13:23, Eugen Block wrote:
Ok, I'll try one last time and ask for cephadm.log output. ;-) And the active MGR's log might help here as well.
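If it's easier than copying files around, something like this on dell02 should pull the active MGR's log directly (the daemon name is taken from your earlier excerpt; the fsid is a placeholder):

cephadm logs --name mgr.dell02.zwnrme
# or straight from journald:
journalctl -u ceph-<fsid>@mgr.dell02.zwnrme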

Quoting Tim Holloway <t...@mousetech.com>:

No change.

On 3/26/25 13:01, Tim Holloway wrote:
It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know.

All I do know is that immediately after editing the proxy settings, I got indications that those two OSDs had gone down.

What's REALLY strange is that their logs seem to hint that somehow they shifted from administered to legacy configuration. That is, looking for OSD resources under /var/lib/ceph instead of /var/lib/ceph/<fsid>.
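If I'm reading the tooling right, "cephadm ls" on the affected hosts reports a "style" field for each daemon ("cephadm:v1" vs. "legacy"), so something like this should show whether they really flipped (the grep is just an illustration):

cephadm ls | grep -E '"style"|"name"'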

Anyway, I'll try yanking and re-deploying prometheus and maybe that will magically cure something.

On 3/26/25 12:53, Eugen Block wrote:
Right, systemctl edit works as well. But I'm confused about the down OSDs. Did you set the proxy on all hosts? Because the down OSDs are on ceph06 while prometheus is supposed to run on dell02. Are you sure those are related?

I would recommend removing the prometheus service entirely and starting from scratch:

ceph orch rm prometheus
ceph mgr module disable prometheus
ceph mgr fail

Wait a minute, then enable it again and deploy prometheus:

ceph orch apply -i prometheus.yaml
ceph mgr module enable prometheus
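
Once the orchestrator has had a moment, something along these lines should confirm that a daemon actually got placed (flag spelling from memory, so adjust if needed):

ceph orch ls prometheus
ceph orch ps --service_name prometheus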



Quoting Tim Holloway <t...@mousetech.com>:

Since the containers are all podman, I found a "systemctl edit podman" command that's recommended for setting the proxy for that.
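For reference, the kind of override that gets suggested is roughly the following; the proxy host and no_proxy list here are placeholders, not my real ones:

[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1,.internal.mousetech.com"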

However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning "HEALTH_OK".

Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (865 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest setting it in containers.conf:

cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=<host>:<port>", "https_proxy=<host>:<port>", "no_proxy=<your_no_proxy_list>"]

But again, you should be able to see a failed pull in the cephadm.log on dell02, or even in 'ceph health detail'; it usually warns you if the orchestrator failed to place a daemon.
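
A quick way to check is something like this (assuming the default log location you mentioned):

grep -iE 'pull|error' /var/log/ceph/cephadm.log | tail -n 50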

Quoting Tim Holloway <t...@mousetech.com>:

One thing I did run into when upgrading was TLS issues pulling images. I had to set HTTP/S_PROXY and pull manually.

That may relate to this:

2025-03-26T10:52:16.547985+0000 mgr.dell02.zwnrme (mgr.18015288) 23874 : cephadm [INF] Saving service prometheus spec with placement dell02.mousetech.com
2025-03-26T10:52:16.560810+0000 mgr.dell02.zwnrme (mgr.18015288) 23875 : cephadm [INF] Saving service node-exporter spec with placement *
2025-03-26T10:52:16.572380+0000 mgr.dell02.zwnrme (mgr.18015288) 23876 : cephadm [INF] Saving service alertmanager spec with placement dell02.mousetech.com
2025-03-26T10:52:16.583555+0000 mgr.dell02.zwnrme (mgr.18015288) 23878 : cephadm [INF] Saving service grafana spec with placement dell02.mousetech.com
2025-03-26T10:52:16.601713+0000 mgr.dell02.zwnrme (mgr.18015288) 23879 : cephadm [INF] Saving service ceph-exporter spec with placement *
2025-03-26T10:52:44.139886+0000 mgr.dell02.zwnrme (mgr.18015288) 23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+0000 mgr.ceph08.tlocfi (mgr.18043792) 7 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+0000 mgr.ceph08.tlocfi (mgr.18043792) 8 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on http://10.0.1.58:8765
2025-03-26T10:53:02.962314+0000 mgr.ceph08.tlocfi (mgr.18043792) 9 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on https://10.0.1.58:7150
2025-03-26T10:53:02.962805+0000 mgr.ceph08.tlocfi (mgr.18043792) 10 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+0000 mgr.ceph08.tlocfi (mgr.18043792) 11 : cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in HTTPServer.serve
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1823, in serve
self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 203, in run
    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 246, in _run
    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 300, in _from_server_socket
    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line 277, in wrap
    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "/lib64/python3.9/ssl.py", line 1343, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:1133)

2025-03-26T10:53:03.471114+0000 mgr.ceph08.tlocfi (mgr.18043792) 12 : cephadm [INF] Updating ceph03.internal.mousetech.com:/etc/ceph/ceph.conf

On 3/26/25 11:39, Eugen Block wrote:
Then maybe the deployment did fail and we're back to looking into the cephadm.log.


Quoting Tim Holloway <t...@mousetech.com>:

It returns nothing. I'd already done the same via "systemctl | grep prometheus". There simply isn't a systemd service, even though there should be.

On 3/26/25 11:31, Eugen Block wrote:
There's a service called "prometheus", which can have multiple daemons, just like any other service (mon, mgr, etc.). To get the daemon logs, you need to provide the daemon name (e.g. prometheus.ceph02.<random suffix>), not just the service name (prometheus).
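To make that concrete (the suffix after the hostname is generated, so yours will differ):

ceph orch ps --daemon_type prometheus              # shows the full daemon name
ceph orch daemon restart prometheus.dell02.xxxxx   # acts on one daemon
ceph orch restart prometheus                       # acts on the whole service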

Can you run the cephadm command I provided? It should show something like what I pasted in the previous message.

Quoting Tim Holloway <t...@mousetech.com>:

service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No daemons exist under service name "prometheus". View currently running services using "ceph orch ls""

And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:
ceph orch ls prometheus --export
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io