[ceph-users] Re: Ceph orch placement - anti affinity

2025-03-26 Thread Eugen Block
If you don't specify "count_per_host", the orchestrator won't deploy  
multiple daemons of the same service on one host. There's no way (that  
I'm aware of) to specify a primary daemon. Since standby daemons need  
to be able to take over the workload, they should all be equally equipped.
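
For illustration, a minimal sketch of such a spec (the filesystem name
and hostnames below are placeholders, not taken from this thread); with
count_per_host left unset, the daemons are spread with at most one per
listed host:

service_type: mds
service_id: cephfs1
placement:
  hosts:
  - host1
  - host2
  - host3
  count: 2

Which of the resulting MDS daemons becomes active for a filesystem is
decided by the monitors, not by the placement spec, which is why there
is no "primary host" option.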


Quoting Kasper Rasmussen:


Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts.

I use ceph orch apply mds to spin up the MDS daemons.

Is there a way to ensure that I don’t get two active MDS running on  
the same host?


I mean, when using the ceph orch apply mds command I can specify  
--placement, but it seems I can only define the hosts, not, let's say,  
a primary host or something similar.


Anyone with knowledge of how to do this or is it simply not possible?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway

service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No 
daemons exist under service name "prometheus". View currently running 
services using "ceph orch ls""


And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:

ceph orch ls prometheus --export

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Then maybe the deployment did fail and we’re back at looking into the  
cephadm.log.



Quoting Tim Holloway:

it returns nothing. I'd already done the same via "systemctl | grep  
prometheus". There simply isn't a systemd service, even though there  
should be.


On 3/26/25 11:31, Eugen Block wrote:
There's a service called "prometheus", which can have multiple  
daemons, just like any other service (mon, mgr, etc.). To get the  
daemon logs you need to provide the daemon name  
(e.g. prometheus.ceph02.mousetech.com), not just the service name (prometheus).


Can you run the cephadm command I provided? It should show  
something like I pasted in the previous message.


Quoting Tim Holloway:


service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL:  
No daemons exist under service name "prometheus". View currently  
running services using "ceph orch ls""


And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:

ceph orch ls prometheus --export

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
If you need a proxy to pull the images, I suggest to set it in the  
containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:",  
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the  
cephadm.log on dell02. Or even in 'ceph health detail'; usually it  
warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling  
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288)  
23874 : cephadm [INF] Saving service prometheus spec with placement  
dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme (mgr.18015288)  
23875 : cephadm [INF] Saving service node-exporter spec with  
placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme (mgr.18015288)  
23876 : cephadm [INF] Saving service alertmanager spec with  
placement dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme (mgr.18015288)  
23878 : cephadm [INF] Saving service grafana spec with placement  
dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme (mgr.18015288)  
23879 : cephadm [INF] Saving service ceph-exporter spec with  
placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme (mgr.18015288)  
23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi (mgr.18043792) 7 :  
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi (mgr.18043792) 8 :  
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on  
http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi (mgr.18043792) 9 :  
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on  
https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi (mgr.18043792) 10  
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi (mgr.18043792) 11  
: cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in  
HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1823, in serve
    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line  
277, in wrap

    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "/lib64/python3.9/ssl.py", line 1343, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF)  
(_ssl.c:1133)


2025-03-26T10:53:03.471114+ mgr.ceph08.tlocfi (mgr.18043792) 12  
: cephadm [INF] Updating  
ceph03.internal.mousetech.com:/etc/ceph/ceph.conf


On 3/26/25 11:39, Eugen Block wrote:
Then maybe the deployment did fail and we’re back at looking into  
the cephadm.log.



Quoting Tim Holloway:

it returns nothing. I'd already done the same via "systemctl |  
grep prometheus". There simply isn't a systemd service, even  
though there should be.


On 3/26/25 11:31, Eugen Block wrote:
There's a service called "prometheus", which can have multiple  
daemons, just like any other service (mon, mgr, etc.). To get the  
daemon logs you need to provide the daemon name  
(e.g. prometheus.ceph02.mousetech.com), not just the service name  
(prometheus).


Can you run the cephadm command I provided? It should show  
something like I pasted in the previous message.


Quoting Tim Holloway:


service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL:  
No daemons exist under service name "prometheus". View currently  
running services using "ceph orch ls""


And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:

ceph orch ls prometheus --export

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
That would be the correct log file, but I don't see an attempt to  
deploy a prometheus instance there. You can use any pastebin you like,  
e.g. https://pastebin.com/, to upload your logs. Mask any sensitive  
data before you do that.



Quoting Tim Holloway:

Well, here's an excerpt from the /var/log/ceph/cephadm.log. I don't  
know if that's the mechanism or file you mean, though.



2025-03-26 13:11:09,382 7fb2abc38740 DEBUG  


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:12:10,219 7fc4fd405740 DEBUG  


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:13:11,502 7f2ef3c76740 DEBUG  


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:14:12,372 7f3566bef740 DEBUG  


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:13,301 7f660e204740 DEBUG  


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:20,880 7f93b227e740 DEBUG  


cephadm ['ls']
2025-03-26 13:15:20,904 7f93b227e740 DEBUG /usr/bin/podman: 5.2.2
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
2149e16fa2ce,11.51MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
65529d6ad1ac,17.69MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
51b1d190dfb9,99.79MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
59a865e3bcc5,6.791MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
dd3203f6f3bb,410.2MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
34177c4e5761,1.764GB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman:  
bfe17e83b288,534.2MB / 33.24GB
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
2149e16fa2ce,0.00%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
65529d6ad1ac,0.26%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
51b1d190dfb9,0.22%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
59a865e3bcc5,0.02%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
dd3203f6f3bb,0.86%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
34177c4e5761,1.67%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman:  
bfe17e83b288,0.25%

2025-03-26 13:15:20,985 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:20,993 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,024 7f93b227e740 DEBUG /usr/bin/podman:  
2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66,quay.io/prometheus/node-exporter:v1.5.0,0da6a335fe1356545476b749c68f022c897d

e3a2139e8f0054f6937349ee2b83,2025-03-25 16:52:31.644234532 -0400 EDT,
2025-03-26 13:15:21,057 7f93b227e740 DEBUG /usr/bin/podman:  
[quay.io/prometheus/node-exporter@sha256:39c642b2b337e38c18e80266fb14383754178202f40103646337722a594d984c  
quay.io/prometheus/node-exporter@sh

a256:fa8e5700b7762fffe0674e944762f44bb787a7e44d97569fe55348260453bf80]
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:  
node_exporter, version 1.5.0 (branch: HEAD, revision:  
1b48970ffcf5630534fb00bb0687d73c66d1c959)
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build  
user:   root@6e7732a7b81b
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build  
date:   20221129-18:59:09
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   go  
version:   go1.19.3
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:  
platform: linux/amd64

2025-03-26 13:15:21,187 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,196 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,228 7f93b227e740 DEBUG /usr/bin/podman:  
59a865e3bcc5e86f6caed8278aec0cfed608bf89ff4953dfb48b762138955925,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c31
72b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.731849052 -0400  
EDT,
2025-03-26 13:15:21,260 7f93b227e740 DEBUG /usr/bin/podman:  
[quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906  
quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace85

53330e454152b82f95a2b6bf33c3f3ec2eeac77]
2025-03-26 13:15:21,385 7f93b227e740 DEBUG /usr/bin/podman: ceph  
version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef  
(stable)

:2025-03-26 13:15:21,412 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,421 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,451 7

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I don't think there is a failure to deploy. For one thing, I did have, as 
mentioned, 3 Prometheus-related containers running at one point on the 
machine. I also checked for port issues and there are none. Nothing 
listens on 9095.


One thing that does concern me is that the docs say changes in settings 
require "restarting prometheus", but don't say what command does that. Given 
that there are no systemd units to address and that the orchestrator 
claims there is no "prometheus" service even as it shows that 
there's 1 service, stopped, it's quite frustrating.


On 3/26/25 07:26, Eugen Block wrote:
The cephadm.log should show some details why it fails to deploy the 
daemon. If there's not much, look into the daemon logs as well 
(cephadm logs --name prometheus.ceph02.mousetech.com). Could it be 
that there's a non-cephadm prometheus already listening on port 9095?


Quoting Tim Holloway:

I finally got brave and migrated from Pacific to Reef, did some 
banging and hammering and for the first time in a long time got a 
complete "HEALTH OK" status.


However, the dashboard is still not happy. It cannot contact the 
Prometheus API on port 9095.


I have redeployed Prometheus multiple times without result.

I'm pretty sure that at one time there were no less than 3 different 
Prometheus containers running on one of the configured Prometheus 
servers, but now all I can get is the node-exporter.


Worse, if I do:

ceph orch reconfig prometheus

I get:

Error EINVAL: No daemons exist under service name "prometheus". View 
currently running services using "ceph orch ls"


But if I do:

ceph orch ls

I get:

prometheus ?:9095   0/1 -  
116s  ceph02.mousetech.com


Suggestions?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
There's a service called "prometheus", which can have multiple  
daemons, just like any other service (mon, mgr, etc.). To get the daemon  
logs you need to provide the daemon name (e.g. prometheus.ceph02.mousetech.com),  
not just the service name (prometheus).
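
For example (the daemon name below is the one mentioned earlier in this
thread; on another cluster, take it from the NAME column of 'ceph orch ps'):

ceph orch ls prometheus                    # service-level view
ceph orch ps | grep prometheus             # daemon-level view, shows full daemon names
cephadm logs --name prometheus.ceph02.mousetech.com
ceph orch daemon restart prometheus.ceph02.mousetech.com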


Can you run the cephadm command I provided? It should show something  
like I pasted in the previous message.


Quoting Tim Holloway:


service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No  
daemons exist under service name "prometheus". View currently  
running services using "ceph orch ls""


And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:

ceph orch ls prometheus --export

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Since the containers are all podman, I found a "systemctl edit podman" 
command that's recommended to set proxy for that.
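
For reference, 'systemctl edit podman' drops in an override roughly like
the sketch below; the proxy host, port and no_proxy list here are
placeholders, not the values from my cluster:

# /etc/systemd/system/podman.service.d/override.conf
[Service]
Environment="http_proxy=http://proxy.example.com:3128"
Environment="https_proxy=http://proxy.example.com:3128"
Environment="no_proxy=localhost,127.0.0.1"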


However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning 
"HEALTH OK".


Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed: 
gaierror(-2, 'Name or service not known'); too many PGs per OSD (865 > 
max 560)

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 
'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not 
known')

[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest to set it in the 
containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:", 
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the 
cephadm.log on dell02. Or even in 'ceph health detail'; usually it 
warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling 
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288) 23874 
: cephadm [INF] Saving service prometheus spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme (mgr.18015288) 
23875 : cephadm [INF] Saving service node-exporter spec with placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme (mgr.18015288) 
23876 : cephadm [INF] Saving service alertmanager spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme (mgr.18015288) 
23878 : cephadm [INF] Saving service grafana spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme (mgr.18015288) 
23879 : cephadm [INF] Saving service ceph-exporter spec with placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme (mgr.18015288) 
23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi (mgr.18043792) 7 : 
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi (mgr.18043792) 8 : 
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi (mgr.18043792) 9 : 
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi (mgr.18043792) 10 : 
cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi (mgr.18043792) 11 : 
cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1823, 
in serve

    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line 
277, in wrap

    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "/lib64/python3.9/ssl.py", line 1343, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) 
(_ssl.c:1133)


2025-03-26T10:53:03.471114+ mgr.ceph08.tlocfi (mgr.18043792) 12 : 
cephadm [INF] Updating ceph03.internal.mousetech.com:/etc/ceph/ceph.conf


On 3/26/25 11:39, Eugen Block wrote:
Then maybe the deployment did fail and we’re back at looking into 
the cephadm.log.



Quoting Tim Holloway:

it returns nothing. I'd already done the same via "systemctl | grep 
prometheus". There simply isn't a systemd service, even though 
there should be.


On 3/26/25 11:31, Eugen Block wrote:
There's a service called "prometheus", which can have multiple 
daemons, just like any other service (mon, mgr, etc.). To get the 
daemon logs you need to provide the daemon name 
(e.g. prometheus.ceph02.mousetech.com), not just the service name (prometheus).


Can you run the cephadm command I provided? It should show 
something like I pasted in the previous message.


Quoting Tim Holloway:


service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway

Also, here are the currently-installed container images:

[root@dell02 ~]# podman image ls
REPOSITORY                        TAG      IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                          2bc0b0f4375d  8 months ago   1.25 GB
quay.io/ceph/ceph                          3c4eff6082ae  10 months ago  1.22 GB
quay.io/ceph/ceph-grafana         9.4.7    954c08fa6188  15 months ago  647 MB
quay.io/prometheus/alertmanager   v0.25.0  c8568f914cd2  2 years ago    66.5 MB
quay.io/prometheus/node-exporter  v1.5.0   0da6a335fe13  2 years ago    23.9 MB
quay.io/ceph/ceph-grafana         8.3.5    dad864ee21e9  2 years ago    571 MB
quay.io/prometheus/node-exporter  v1.3.1   1dbe0e931976  3 years ago    22.3 MB
quay.io/prometheus/alertmanager   v0.23.0  ba2b418f427c  3 years ago    58.9 MB



On 3/26/25 11:36, Tim Holloway wrote:
it returns nothing. I'd already done the same via "systemctl | grep 
prometheus". There simply isn't a systemd service, even though there 
should be.


On 3/26/25 11:31, Eugen Block wrote:
There's a service called "prometheus", which can have multiple 
daemons, just like any other service (mon, mgr, etc.). To get the 
daemon logs you need to provide the daemon name 
(e.g. prometheus.ceph02.mousetech.com), not just the service name (prometheus).


Can you run the cephadm command I provided? It should show something 
like I pasted in the previous message.


Quoting Tim Holloway:


service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No 
daemons exist under service name "prometheus". View currently 
running services using "ceph orch ls""


And yet, ceph orch ls shows prometheus as a service.

On 3/26/25 11:13, Eugen Block wrote:

ceph orch ls prometheus --export

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph orch placement - anti affinity

2025-03-26 Thread Kasper Rasmussen
Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts.

I use ceph orch apply mds to spin up the MDS daemons.

Is there a way to ensure that I don’t get two active MDS running on the same 
host?

I mean, when using the ceph orch apply mds command I can specify --placement, 
but it seems I can only define the hosts, not, let's say, a primary host 
or something similar.

Anyone with knowledge of how to do this or is it simply not possible?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
The cephadm.log should show some details why it fails to deploy the  
daemon. If there's not much, look into the daemon logs as well  
(cephadm logs --name prometheus.ceph02.mousetech.com). Could it be  
that there's a non-cephadm prometheus already listening on port 9095?
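
A quick check on that host (assuming the ss utility is available):

ss -tlnp | grep 9095

shows whether anything is bound to that port and which process owns it.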


Quoting Tim Holloway:

I finally got brave and migrated from Pacific to Reef, did some  
banging and hammering and for the first time in a long time got a  
complete "HEALTH OK" status.


However, the dashboard is still not happy. It cannot contact the  
Prometheus API on port 9095.


I have redeployed Prometheus multiple times without result.

I'm pretty sure that at one time there were no less than 3 different  
Prometheus containers running on one of the configured Prometheus  
servers, but now all I can get is the node-exporter.


Worse, if I do:

ceph orch reconfig prometheus

I get:

Error EINVAL: No daemons exist under service name "prometheus". View  
currently running services using "ceph orch ls"


But if I do:

ceph orch ls

I get:

prometheus ?:9095   0/1 -   
116s  ceph02.mousetech.com


Suggestions?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Can you share 'ceph orch ls prometheus --export'? And if it has been  
deployed successfully but is currently not running, the logs should  
show why that is the case.


To restart prometheus, you can just run this to restart the entire  
prometheus service (which would include all instances if you had  
multiple, but currently you only have one in the specs):


ceph orch restart prometheus

or only a specific daemon:

ceph orch daemon restart prometheus.<hostname>

And usually, cephadm does create systemd units, for example:

cephadm ls --no-detail | grep prometheus
"name": "prometheus.nautilus",
"systemd_unit":  
"ceph-201a2fbc-ce7b-44a3-9ed7-39427972083b@prometheus.nautilus"



Quoting Tim Holloway:

I don't think there is a failure to deploy. For one thing, I did have,  
as mentioned, 3 Prometheus-related containers running at one point on  
the machine. I also checked for port issues and there are none.  
Nothing listens on 9095.


One thing that does concern me is that the docs say changes in  
settings require "restarting prometheus", but don't say what command does  
that. Given that there are no systemd units to address and that the  
orchestrator claims there is no "prometheus" service even as it  
shows that there's 1 service, stopped, it's quite frustrating.


On 3/26/25 07:26, Eugen Block wrote:
The cephadm.log should show some details why it fails to deploy the  
daemon. If there's not much, look into the daemon logs as well  
(cephadm logs --name prometheus.ceph02.mousetech.com). Could it be  
that there's a non-cephadm prometheus already listening on port 9095?


Quoting Tim Holloway:

I finally got brave and migrated from Pacific to Reef, did some  
banging and hammering and for the first time in a long time got a  
complete "HEALTH OK" status.


However, the dashboard is still not happy. It cannot contact the  
Prometheus API on port 9095.


I have redeployed Prometheus multiple times without result.

I'm pretty sure that at one time there were no less than 3  
different Prometheus containers running on one of the configured  
Prometheus servers, but now all I can get is the node-exporter.


Worse, if I do:

ceph orch reconfig prometheus

I get:

Error EINVAL: No daemons exist under service name "prometheus".  
View currently running services using "ceph orch ls"


But if I do:

ceph orch ls

I get:

prometheus ?:9095   0/1 -   
116s  ceph02.mousetech.com


Suggestions?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Well, here's an excerpt from the /var/log/ceph/cephadm.log. I don't know 
if that's the mechanism or file you mean, though.



2025-03-26 13:11:09,382 7fb2abc38740 DEBUG 


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:12:10,219 7fc4fd405740 DEBUG 


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:13:11,502 7f2ef3c76740 DEBUG 


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:14:12,372 7f3566bef740 DEBUG 


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:13,301 7f660e204740 DEBUG 


cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:20,880 7f93b227e740 DEBUG 


cephadm ['ls']
2025-03-26 13:15:20,904 7f93b227e740 DEBUG /usr/bin/podman: 5.2.2
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
2149e16fa2ce,11.51MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
65529d6ad1ac,17.69MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
51b1d190dfb9,99.79MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
59a865e3bcc5,6.791MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
dd3203f6f3bb,410.2MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
34177c4e5761,1.764GB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 
bfe17e83b288,534.2MB / 33.24GB
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
2149e16fa2ce,0.00%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
65529d6ad1ac,0.26%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
51b1d190dfb9,0.22%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
59a865e3bcc5,0.02%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
dd3203f6f3bb,0.86%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
34177c4e5761,1.67%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 
bfe17e83b288,0.25%

2025-03-26 13:15:20,985 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:20,993 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,024 7f93b227e740 DEBUG /usr/bin/podman: 
2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66,quay.io/prometheus/node-exporter:v1.5.0,0da6a335fe1356545476b749c68f022c897d

e3a2139e8f0054f6937349ee2b83,2025-03-25 16:52:31.644234532 -0400 EDT,
2025-03-26 13:15:21,057 7f93b227e740 DEBUG /usr/bin/podman: 
[quay.io/prometheus/node-exporter@sha256:39c642b2b337e38c18e80266fb14383754178202f40103646337722a594d984c 
quay.io/prometheus/node-exporter@sh

a256:fa8e5700b7762fffe0674e944762f44bb787a7e44d97569fe55348260453bf80]
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: 
node_exporter, version 1.5.0 (branch: HEAD, revision: 
1b48970ffcf5630534fb00bb0687d73c66d1c959)
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build 
user:   root@6e7732a7b81b
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build 
date:   20221129-18:59:09
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   go 
version:   go1.19.3
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: 
platform: linux/amd64

2025-03-26 13:15:21,187 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,196 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,228 7f93b227e740 DEBUG /usr/bin/podman: 
59a865e3bcc5e86f6caed8278aec0cfed608bf89ff4953dfb48b762138955925,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c31
72b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 
16:52:31.731849052 -0400 EDT,
2025-03-26 13:15:21,260 7f93b227e740 DEBUG /usr/bin/podman: 
[quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906 
quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace85

53330e454152b82f95a2b6bf33c3f3ec2eeac77]
2025-03-26 13:15:21,385 7f93b227e740 DEBUG /usr/bin/podman: ceph version 
18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)

:2025-03-26 13:15:21,412 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,421 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,451 7f93b227e740 DEBUG /usr/bin/podman: 
bfe17e83b28821be0ec399cde79965ade3bc3377c5acf05ef047395ddde4d804,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c31
72b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-26 
06:53:07.022104802 -0400 EDT,

20

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
I added a run and rerun for the fs suite on a fix
https://github.com/ceph/ceph/pull/62492

Venky, pls review and if approved I will merge it to reef and
cherry-pick to the release branch.

On Wed, Mar 26, 2025 at 8:04 AM Adam King  wrote:
>
> orch approved. The suite is obviously quite red, but the vast majority of the 
> failures are just due to the lack of a proper ignorelist in the orch suite on 
> reef.
>
> On Mon, Mar 24, 2025 at 5:40 PM Yuri Weinstein  wrote:
>>
>> Details of this release are summarized here:
>>
>> https://tracker.ceph.com/issues/70563#note-1
>> Release Notes - TBD
>> LRC upgrade - TBD
>>
>> Seeking approvals/reviews for:
>>
>> smoke - Laura approved?
>>
>> rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
>>
>> rgw - Adam E approved?
>>
>> fs - Venky is fixing QA suite, will need to be added and rerun
>>
>> orch - Adam King approved?
>>
>> rbd - Ilya approved?
>> krbd - Ilya approved?
>> upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a look.
>>
>> upgrade/pacific-x (reef) - can this be deprecated?  Josh?  Neha?
>> upgrade/quincy-x (reef) - Laura, Prashant please take a look.
>>
>> ceph-volume - Guillaume is fixing it.
>>
>> TIA
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Right, systemctl edit works as well. But I'm confused about the down  
OSDs. Did you set the proxy on all hosts? Because the down OSDs are on  
ceph06 while prometheus is supposed to run on dell02. Are you sure  
those are related?


I would recommend to remove the prometheus service entirely and start  
from scratch:


ceph orch rm prometheus
ceph mgr module disable prometheus
ceph mgr fail

Wait a minute, then enable it again and deploy prometheus:

ceph orch apply -i prometheus.yaml
ceph mgr module enable prometheus
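
Here prometheus.yaml is just the service spec, e.g. the one you exported
earlier in this thread:

service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24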



Quoting Tim Holloway:

Since the containers are all podman, I found a "systemctl edit  
podman" command that's recommended to set proxy for that.


However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning  
"HEALTH OK".


Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has  
failed: gaierror(-2, 'Name or service not known'); too many PGs per  
OSD (865 > max 560)

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2,  
'Name or service not known')

    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest to set it in the  
containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:",  
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the  
cephadm.log on dell02. Or even in 'ceph health detail'; usually it  
warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling  
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288)  
23874 : cephadm [INF] Saving service prometheus spec with  
placement dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme (mgr.18015288)  
23875 : cephadm [INF] Saving service node-exporter spec with  
placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme (mgr.18015288)  
23876 : cephadm [INF] Saving service alertmanager spec with  
placement dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme (mgr.18015288)  
23878 : cephadm [INF] Saving service grafana spec with placement  
dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme (mgr.18015288)  
23879 : cephadm [INF] Saving service ceph-exporter spec with  
placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme (mgr.18015288)  
23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi (mgr.18043792) 7  
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi (mgr.18043792) 8  
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on  
http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi (mgr.18043792) 9  
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on  
https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi (mgr.18043792)  
10 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi (mgr.18043792)  
11 : cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in  
HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line  
1823, in serve

    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line  
300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line  
277, in wrap

    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "/lib64/python3.9/ssl.py", line 1343, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF)  
(_ssl.c:1133)


2025-03-26T10:53:03.471114+ mgr.ceph08.tlocfi (mgr.18043792)  
12 : cephadm [INF] Updating  
ceph03.internal.mousetech.com:/etc/ceph/ceph.conf


On 3/26/25 11:39, Eugen Block wrote:
Then maybe the deployment did fail and we’re back at looking into  
the cephadm.log.



Quoting Tim Holloway:

it returns nothing. I'd already done the same via "systemctl |

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
It's strange, but for a while I'd been trying to get prometheus working 
on ceph08, so I don't know.


All I do know is immediately after editing the proxy settings I got 
indications that those 2 OSDs had gone down.


What's REALLY strange is that their logs seem to hint that somehow they 
shifted from administered to legacy configuration. That is, looking for 
OSD resources under /var/lib/ceph instead of /var/lib/ceph/<fsid>.


Anyway, I'll try yanking and re-deploying prometheus and maybe that will 
magically cure something.


On 3/26/25 12:53, Eugen Block wrote:
Right, systemctl edit works as well. But I'm confused about the down 
OSDs. Did you set the proxy on all hosts? Because the down OSDs are on 
ceph06 while prometheus is supposed to run on dell02. Are you sure 
those are related?


I would recommend to remove the prometheus service entirely and start 
from scratch:


ceph orch rm prometheus
ceph mgr module disable prometheus
ceph mgr fail

Wait a minute, then enable it again and deploy prometheus:

ceph orch apply -i prometheus.yaml
ceph mgr module enable prometheus



Quoting Tim Holloway:

Since the containers are all podman, I found a "systemctl edit 
podman" command that's recommended to set proxy for that.


However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning 
"HEALTH OK".


Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has 
failed: gaierror(-2, 'Name or service not known'); too many PGs per 
OSD (865 > max 560)

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 
'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not 
known')

[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest to set it in the 
containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:", 
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the 
cephadm.log on dell02. Or even in 'ceph health detail'; usually it 
warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling 
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288) 
23874 : cephadm [INF] Saving service prometheus spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme (mgr.18015288) 
23875 : cephadm [INF] Saving service node-exporter spec with 
placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme (mgr.18015288) 
23876 : cephadm [INF] Saving service alertmanager spec with 
placement dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme (mgr.18015288) 
23878 : cephadm [INF] Saving service grafana spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme (mgr.18015288) 
23879 : cephadm [INF] Saving service ceph-exporter spec with 
placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme (mgr.18015288) 
23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi (mgr.18043792) 7 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi (mgr.18043792) 8 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi (mgr.18043792) 9 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi (mgr.18043792) 10 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi (mgr.18043792) 11 
: cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in 
HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1823, 
in serve

    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line 
277, in wrap

    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway

No change.

On 3/26/25 13:01, Tim Holloway wrote:
It's strange, but for a while I'd been trying to get prometheus 
working on ceph08, so I don't know.


All I do know is immediately after editing the proxy settings I got 
indications that those 2 OSDs had gone down.


What's REALLY strange is that their logs seem to hint that somehow 
they shifted from administered to legacy configuration. That is, 
looking for OSD resources under /var/lib/ceph instead of 
/var/lib/ceph/<fsid>.


Anyway, I'll try yanking and re-deploying prometheus and maybe that 
will magically cure something.


On 3/26/25 12:53, Eugen Block wrote:
Right, systemctl edit works as well. But I'm confused about the down 
OSDs. Did you set the proxy on all hosts? Because the down OSDs are 
on ceph06 while prometheus is supposed to run on dell02. Are you sure 
those are related?


I would recommend to remove the prometheus service entirely and start 
from scratch:


ceph orch rm prometheus
ceph mgr module disable prometheus
ceph mgr fail

Wait a minute, then enable it again and deploy prometheus:

ceph orch apply -i prometheus.yaml
ceph mgr module enable prometheus



Quoting Tim Holloway:

Since the containers are all podman, I found a "systemctl edit 
podman" command that's recommended to set proxy for that.


However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning 
"HEALTH OK".


Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has 
failed: gaierror(-2, 'Name or service not known'); too many PGs per 
OSD (865 > max 560)

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 
'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service 
not known')

[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest to set it in the 
containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:", 
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the 
cephadm.log on dell02. Or even in 'ceph health detail'; usually it 
warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling 
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288) 
23874 : cephadm [INF] Saving service prometheus spec with 
placement dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme (mgr.18015288) 
23875 : cephadm [INF] Saving service node-exporter spec with 
placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme (mgr.18015288) 
23876 : cephadm [INF] Saving service alertmanager spec with 
placement dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme (mgr.18015288) 
23878 : cephadm [INF] Saving service grafana spec with placement 
dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme (mgr.18015288) 
23879 : cephadm [INF] Saving service ceph-exporter spec with 
placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme (mgr.18015288) 
23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi (mgr.18043792) 7 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi (mgr.18043792) 8 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi (mgr.18043792) 9 
: cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Serving on 
https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi (mgr.18043792) 
10 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi (mgr.18043792) 
11 : cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE Error in 
HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 
1823, in serve

    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 
300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line 
277, in wrap

    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", li

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein  wrote:
>
> I added a run and rerun for the fs suite on a fix
> https://github.com/ceph/ceph/pull/62492
>
> Venky, pls review and if approved I will merge it to reef and
> cherry-pick to the release branch.

Noted. I will let you know when it's ready to merge.

>
> On Wed, Mar 26, 2025 at 8:04 AM Adam King  wrote:
> >
> > orch approved. The suite is obviously quite red, but the vast majority of 
> > the failures are just due to the lack of a proper ignorelist in the orch 
> > suite on reef.
> >
> > On Mon, Mar 24, 2025 at 5:40 PM Yuri Weinstein  wrote:
> >>
> >> Details of this release are summarized here:
> >>
> >> https://tracker.ceph.com/issues/70563#note-1
> >> Release Notes - TBD
> >> LRC upgrade - TBD
> >>
> >> Seeking approvals/reviews for:
> >>
> >> smoke - Laura approved?
> >>
> >> rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
> >>
> >> rgw - Adam E approved?
> >>
> >> fs - Venky is fixing QA suite, will need to be added and rerun
> >>
> >> orch - Adam King approved?
> >>
> >> rbd - Ilya approved?
> >> krbd - Ilya approved?
> >> upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a look.
> >>
> >> upgrade/pacific-x (reef) - can this be deprecated?  Josh?  Neha?
> >> upgrade/quincy-x (reef) - Laura, Prashant please take a look.
> >>
> >> ceph-volume - Guillaume is fixing it.
> >>
> >> TIA
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
Hi Yuri,

On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar  wrote:
>
> On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein  wrote:
> >
> > I added a run and rerun for the fs suite on a fix
> > https://github.com/ceph/ceph/pull/62492
> >
> > Venky, pls review and if approved I will merge it to reef and
> > cherry-pick to the release branch.
>
> Noted. I will let you know when it's ready to merge.

The PR has been approved and is ready to merge (once it finishes jenkins tests).

>
> >
> > On Wed, Mar 26, 2025 at 8:04 AM Adam King  wrote:
> > >
> > > orch approved. The suite is obviously quite red, but the vast majority of 
> > > the failures are just due to the lack of a proper ignorelist in the orch 
> > > suite on reef.
> > >
> > > On Mon, Mar 24, 2025 at 5:40 PM Yuri Weinstein  
> > > wrote:
> > >>
> > >> Details of this release are summarized here:
> > >>
> > >> https://tracker.ceph.com/issues/70563#note-1
> > >> Release Notes - TBD
> > >> LRC upgrade - TBD
> > >>
> > >> Seeking approvals/reviews for:
> > >>
> > >> smoke - Laura approved?
> > >>
> > >> rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
> > >>
> > >> rgw - Adam E approved?
> > >>
> > >> fs - Venky is fixing QA suite, will need to be added and rerun
> > >>
> > >> orch - Adam King approved?
> > >>
> > >> rbd - Ilya approved?
> > >> krbd - Ilya approved?
> > >> upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a 
> > >> look.
> > >>
> > >> upgrade/pacific-x (reef) - can this be deprecated?  Josh?  Neha?
> > >> upgrade/quincy-x (reef) - Laura, Prashant please take a look.
> > >>
> > >> ceph-volume - Guillaume is fixing it.
> > >>
> > >> TIA
> > >> ___
> > >> ceph-users mailing list -- ceph-users@ceph.io
> > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > >>
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
>
>
>
> --
> Cheers,
> Venky



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Ok, I'll try one last time and ask for cephadm.log output. ;-) And the  
active MGR's log might help here as well.


Quoting Tim Holloway:


No change.

On 3/26/25 13:01, Tim Holloway wrote:
It's strange, but for a while I'd been trying to get prometheus  
working on ceph08, so I don't know.


All I do know is immediately after editing the proxy settings I got  
indications that those 2 OSDs had gone down.


What's REALLY strange is that their logs seem to hint that somehow  
they shifted from administered to legacy configuration. That is,  
looking for OSD resources under /var/lib/ceph instead of  
/var/lib/ceph/<fsid>.


Anyway, I'll try yanking and re-deploying prometheus and maybe that  
will magically cure something.


On 3/26/25 12:53, Eugen Block wrote:
Right, systemctl edit works as well. But I'm confused about the  
down OSDs. Did you set the proxy on all hosts? Because the down  
OSDs are on ceph06 while prometheus is supposed to run on dell02.  
Are you sure those are related?


I would recommend to remove the prometheus service entirely and  
start from scratch:


ceph orch rm prometheus
ceph mgr module disable prometheus
ceph mgr fail

Wait a minute, then enable it again and deploy prometheus:

ceph orch apply -i prometheus.yaml
ceph mgr module enable prometheus



Quoting Tim Holloway:

Since the containers are all podman, I found a "systemctl edit  
podman" command that's recommended to set proxy for that.


However, once I did, 2 OSDs went down and cannot be restarted.

In any event, before I did that, ceph health detail was returning  
"HEALTH OK".


Now I'm getting this:

HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has  
failed: gaierror(-2, 'Name or service not known'); too many PGs  
per OSD (865 > max 560)

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:  
gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service  
not known')

[WRN] TOO_MANY_PGS: too many PGs per OSD (865 > max 560)

On 3/26/25 12:07, Eugen Block wrote:
If you need a proxy to pull the images, I suggest to set it in  
the containers.conf:


cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:",  
"no_proxy="]


But again, you should be able to see a "failed to pull" message in the  
cephadm.log on dell02. Or even in 'ceph health detail'; usually  
it warns you if the orchestrator failed to place a daemon.


Quoting Tim Holloway:

One thing I did run into when upgrading was TLS issues pulling  
images. I had to set HTTP/S_PROXY and pull manually.


That may relate to this:

2025-03-26T10:52:16.547985+ mgr.dell02.zwnrme (mgr.18015288)  
23874 : cephadm [INF] Saving service prometheus spec with  
placement dell02.mousetech.com
2025-03-26T10:52:16.560810+ mgr.dell02.zwnrme  
(mgr.18015288) 23875 : cephadm [INF] Saving service  
node-exporter spec with placement *
2025-03-26T10:52:16.572380+ mgr.dell02.zwnrme  
(mgr.18015288) 23876 : cephadm [INF] Saving service  
alertmanager spec with placement dell02.mousetech.com
2025-03-26T10:52:16.583555+ mgr.dell02.zwnrme  
(mgr.18015288) 23878 : cephadm [INF] Saving service grafana  
spec with placement dell02.mousetech.com
2025-03-26T10:52:16.601713+ mgr.dell02.zwnrme  
(mgr.18015288) 23879 : cephadm [INF] Saving service  
ceph-exporter spec with placement *
2025-03-26T10:52:44.139886+ mgr.dell02.zwnrme  
(mgr.18015288) 23898 : cephadm [INF] Restart service mgr
2025-03-26T10:53:02.720157+ mgr.ceph08.tlocfi  
(mgr.18043792) 7 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE  
Bus STARTING
2025-03-26T10:53:02.824138+ mgr.ceph08.tlocfi  
(mgr.18043792) 8 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE  
Serving on http://10.0.1.58:8765
2025-03-26T10:53:02.962314+ mgr.ceph08.tlocfi  
(mgr.18043792) 9 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE  
Serving on https://10.0.1.58:7150
2025-03-26T10:53:02.962805+ mgr.ceph08.tlocfi  
(mgr.18043792) 10 : cephadm [INF] [26/Mar/2025:10:53:02] ENGINE  
Bus STARTED
2025-03-26T10:53:02.964966+ mgr.ceph08.tlocfi  
(mgr.18043792) 11 : cephadm [ERR] [26/Mar/2025:10:53:02] ENGINE  
Error in HTTPServer.serve

Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line  
1823, in serve

    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py",  
line 203, in run

    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py",  
line 246, in _run

    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py",  
line 300, in _from_server_socket

    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py",

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
Ack, Travis
I was about to reply the same.

Venky, Guillaume the PRs below were cherry-picked
I will rerun the fs and ceph-volume tests when the build is done

https://github.com/ceph/ceph/pull/62492/commits
https://github.com/ceph/ceph/pull/62178/commits

On Wed, Mar 26, 2025 at 2:20 PM Travis Nielsen  wrote:
>
> Oh sorry, forget my last email, thanks Laura for pointing out the obvious 
> that this is for reef, not squid!
>
> On Wed, Mar 26, 2025 at 2:46 PM Travis Nielsen  wrote:
>>
>> Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is 
>> included in 18.2.5, then we really need [2] merged before release, as it 
>> would be blocking Rook.
>>
>> [1] https://github.com/ceph/ceph/pull/62095 (merged to squid on March 19)
>> [2] https://tracker.ceph.com/issues/70667
>>
>> Thanks!
>> Travis
>>
>> On Wed, Mar 26, 2025 at 2:09 PM Ilya Dryomov  wrote:
>>>
>>> On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein  wrote:
>>> >
>>> > Details of this release are summarized here:
>>> >
>>> > https://tracker.ceph.com/issues/70563#note-1
>>> > Release Notes - TBD
>>> > LRC upgrade - TBD
>>> >
>>> > Seeking approvals/reviews for:
>>> >
>>> > smoke - Laura approved?
>>> >
>>> > rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
>>> >
>>> > rgw - Adam E approved?
>>> >
>>> > fs - Venky is fixing QA suite, will need to be added and rerun
>>> >
>>> > orch - Adam King approved?
>>> >
>>> > rbd - Ilya approved?
>>> > krbd - Ilya approved?
>>>
>>> Hi Yuri,
>>>
>>> rbd and krbd approved.
>>>
>>> > upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a 
>>> > look.
>>>
>>> I don't recall seeing this before -- try rerunning it a couple of times?
>>>
>>> Thanks,
>>>
>>> Ilya
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway

Sorry, duplicated a URL. The mgr log is

https://www.mousetech.com/share/ceph-mgr.log
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin

Hi,

We have a production cluster made of 3 mon+mgr, 18 OSD servers and ~500 
OSDs and configured with ~50 pools, 1/2 EC (9+6) and 1/2 replica 3. It 
also has 2 CephFS filesystems with 1 MDS each.


2 days ago, in a period spanning 16 hours, 13 OSDs crashed with an OOM.
The OSDs were first restarted, but it was decided to reboot the server
with a crashed OSD and "by mistake" (it was at least useless) the OSDs
of the rebooted server were set noout,norebalance before the reboot. The
flags were removed after the reboot.


After all of this, 'ceph -s' started to report a lot of misplaced PGs
and recovery started. All the PGs but one were successfully reactivated.
One stayed in the activating+remapped state (located in a pool used for
tests). 'ceph health' (I don't put the details here to avoid an overly
long mail, but I can share them) says:


HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2 MDSs 
report slow metadata IOs; Reduced data availability: 1 pg inactive; 13 
daemons have recently crashed


and reports one of the filesystems as degraded, even though the only PG
reported inactive is not part of a pool related to that FS.


The recovery was slow until we realized we should change the mclock
profile to high_recovery_ops. Then it completed in a few hours.
Unfortunately the degraded filesystem remains degraded without an
obvious reason... and the inactive PG is still in the
activating+remapped state. We have not been able to identify a relevant
error in the logs up to now (but we may have missed something...).


So far we have avoided restarting too many things until we have a better 
understanding of what happened and what is the current state. We only 
restarted the mgr which was using a lot of CPU and the MDS for the 
degraded FS, without any improvement.


We are looking for advice on where to start... It seems we have (at
least) 2 independent problems:


- A PG that cannot be reactivated with a remap operation that doesn't 
proceed: would stopping osd.17 help (so that osd.460 is reused)?


[root@ijc-mon1 ~]# ceph pg dump_stuck
PG_STAT  STATE    UP    UP_PRIMARY ACTING 
ACTING_PRIMARY
32.7ef   activating+remapped  [100,154,17] 100 
[100,154,460] 100


- 1 degraded filesystem: where to look for a reason?

Thanks in advance for any help!

Cheers,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin

Hi again,

Looking for more info on the degraded filesystem, I managed to connect
to the dashboard, where I see an error not reported explicitly by
'ceph health':


One or more metadata daemons (MDS ranks) are failed or in a damaged 
state. At best the filesystem is partially available, at worst the 
filesystem is completely unusable.


But I can't work out what can be done from this point... and I really
don't understand how we ended up in such a state...
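
If it helps the diagnosis, we can easily share the output of commands
such as:

ceph fs status
ceph mds stat
ceph health detail
ceph pg 32.7ef query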


Cheers,

Michel

Le 26/03/2025 à 21:27, Michel Jouvin a écrit :

Hi,

We have a production cluster made of 3 mon+mgr, 18 OSD servers and 
~500 OSDs and configured with ~50 pools, 1/2 EC (9+6) and 1/2 replica 
3. It also has 2 CephFS filesystems with 1 MDS each.


2 days ago, in a period spanning 16 hours, 13 OSD crashed with an OOM. 
The OSD were first restarted but it was decided to reboot the server 
with a crashed OSD and "by mistake" (it was at least useless), the OSD 
of the rebooted server were set noout,norebalance before the reboot. 
The flags were removed after the reboot.


After all of this, 'ceph -s' started to report a lot of misplaced PG 
and recovery started. All the PGs but one were successfully 
reactivated. One stayed in the activating+remapped state (located in a 
pool used for tests). 'ceph health' (I don't put the details here to 
avoid a too long mail but I can shared them) says:


HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2 
MDSs report slow metadata IOs; Reduced data availability: 1 pg 
inactive; 13 daemons have recently crashed


and reports about one of the filesystem being degraded despite the 
only PG reported inactive is not part of a pool related to the FS.


The recovery was slow until we realized we should change the mclock 
profile to high_recovery_ops. Then it completed in a few hours. 
Unfortunately the degraded filesystem remains degraded without an 
obvious reason... and the inactive page is still in the 
activating+remapped state. We have not been able to identify a 
relevant error in the logs up to now (but we may have missed 
something...).


So far we have avoided restarting too many things until we have a 
better understanding of what happened and what is the current state. 
We only restarted the mgr which was using a lot of CPU and the MDS for 
the degraded FS, without any improvement.


We are looking on advices about where to start... It seems we have (at 
least) 2 independent problems:


- A PG that cannot be reactivated with a remap operation that doesn't 
proceed: would stopping osd.17 help (so that osd.460 is reused)?


[root@ijc-mon1 ~]# ceph pg dump_stuck
PG_STAT  STATE    UP    UP_PRIMARY ACTING 
ACTING_PRIMARY
32.7ef   activating+remapped  [100,154,17] 100 
[100,154,460] 100


- 1 degraded filesystem: where to look for a reason?

Thanks in advance for any help?

Cheers,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway

OSD mystery is solved.

Both OSDs were LVM-backed volumes imported as vdisks for the Ceph VMs.
Apparently something scrambled either the VM manager or the host disk
subsystem, as the VM disks were getting I/O errors and even disappearing
from the VM.


I rebooted the physical machine and that cleared it. All OSDs now happy 
again.


...

Well, it looks like one OSD has been damaged permanently, so I purged it. (:

On 3/26/25 15:08, Tim Holloway wrote:

Sorry, duplicated a URL. The mgr log is

https://www.mousetech.com/share/ceph-mgr.log
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
And sorry for all these mails, I forgot to mention that we are running 
18.2.2.


Michel

Le 26/03/2025 à 21:51, Michel Jouvin a écrit :

Hi again,

Looking for more info on the degraded filesystem, I managed to connect 
to the dashboard where I see an error not reported as explicitely by 
'ceph health' :


One or more metadata daemons (MDS ranks) are failed or in a damaged 
state. At best the filesystem is partially available, at worst the 
filesystem is completely unusable.


But I don't manage what can be done from this point... and I really 
don't understand how we ended up in such a state...


Cheers,

Michel

Le 26/03/2025 à 21:27, Michel Jouvin a écrit :

Hi,

We have a production cluster made of 3 mon+mgr, 18 OSD servers and 
~500 OSDs and configured with ~50 pools, 1/2 EC (9+6) and 1/2 replica 
3. It also has 2 CephFS filesystems with 1 MDS each.


2 days ago, in a period spanning 16 hours, 13 OSD crashed with an 
OOM. The OSD were first restarted but it was decided to reboot the 
server with a crashed OSD and "by mistake" (it was at least useless), 
the OSD of the rebooted server were set noout,norebalance before the 
reboot. The flags were removed after the reboot.


After all of this, 'ceph -s' started to report a lot of misplaced PG 
and recovery started. All the PGs but one were successfully 
reactivated. One stayed in the activating+remapped state (located in 
a pool used for tests). 'ceph health' (I don't put the details here 
to avoid a too long mail but I can shared them) says:


HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2 
MDSs report slow metadata IOs; Reduced data availability: 1 pg 
inactive; 13 daemons have recently crashed


and reports about one of the filesystem being degraded despite the 
only PG reported inactive is not part of a pool related to the FS.


The recovery was slow until we realized we should change the mclock 
profile to high_recovery_ops. Then it completed in a few hours. 
Unfortunately the degraded filesystem remains degraded without an 
obvious reason... and the inactive page is still in the 
activating+remapped state. We have not been able to identify a 
relevant error in the logs up to now (but we may have missed 
something...).


So far we have avoided restarting too many things until we have a 
better understanding of what happened and what is the current state. 
We only restarted the mgr which was using a lot of CPU and the MDS 
for the degraded FS, without any improvement.


We are looking on advices about where to start... It seems we have 
(at least) 2 independent problems:


- A PG that cannot be reactivated with a remap operation that doesn't 
proceed: would stopping osd.17 help (so that osd.460 is reused)?


[root@ijc-mon1 ~]# ceph pg dump_stuck
PG_STAT  STATE    UP    UP_PRIMARY ACTING 
ACTING_PRIMARY
32.7ef   activating+remapped  [100,154,17] 100 
[100,154,460] 100


- 1 degraded filesystem: where to look for a reason?

Thanks in advance for any help?

Cheers,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
OK. I couldn't find a quick way to shovel a largish file from an 
internal server into pastebin, but my own servers can suffice.


the URLs are:

https://www.mousetech.com/share/cephadm.log

https://www.mousetech.com/share/cephadm.log

And I don't see a deployment either.


On 3/26/25 14:26, Eugen Block wrote:
That would be the correct log file, but I don't see an attempt to 
deploy a prometheus instance there. You can use any pastebin you like, 
e. g. https://pastebin.com/ to upload your logs. Mask any sensitive 
data before you do that.



Zitat von Tim Holloway :

Well, here's an excerpt from the /var/log/ceph/cephadm.log. I don't 
know if that's the mechanism or file you mean, though.



2025-03-26 13:11:09,382 7fb2abc38740 DEBUG
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:12:10,219 7fc4fd405740 DEBUG
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:13:11,502 7f2ef3c76740 DEBUG
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:14:12,372 7f3566bef740 DEBUG
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:13,301 7f660e204740 DEBUG
cephadm ['--no-container-init', '--timeout', '895', 'gather-facts']
2025-03-26 13:15:20,880 7f93b227e740 DEBUG
cephadm ['ls']
2025-03-26 13:15:20,904 7f93b227e740 DEBUG /usr/bin/podman: 5.2.2
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce,11.51MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 65529d6ad1ac,17.69MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 51b1d190dfb9,99.79MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5,6.791MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: dd3203f6f3bb,410.2MB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: 34177c4e5761,1.764GB / 33.24GB
2025-03-26 13:15:20,939 7f93b227e740 DEBUG /usr/bin/podman: bfe17e83b288,534.2MB / 33.24GB
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce,0.00%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 65529d6ad1ac,0.26%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 51b1d190dfb9,0.22%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5,0.02%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: dd3203f6f3bb,0.86%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: 34177c4e5761,1.67%
2025-03-26 13:15:20,972 7f93b227e740 DEBUG /usr/bin/podman: bfe17e83b288,0.25%

2025-03-26 13:15:20,985 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:20,993 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,024 7f93b227e740 DEBUG /usr/bin/podman: 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66,quay.io/prometheus/node-exporter:v1.5.0,0da6a335fe1356545476b749c68f022c897de3a2139e8f0054f6937349ee2b83,2025-03-25 16:52:31.644234532 -0400 EDT,
2025-03-26 13:15:21,057 7f93b227e740 DEBUG /usr/bin/podman: [quay.io/prometheus/node-exporter@sha256:39c642b2b337e38c18e80266fb14383754178202f40103646337722a594d984c quay.io/prometheus/node-exporter@sha256:fa8e5700b7762fffe0674e944762f44bb787a7e44d97569fe55348260453bf80]
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build user:   root@6e7732a7b81b
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: build date:   20221129-18:59:09
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman:   go version:   go1.19.3
2025-03-26 13:15:21,111 7f93b227e740 DEBUG /usr/bin/podman: platform: linux/amd64

2025-03-26 13:15:21,187 7f93b227e740 DEBUG systemctl: enabled
2025-03-26 13:15:21,196 7f93b227e740 DEBUG systemctl: active
2025-03-26 13:15:21,228 7f93b227e740 DEBUG /usr/bin/podman: 59a865e3bcc5e86f6caed8278aec0cfed608bf89ff4953dfb48b762138955925,quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906,2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,2025-03-25 16:52:31.731849052 -0400 EDT,
2025-03-26 13:15:21,260 7f93b227e740 DEBUG /usr/bin/podman: [quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906 quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77]
2025-03-26 13:15

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Ilya Dryomov
On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/70563#note-1
> Release Notes - TBD
> LRC upgrade - TBD
>
> Seeking approvals/reviews for:
>
> smoke - Laura approved?
>
> rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
>
> rgw - Adam E approved?
>
> fs - Venky is fixing QA suite, will need to be added and rerun
>
> orch - Adam King approved?
>
> rbd - Ilya approved?
> krbd - Ilya approved?

Hi Yuri,

rbd and krbd approved.

> upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a look.

I don't recall seeing this before -- try rerunning it a couple of times?

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Guillaume ABRIOUX
Hi Yuri,

ceph-volume is missing this backport [1].
Also for this release you will need to run the teuthology orch/cephadm test 
suite for validating ceph-volume rather than the usual "ceph-volume functional 
test suite" [2]

[1] https://github.com/ceph/ceph/pull/62178
[2] https://jenkins.ceph.com/job/ceph-volume-test/


--
Guillaume Abrioux
Software Engineer

De : Yuri Weinstein 
Envoyé : lundi 24 mars 2025 22:39
À : dev ; ceph-users 
Objet : [EXTERNAL] [ceph-users] reef 18.2.5 QE validation status

Details of this release are summarized here:

https://tracker.ceph.com/issues/70563#note-1 
Release Notes - TBD
LRC upgrade - TBD

Seeking approvals/reviews for:

smoke - Laura approved?

rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?

rgw - Adam E approved?

fs - Venky is fixing QA suite, will need to be added and rerun

orch - Adam King approved?

rbd - Ilya approved?
krbd - Ilya approved?
upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a look.

upgrade/pacific-x (reef) - can this be deprecated?  Josh?  Neha?
upgrade/quincy-x (reef) - Laura, Prashant please take a look.

ceph-volume - Guillaume is fixing it.

TIA
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

Unless otherwise stated above:

Compagnie IBM France
Siège Social : 17, avenue de l'Europe, 92275 Bois-Colombes Cedex
RCS Nanterre 552 118 465
Forme Sociale : S.A.S.
Capital Social : 664 614 175,50 €
SIRET : 552 118 465 03644 - Code NAF 6203Z
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is
included in 18.2.5, then we really need [2] merged before release, as it
would be blocking Rook.

[1] https://github.com/ceph/ceph/pull/62095 (merged to squid on March 19)
[2] https://tracker.ceph.com/issues/70667

Thanks!
Travis

On Wed, Mar 26, 2025 at 2:09 PM Ilya Dryomov  wrote:

> On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein 
> wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/70563#note-1
> > Release Notes - TBD
> > LRC upgrade - TBD
> >
> > Seeking approvals/reviews for:
> >
> > smoke - Laura approved?
> >
> > rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
> >
> > rgw - Adam E approved?
> >
> > fs - Venky is fixing QA suite, will need to be added and rerun
> >
> > orch - Adam King approved?
> >
> > rbd - Ilya approved?
> > krbd - Ilya approved?
>
> Hi Yuri,
>
> rbd and krbd approved.
>
> > upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a
> look.
>
> I don't recall seeing this before -- try rerunning it a couple of times?
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Laura Flores
Rados approved:
https://tracker.ceph.com/projects/rados/wiki/REEF#v1825-httpstrackercephcomissues70563note-1

On Wed, Mar 26, 2025 at 12:22 PM Venky Shankar  wrote:

> Hi Yuri,
>
> On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar  wrote:
> >
> > On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein 
> wrote:
> > >
> > > I added a run and rerun for the fs suite on a fix
> > > https://github.com/ceph/ceph/pull/62492
> > >
> > > Venky, pls review and if approved I will merge it to reef and
> > > cherry-pick to the release branch.
> >
> > Noted. I will let you know when it's ready to merge.
>
> The PR has been approved and is ready to merge (once it finishes jenkins
> tests).
>
> >
> > >
> > > On Wed, Mar 26, 2025 at 8:04 AM Adam King  wrote:
> > > >
> > > > orch approved. The suite is obviously quite red, but the vast
> majority of the failures are just due to the lack of a proper ignorelist in
> the orch suite on reef.
> > > >
> > > > On Mon, Mar 24, 2025 at 5:40 PM Yuri Weinstein 
> wrote:
> > > >>
> > > >> Details of this release are summarized here:
> > > >>
> > > >> https://tracker.ceph.com/issues/70563#note-1
> > > >> Release Notes - TBD
> > > >> LRC upgrade - TBD
> > > >>
> > > >> Seeking approvals/reviews for:
> > > >>
> > > >> smoke - Laura approved?
> > > >>
> > > >> rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King
> approved?
> > > >>
> > > >> rgw - Adam E approved?
> > > >>
> > > >> fs - Venky is fixing QA suite, will need to be added and rerun
> > > >>
> > > >> orch - Adam King approved?
> > > >>
> > > >> rbd - Ilya approved?
> > > >> krbd - Ilya approved?
> > > >> upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take
> a look.
> > > >>
> > > >> upgrade/pacific-x (reef) - can this be deprecated?  Josh?  Neha?
> > > >> upgrade/quincy-x (reef) - Laura, Prashant please take a look.
> > > >>
> > > >> ceph-volume - Guillaume is fixing it.
> > > >>
> > > >> TIA
> > > >> ___
> > > >> ceph-users mailing list -- ceph-users@ceph.io
> > > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >>
> > > ___
> > > Dev mailing list -- d...@ceph.io
> > > To unsubscribe send an email to dev-le...@ceph.io
> >
> >
> >
> > --
> > Cheers,
> > Venky
>
>
>
> --
> Cheers,
> Venky
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>


-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I finally got brave and migrated from Pacific to Reef, did some banging 
and hammering and for the first time in a long time got a complete 
"HEALTH OK" status.


However, the dashboard is still not happy. It cannot contact the 
Prometheus API on port 9095.
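
For completeness, the Prometheus endpoint the dashboard uses can be
read back with:

ceph dashboard get-prometheus-api-host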


I have redeployed Prometheus multiple times without result.

I'm pretty sure that at one time there were no less than 3 different 
Prometheus containers running on one of the configured Prometheus 
servers, but now all I can get is the node-exporter.


Worse, if I do:

ceph orch reconfig prometheus

I get:

Error EINVAL: No daemons exist under service name "prometheus". View 
currently running services using "ceph orch ls"


But if I do:

ceph orch ls

I get:

prometheus ?:9095   0/1 -  116s  
ceph02.mousetech.com


Suggestions?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
I tried something else, but the result is not really satisfying. I
edited the keepalive.conf files which had no peers at all or only one
peer, so that they were all identical. Restarting the daemons helped, so
that only one virtual IP is assigned; now the daemons do communicate and
I see messages like these:


Master received advert from 192.168.168.112 with same priority 80 but  
higher IP address than ours

Entering BACKUP STATE

So that's good. But powering off the machine with the active nfs  
daemon doesn't provide the expected result. Although keepalive assigns  
the virtual ip to a different host, the failed nfs daemon lands on the  
third node, so mounting is not possible.


To prevent that from happening, I reduced the number of hosts for nfs
and ingress to two. And that seems to work as expected (after modifying
the keepalive.conf again). But all in all, the keepalive_only option
requires a bit too much manual work at this point.


And just a side note: I don't see that a client is connected although  
I am writing data into the nfs export. Both the dashboard and CLI show  
no client:


ceph nfs export info ebl-nfs-cephfs /nfsovercephfs
{
  "access_type": "RW",
  "clients": [],
  "cluster_id": "ebl-nfs-cephfs",
...

I only see the active nfs daemon as a CephFS client.


Zitat von Eugen Block :

Thanks, I removed the ingress service and redeployed it again, with  
the same result. The interesting part here is, the configs are  
identical compared to the previous deployment, so the same peers (or  
no peers) as before.


Zitat von Robert Sander :


Am 3/25/25 um 18:55 schrieb Eugen Block:
Okay, so I don't see anything in the keepalive log about  
communicating between each other. The config files are almost  
identical, no difference in priority, but in unicast_peer. ceph03  
has no entry at all for unicast_peer, ceph02 has only ceph03 in  
there while ceph01 has both of the others in its unicast_peer  
entry. That's weird, isn't it?


They should each have the other two as unicast_peers.
There must have been a glitch in the service generation. Maybe you  
should try to remove it and deploy it as new?


Regards
--
Robert Sander
Linux Consultant

Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: +49 30 405051 - 0
Fax: +49 30 405051 - 19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
Thanks, I removed the ingress service and redeployed it again, with  
the same result. The interesting part here is, the configs are  
identical compared to the previous deployment, so the same peers (or  
no peers) as before.


Zitat von Robert Sander :


Am 3/25/25 um 18:55 schrieb Eugen Block:
Okay, so I don't see anything in the keepalive log about  
communicating between each other. The config files are almost  
identical, no difference in priority, but in unicast_peer. ceph03  
has no entry at all for unicast_peer, ceph02 has only ceph03 in  
there while ceph01 has both of the others in its unicast_peer  
entry. That's weird, isn't it?


They should each have the other two as unicast_peers.
There must have been a glitch in the service generation. Maybe you  
should try to remove it and deploy it as new?


Regards
--
Robert Sander
Linux Consultant

Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: +49 30 405051 - 0
Fax: +49 30 405051 - 19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Robert Sander

Am 3/25/25 um 18:55 schrieb Eugen Block:
Okay, so I don't see anything in the keepalive log about communicating 
between each other. The config files are almost identical, no difference 
in priority, but in unicast_peer. ceph03 has no entry at all for 
unicast_peer, ceph02 has only ceph03 in there while ceph01 has both of 
the others in its unicast_peer entry. That's weird, isn't it?


They should each have the other two as unicast_peers.
There must have been a glitch in the service generation. Maybe you 
should try to remove it and deploy it as new?
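
For three hosts, each keepalived.conf would contain something roughly
like this (addresses, interface and priority are only an example):

vrrp_instance VI_0 {
  state BACKUP
  priority 80
  interface eth0
  virtual_router_id 50
  unicast_src_ip 192.168.168.111
  unicast_peer {
    192.168.168.112
    192.168.168.113
  }
  virtual_ipaddress {
    192.168.168.200/24 dev eth0
  }
}

with unicast_src_ip and the two unicast_peer entries rotated per host.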


Regards
--
Robert Sander
Linux Consultant

Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: +49 30 405051 - 0
Fax: +49 30 405051 - 19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Oh sorry, forget my last email, thanks Laura for pointing out the obvious
that this is for reef, not squid!

On Wed, Mar 26, 2025 at 2:46 PM Travis Nielsen  wrote:

> Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is
> included in 18.2.5, then we really need [2] merged before release, as it
> would be blocking Rook.
>
> [1] https://github.com/ceph/ceph/pull/62095 (merged to squid on March 19)
> [2] https://tracker.ceph.com/issues/70667
>
> Thanks!
> Travis
>
> On Wed, Mar 26, 2025 at 2:09 PM Ilya Dryomov  wrote:
>
>> On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein 
>> wrote:
>> >
>> > Details of this release are summarized here:
>> >
>> > https://tracker.ceph.com/issues/70563#note-1
>> > Release Notes - TBD
>> > LRC upgrade - TBD
>> >
>> > Seeking approvals/reviews for:
>> >
>> > smoke - Laura approved?
>> >
>> > rados - Radek, Laura approved? Travis?  Nizamudeen? Adam King approved?
>> >
>> > rgw - Adam E approved?
>> >
>> > fs - Venky is fixing QA suite, will need to be added and rerun
>> >
>> > orch - Adam King approved?
>> >
>> > rbd - Ilya approved?
>> > krbd - Ilya approved?
>>
>> Hi Yuri,
>>
>> rbd and krbd approved.
>>
>> > upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a
>> look.
>>
>> I don't recall seeing this before -- try rerunning it a couple of times?
>>
>> Thanks,
>>
>> Ilya
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io