[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Oh sorry, forget my last email, thanks Laura for pointing out the obvious that this is for reef, not squid! On Wed, Mar 26, 2025 at 2:46 PM Travis Nielsen wrote: > Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is > included in 18.2.5, then we really need [2] merged before r

[ceph-users] Re: Ceph orch placement - anti affinity

2025-03-26 Thread Eugen Block
If you don't specify "count_per_host", the orchestrator won't deploy multiple daemons on one host. There's no way (that I'm aware of) to specify a primary daemon. Since standby daemons need to be able to take over the workload, they should all be equally equipped. Quoting Kasper Rasmussen
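
A minimal sketch of what such a spec could look like (the service_id and host names below are hypothetical); without count_per_host, cephadm places at most one daemon of this service on each listed host:

    service_type: mds
    service_id: fs1
    placement:
      hosts:
      - host1
      - host2
      - host3
      count: 3

It can then be applied with 'ceph orch apply -i mds-fs1.yaml'.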

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No daemons exist under service name "prometheus". View currently running services using "ceph orch ls"" And y

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Then maybe the deployment did fail and we're back to looking at the cephadm.log. Quoting Tim Holloway: it returns nothing. I'd already done the same via "systemctl | grep prometheus". There simply isn't a systemd service, even though there should be. On 3/26/25 11:31, Eugen Block w

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
If you need a proxy to pull the images, I suggest setting it in containers.conf:

cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:", "no_proxy="]

But again, you should be able to see a failed pull in the cephadm.log on dell02. Or even in 'ceph health
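
As a sketch with made-up proxy values (the real ones are elided above), the [engine] section would look roughly like this; podman's engine picks these variables up, including for the image pulls cephadm performs:

    # /etc/containers/containers.conf (values are examples only)
    [engine]
    env = ["http_proxy=http://proxy.example.com:3128",
           "https_proxy=http://proxy.example.com:3128",
           "no_proxy=localhost,127.0.0.1"]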

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
That would be the correct log file, but I don't see an attempt to deploy a prometheus instance there. You can use any pastebin you like, e.g. https://pastebin.com/, to upload your logs. Mask any sensitive data before you do that. Quoting Tim Holloway: Well, here's an excerpt from the /

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I don't think there is a failure to deploy. For one thing, I did have, as mentioned, 3 Prometheus-related containers running at one point on the machine. Also checked for port issues and there are none. Nothing listens on 9095. One thing that does concern me is that the docs say changes in settin

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
There's a service called "prometheus", which can have multiple daemons, just like any other service (mon, mgr, etc.). To get the daemon logs you need to provide the daemon name (prometheus.ceph02.andsopn), not just the service name (prometheus). Can you run the cephadm command I provided? It
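
To illustrate the difference (the daemon name below is a placeholder following the usual <service>.<host> pattern):

    # list the daemons behind the service, with their full names
    ceph orch ps --daemon_type prometheus
    # then fetch logs for one specific daemon, not the service
    cephadm logs --name prometheus.dell02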

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Since the containers are all podman, I found a "systemctl edit podman" command that's recommended for setting the proxy for that. However, once I did, 2 OSDs went down and cannot be restarted. In any event, before I did that, ceph health detail was returning "HEALTH OK". Now I'm getting this: HEALTH
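
For reference, an override created with "systemctl edit <unit>" is just a drop-in file; a proxy override (addresses are placeholders) typically looks like this:

    # /etc/systemd/system/<unit>.service.d/override.conf
    [Service]
    Environment="http_proxy=http://proxy.example.com:3128"
    Environment="https_proxy=http://proxy.example.com:3128"
    Environment="no_proxy=localhost,127.0.0.1"

Whether a unit literally named "podman" is the right target on a cephadm host is a separate question, since the cephadm-deployed daemons run under their own ceph-<fsid>@... units.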

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Also, here are the currently-installed container images:

[root@dell02 ~]# podman image ls
REPOSITORY           TAG       IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph              2bc0b0f4375d  8 months ago   1.25 GB
quay.io/ceph/ceph              3c4eff6082ae  10

[ceph-users] Ceph orch placement - anti affinity

2025-03-26 Thread Kasper Rasmussen
Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts. I use ceph orch apply mds to spin up the MDS daemons. Is there a way to ensure that I don’t get two active MDS running on the same host? I mean when using the ceph orch apply mds command, I can specify --placement, but it on
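
For illustration, the CLI form (filesystem names and hosts here are made up); --placement controls how many daemons run and on which hosts, but not which host a given rank ends up active on:

    ceph orch apply mds fs1 --placement="3 host1 host2 host3"
    ceph orch apply mds fs2 --placement="3 host1 host2 host3"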

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
The cephadm.log should show some details why it fails to deploy the daemon. If there's not much, look into the daemon logs as well (cephadm logs --name prometheus.ceph02.mousetech.com). Could it be that there's a non-cephadm prometheus already listening on port 9095? Zitat von Tim Holloway
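
A quick way to check for a conflicting listener on the Prometheus port (assuming ss and podman are available on the host):

    # anything already bound to 9095?
    ss -tlnp | grep ':9095'
    # any non-cephadm prometheus container running?
    podman ps --format '{{.Names}}' | grep -i prom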

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Can you share 'ceph orch ls prometheus --export'? And if it has been deployed successfully but is currently not running, the logs should show why that is the case. To restart prometheus, you can just run this to restart the entire prometheus service (which would include all instances if you
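
For reference, the two commands in question look like this (service name as deployed by cephadm):

    # dump the currently applied spec for the service
    ceph orch ls prometheus --export
    # restart every daemon belonging to the service
    ceph orch restart prometheus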

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Well, here's an excerpt from the /var/log/ceph/cephadm.log. I don't know if that's the mechanism or file you mean, though.

2025-03-26 13:11:09,382 7fb2abc38740 DEBUG cephadm ['--no-container-init', '--timeout', '

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
I added a run and rerun for the fs suite on a fix https://github.com/ceph/ceph/pull/62492 Venky, pls review and if approved I will merge it to reef and cherry-pick to the release branch. On Wed, Mar 26, 2025 at 8:04 AM Adam King wrote: > > orch approved. The suite is obviously quite red, but the

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Right, systemctl edit works as well. But I'm confused about the down OSDs. Did you set the proxy on all hosts? Because the down OSDs are on ceph06 while prometheus is supposed to run on dell02. Are you sure those are related? I would recommend removing the prometheus service entirely and s
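
A sketch of the remove-and-redeploy cycle being suggested (the spec file name is arbitrary):

    ceph orch rm prometheus
    # wait until 'ceph orch ps' no longer shows prometheus daemons, then
    ceph orch apply -i prometheus.yaml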

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. All I do know is immediately after editing the proxy settings I got indications that those 2 OSDs had gone down. What's REALLY strange is that their logs seem to hint that somehow they shifted

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
No change. On 3/26/25 13:01, Tim Holloway wrote: It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. All I do know is immediately after editing the proxy settings I got indications that those 2 OSDs had gone down. What's REALLY strange is that

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein wrote: > > I added a run and rerun for the fs suite on a fix > https://github.com/ceph/ceph/pull/62492 > > Venky, pls review and if approved I will merge it to reef and > cherry-pick to the release branch. Noted. I will let you know when it's ready t

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
Hi Yuri, On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar wrote: > > On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein wrote: > > > > I added a run and rerun for the fs suite on a fix > > https://github.com/ceph/ceph/pull/62492 > > > > Venky, pls review and if approved I will merge it to reef and > >

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Ok, I'll try one last time and ask for cephadm.log output. ;-) And the active MGR's log might help here as well. Quoting Tim Holloway: No change. On 3/26/25 13:01, Tim Holloway wrote: It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. A

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
Ack, Travis, I was about to reply the same. Venky, Guillaume, the PRs below were cherry-picked. I will rerun the fs and ceph-volume tests when the build is done https://github.com/ceph/ceph/pull/62492/commits https://github.com/ceph/ceph/pull/62178/commits On Wed, Mar 26, 2025 at 2:20 PM Travis Nie

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Sorry, duplicated a URL. The mgr log is https://www.mousetech.com/share/ceph-mgr.log ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
Hi, We have a production cluster made of 3 mon+mgr, 18 OSD servers and ~500 OSDs, configured with ~50 pools, half EC (9+6) and half replica 3. It also has 2 CephFS filesystems with 1 MDS each. 2 days ago, in a period spanning 16 hours, 13 OSDs crashed with an OOM. The OSDs were first restarte

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
Hi again, Looking for more info on the degraded filesystem, I managed to connect to the dashboard, where I see an error not reported explicitly by 'ceph health': One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at w
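
For anyone following along, the MDS and rank state the dashboard surfaces can also be inspected from the CLI (a generic sketch, not output from this cluster):

    # per-filesystem MDS/rank overview
    ceph fs status
    # full MDSMap, including failed or damaged ranks
    ceph fs dump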

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
OSD mystery is solved. Both OSDs were LVM-based, imported as vdisks for Ceph VMs. Apparently something scrambled either the VM manager or the host disk subsystem, as the VM disks were getting I/O errors and even disappearing from the VM. I rebooted the physical machine and that cleared it. All

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
And sorry for all these mails, I forgot to mention that we are running 18.2.2. Michel On 26/03/2025 at 21:51, Michel Jouvin wrote: Hi again, Looking for more info on the degraded filesystem, I managed to connect to the dashboard, where I see an error not reported explicitly by 'ceph he

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
OK. I couldn't find a quick way to shovel a largish file from an internal server into pastebin, but my own servers can suffice. The URLs are: https://www.mousetech.com/share/cephadm.log https://www.mousetech.com/share/cephadm.log And I don't see a deployment either. On 3/26/25 14:26, Eugen

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Ilya Dryomov
On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein wrote: > > Details of this release are summarized here: > > https://tracker.ceph.com/issues/70563#note-1 > Release Notes - TBD > LRC upgrade - TBD > > Seeking approvals/reviews for: > > smoke - Laura approved? > > rados - Radek, Laura approved? Travi

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Guillaume ABRIOUX
Hi Yuri, ceph-volume is missing this backport [1]. Also for this release you will need to run the teuthology orch/cephadm test suite for validating ceph-volume rather than the usual "ceph-volume functional test suite" [2] [1] https://github.com/ceph/ceph/pull/62178 [2] https://jenkins.ceph.com/

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is included in 18.2.5, then we really need [2] merged before release, as it would be blocking Rook. [1] https://github.com/ceph/ceph/pull/62095 (merged to squid on March 19) [2] https://tracker.ceph.com/issues/70667 Thanks! Travi

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Laura Flores
Rados approved: https://tracker.ceph.com/projects/rados/wiki/REEF#v1825-httpstrackercephcomissues70563note-1 On Wed, Mar 26, 2025 at 12:22 PM Venky Shankar wrote: > Hi Yuri, > > On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar wrote: > > > > On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein > wrote:

[ceph-users] Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I finally got brave and migrated from Pacific to Reef, did some banging and hammering and for the first time in a long time got a complete "HEALTH OK" status. However, the dashboard is still not happy. It cannot contact the Prometheus API on port 9095. I have redeployed Prometheus multiple t
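
If the Prometheus deployment itself turns out to be healthy, the dashboard side is worth checking too; the API URL it uses is configurable (the host below is just an example):

    ceph dashboard get-prometheus-api-host
    ceph dashboard set-prometheus-api-host 'http://dell02.mousetech.com:9095'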

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
I tried something else, but the result is not really satisfying. I edited the keepalive.conf files which had no peers at all or only one peer, so they were all identical. Restarting the daemons helped in having only one virtual IP assigned, so now the daemons did communicate and I see messages
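
For context, the kind of ingress spec under discussion looks roughly like this (the service id, hosts and virtual IP are examples); with keepalive_only: true only keepalived is deployed, no haproxy:

    service_type: ingress
    service_id: nfs.cephfs
    placement:
      hosts:
      - ceph02
      - ceph03
      - ceph04
    spec:
      backend_service: nfs.cephfs
      virtual_ip: 192.168.168.100/24
      keepalive_only: true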

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
Thanks, I removed the ingress service and redeployed it again, with the same result. The interesting part here is that the configs are identical compared to the previous deployment, so the same peers (or no peers) as before. Quoting Robert Sander: On 3/25/25 at 18:55, Eugen Block wrote: O

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Robert Sander
On 3/25/25 at 18:55, Eugen Block wrote: Okay, so I don't see anything in the keepalive log about the daemons communicating with each other. The config files are almost identical: no difference in priority, but there is in unicast_peer. ceph03 has no entry at all for unicast_peer, ceph02 has only ceph03 in there