[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Oh sorry, forget my last email, thanks Laura for pointing out the obvious that this is for reef, not squid! On Wed, Mar 26, 2025 at 2:46 PM Travis Nielsen wrote: > Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is > included in 18.2.5, then we really need [2] merged before r

[ceph-users] Re: Ceph orch placement - anti affinity

2025-03-26 Thread Eugen Block
If you don't specify "count_per_host", the orchestrator won't deploy multiple daemons on one host. There's no way (that I'm aware of) to specify a primary daemon. Since standby daemons need to be able to take over the workload, they should all be equally equipped. Quoting Kasper Rasmussen
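
A minimal sketch of what such a spec could look like (the service_id and host names below are hypothetical); without count_per_host, cephadm places at most one daemon of this service on each listed host:

    service_type: mds
    service_id: fs1
    placement:
      hosts:
      - host1
      - host2
      - host3
      count: 3

It can then be applied with 'ceph orch apply -i mds-fs1.yaml'.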

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
service_type: prometheus
service_name: prometheus
placement:
  hosts:
  - dell02.mousetech.com
networks:
- 10.0.1.0/24

Can't list daemon logs, run restart, etc., because "Error EINVAL: No daemons exist under service name "prometheus". View currently running services using "ceph orch ls"" And y

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Then maybe the deployment did fail and we're back to looking at the cephadm.log. Quoting Tim Holloway: it returns nothing. I'd already done the same via "systemctl | grep prometheus". There simply isn't a systemd service, even though there should be. On 3/26/25 11:31, Eugen Block w

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
If you need a proxy to pull the images, I suggest setting it in containers.conf:

cat /etc/containers/containers.conf
[engine]
env = ["http_proxy=:", "https_proxy=:", "no_proxy="]

But again, you should be able to see a failed pull in the cephadm.log on dell02. Or even in 'ceph health
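
As a sketch with made-up proxy values (the real ones are elided above), the [engine] section would look roughly like this; podman's engine picks these variables up, including for the image pulls cephadm performs:

    # /etc/containers/containers.conf (values are examples only)
    [engine]
    env = ["http_proxy=http://proxy.example.com:3128",
           "https_proxy=http://proxy.example.com:3128",
           "no_proxy=localhost,127.0.0.1"]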

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
That would be the correct log file, but I don't see an attempt to deploy a prometheus instance there. You can use any pastebin you like, e.g. https://pastebin.com/, to upload your logs. Mask any sensitive data before you do that. Quoting Tim Holloway: Well, here's an excerpt from the /

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I don't think there is a failure to deploy. For one thing, I did have, as mentioned, 3 Prometheus-related containers running at one point on the machine. Also checked for port issues and there are none. Nothing listens on 9095. One thing that does concern me is that the docs say changes in settin

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
There's a service called "prometheus", which can have multiple daemons, just like any other service (mon, mgr, etc.). To get the daemon logs you need to provide the daemon name (prometheus.ceph02.andsopn), not just the service name (prometheus). Can you run the cephadm command I provided? It
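
To illustrate the difference (the daemon name below is a placeholder following the usual <service>.<host> pattern):

    # list the daemons behind the service, with their full names
    ceph orch ps --daemon_type prometheus
    # then fetch logs for one specific daemon, not the service
    cephadm logs --name prometheus.dell02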

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Since the containers are all podman, I found a "systemctl edit podman" command that's recommended for setting the proxy for that. However, once I did, 2 OSDs went down and cannot be restarted. In any event, before I did that, ceph health detail was returning "HEALTH OK". Now I'm getting this: HEALTH
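
For reference, an override created with "systemctl edit <unit>" is just a drop-in file; a proxy override (addresses are placeholders) typically looks like this:

    # /etc/systemd/system/<unit>.service.d/override.conf
    [Service]
    Environment="http_proxy=http://proxy.example.com:3128"
    Environment="https_proxy=http://proxy.example.com:3128"
    Environment="no_proxy=localhost,127.0.0.1"

Whether a unit literally named "podman" is the right target on a cephadm host is a separate question, since the cephadm-deployed daemons run under their own ceph-<fsid>@... units.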

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Also, here are the currently-installed container images:

[root@dell02 ~]# podman image ls
REPOSITORY           TAG       IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph              2bc0b0f4375d  8 months ago   1.25 GB
quay.io/ceph/ceph              3c4eff6082ae  10

[ceph-users] Ceph orch placement - anti affinity

2025-03-26 Thread Kasper Rasmussen
Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts. I use ceph orch apply mds to spin up the MDS daemons. Is there a way to ensure that I don’t get two active MDS running on the same host? I mean when using the ceph orch apply mds command, I can specify --placement, but it on
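
For illustration, the CLI form (filesystem names and hosts here are made up); --placement controls how many daemons run and on which hosts, but not which host a given rank ends up active on:

    ceph orch apply mds fs1 --placement="3 host1 host2 host3"
    ceph orch apply mds fs2 --placement="3 host1 host2 host3"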

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
The cephadm.log should show some details why it fails to deploy the daemon. If there's not much, look into the daemon logs as well (cephadm logs --name prometheus.ceph02.mousetech.com). Could it be that there's a non-cephadm prometheus already listening on port 9095? Zitat von Tim Holloway
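
A quick way to check for a conflicting listener on the Prometheus port (assuming ss and podman are available on the host):

    # anything already bound to 9095?
    ss -tlnp | grep ':9095'
    # any non-cephadm prometheus container running?
    podman ps --format '{{.Names}}' | grep -i prom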

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Can you share 'ceph orch ls prometheus --export'? And if it has been deployed successfully but is currently not running, the logs should show why that is the case. To restart prometheus, you can just run this to restart the entire prometheus service (which would include all instances if you
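
For reference, the two commands in question look like this (service name as deployed by cephadm):

    # dump the currently applied spec for the service
    ceph orch ls prometheus --export
    # restart every daemon belonging to the service
    ceph orch restart prometheus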

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Well, here's an excerpt from the /var/log/ceph/cephadm.log. I don't know if that's the mechanism or file you mean, though.

2025-03-26 13:11:09,382 7fb2abc38740 DEBUG cephadm ['--no-container-init', '--timeout', '

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
I added a run and rerun for the fs suite on a fix https://github.com/ceph/ceph/pull/62492 Venky, pls review and if approved I will merge it to reef and cherry-pick to the release branch. On Wed, Mar 26, 2025 at 8:04 AM Adam King wrote: > > orch approved. The suite is obviously quite red, but the

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Right, systemctl edit works as well. But I'm confused about the down OSDs. Did you set the proxy on all hosts? Because the down OSDs are on ceph06 while prometheus is supposed to run on dell02. Are you sure those are related? I would recommend removing the prometheus service entirely and s
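
A sketch of the remove-and-redeploy cycle being suggested (the spec file name is arbitrary):

    ceph orch rm prometheus
    # wait until 'ceph orch ps' no longer shows prometheus daemons, then
    ceph orch apply -i prometheus.yaml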

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. All I do know is immediately after editing the proxy settings I got indications that those 2 OSDs had gone down. What's REALLY strange is that their logs seem to hint that somehow they shifted

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
No change. On 3/26/25 13:01, Tim Holloway wrote: It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. All I do know is immediately after editing the proxy settings I got indications that those 2 OSDs had gone down. What's REALLY strange is that

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein wrote: > > I added a run and rerun for the fs suite on a fix > https://github.com/ceph/ceph/pull/62492 > > Venky, pls review and if approved I will merge it to reef and > cherry-pick to the release branch. Noted. I will let you know when it's ready t

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Venky Shankar
Hi Yuri, On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar wrote: > > On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein wrote: > > > > I added a run and rerun for the fs suite on a fix > > https://github.com/ceph/ceph/pull/62492 > > > > Venky, pls review and if approved I will merge it to reef and > >

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Eugen Block
Ok, I'll try one last time and ask for cephadm.log output. ;-) And the active MGR's log might help here as well. Quoting Tim Holloway: No change. On 3/26/25 13:01, Tim Holloway wrote: It's strange, but for a while I'd been trying to get prometheus working on ceph08, so I don't know. A

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Yuri Weinstein
Ack, Travis, I was about to reply the same. Venky, Guillaume, the PRs below were cherry-picked. I will rerun the fs and ceph-volume tests when the build is done https://github.com/ceph/ceph/pull/62492/commits https://github.com/ceph/ceph/pull/62178/commits On Wed, Mar 26, 2025 at 2:20 PM Travis Nie

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
Sorry, duplicated a URL. The mgr log is https://www.mousetech.com/share/ceph-mgr.log ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
Hi, We have a production cluster made of 3 mon+mgr, 18 OSD servers and ~500 OSDs, configured with ~50 pools, half EC (9+6) and half replica 3. It also has 2 CephFS filesystems with 1 MDS each. 2 days ago, in a period spanning 16 hours, 13 OSDs crashed with an OOM. The OSDs were first restarte

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
Hi again, Looking for more info on the degraded filesystem, I managed to connect to the dashboard, where I see an error not reported explicitly by 'ceph health': One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at w
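
For anyone following along, the MDS and rank state the dashboard surfaces can also be inspected from the CLI (a generic sketch, not output from this cluster):

    # per-filesystem MDS/rank overview
    ceph fs status
    # full MDSMap, including failed or damaged ranks
    ceph fs dump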

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
OSD mystery is solved. Both OSDs were LVM-based, imported as vdisks for Ceph VMs. Apparently something scrambled either the VM manager or the host disk subsystem, as the VM disks were getting I/O errors and even disappearing from the VM. I rebooted the physical machine and that cleared it. All

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-26 Thread Michel Jouvin
And sorry for all these mails, I forgot to mention that we are running 18.2.2. Michel On 26/03/2025 at 21:51, Michel Jouvin wrote: Hi again, Looking for more info on the degraded filesystem, I managed to connect to the dashboard, where I see an error not reported explicitly by 'ceph he

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
OK. I couldn't find a quick way to shovel a largish file from an internal server into pastebin, but my own servers can suffice. The URLs are: https://www.mousetech.com/share/cephadm.log https://www.mousetech.com/share/cephadm.log And I don't see a deployment either. On 3/26/25 14:26, Eugen

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Ilya Dryomov
On Mon, Mar 24, 2025 at 10:40 PM Yuri Weinstein wrote: > > Details of this release are summarized here: > > https://tracker.ceph.com/issues/70563#note-1 > Release Notes - TBD > LRC upgrade - TBD > > Seeking approvals/reviews for: > > smoke - Laura approved? > > rados - Radek, Laura approved? Travi

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Guillaume ABRIOUX
Hi Yuri, ceph-volume is missing this backport [1]. Also for this release you will need to run the teuthology orch/cephadm test suite for validating ceph-volume rather than the usual "ceph-volume functional test suite" [2] [1] https://github.com/ceph/ceph/pull/62178 [2] https://jenkins.ceph.com/

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Travis Nielsen
Yuri, as of when did 18.2.5 include the latest squid branch? If [1] is included in 18.2.5, then we really need [2] merged before release, as it would be blocking Rook. [1] https://github.com/ceph/ceph/pull/62095 (merged to squid on March 19) [2] https://tracker.ceph.com/issues/70667 Thanks! Travi

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-26 Thread Laura Flores
Rados approved: https://tracker.ceph.com/projects/rados/wiki/REEF#v1825-httpstrackercephcomissues70563note-1 On Wed, Mar 26, 2025 at 12:22 PM Venky Shankar wrote: > Hi Yuri, > > On Wed, Mar 26, 2025 at 8:59 PM Venky Shankar wrote: > > > > On Wed, Mar 26, 2025 at 8:37 PM Yuri Weinstein > wrote:

[ceph-users] Prometheus anomaly in Reef

2025-03-26 Thread Tim Holloway
I finally got brave and migrated from Pacific to Reef, did some banging and hammering and for the first time in a long time got a complete "HEALTH OK" status. However, the dashboard is still not happy. It cannot contact the Prometheus API on port 9095. I have redeployed Prometheus multiple t
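
If the Prometheus deployment itself turns out to be healthy, the dashboard side is worth checking too; the API URL it uses is configurable (the host below is just an example):

    ceph dashboard get-prometheus-api-host
    ceph dashboard set-prometheus-api-host 'http://dell02.mousetech.com:9095'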

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
I tried something else, but the result is not really satisfying. I edited the keepalive.conf files which had no peers at all or only one peer, so they were all identical. Restarting the daemons helped in having only one virtual IP assigned, so now the daemons did communicate and I see messages
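
For context, the kind of ingress spec under discussion looks roughly like this (the service id, hosts and virtual IP are examples); with keepalive_only: true only keepalived is deployed, no haproxy:

    service_type: ingress
    service_id: nfs.cephfs
    placement:
      hosts:
      - ceph02
      - ceph03
      - ceph04
    spec:
      backend_service: nfs.cephfs
      virtual_ip: 192.168.168.100/24
      keepalive_only: true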

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Eugen Block
Thanks, I removed the ingress service and redeployed it again, with the same result. The interesting part here is that the configs are identical compared to the previous deployment, so the same peers (or no peers) as before. Quoting Robert Sander: On 3/25/25 at 18:55, Eugen Block wrote: O

[ceph-users] Re: Reef: highly-available NFS with keepalive_only

2025-03-26 Thread Robert Sander
On 3/25/25 at 18:55, Eugen Block wrote: Okay, so I don't see anything in the keepalive log about the daemons communicating with each other. The config files are almost identical: no difference in priority, but there is in unicast_peer. ceph03 has no entry at all for unicast_peer, ceph02 has only ceph03 in there