[ceph-users] Re: failed to load OSD map for epoch 2898146, got 0 bytes

2024-10-21 Thread Dan van der Ster
Hi Frank, Do you have some more info about these OSDs -- how long were they down for? Were they down because of some IO errors? Is it possible that the OSD thinks it stored those osdmaps but IO errors are preventing them from being loaded? I know the log is large, but can you share at least a sn

[ceph-users] Re: failed to load OSD map for epoch 2898146, got 0 bytes

2024-10-21 Thread Frank Schilder
Hi Dan, maybe not. Looking at the output of grep -B 1 -e "2971464 failed to load OSD map for epoch 2898132" /var/log/ceph/ceph-osd.1004.log, which searches for the lines that start a cycle and also prints the line before each match, there might be some progress, but I'm not sure: 2024-10-21T17:41:40.173+0200 7
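For readability, the command referred to above (with the OSD id and epoch taken from the message itself) looks like this; the -B 1 option makes grep print the line immediately preceding each match, so the start of each cycle can be compared across iterations:

  grep -B 1 -e "2971464 failed to load OSD map for epoch 2898132" \
      /var/log/ceph/ceph-osd.1004.log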

[ceph-users] Re: failed to load OSD map for epoch 2898146, got 0 bytes

2024-10-21 Thread Vladimir Sigunov
Hi Dan and Frank, In my experience, if an OSD was down for a long period of time, it can take more than one manual restart for the OSD to catch up to the current epoch. By a manual restart I mean systemctl reset-failed && systemctl restart . The "warm up" time can be up to 15 minutes. Last ti
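A minimal sketch of the restart sequence being described, assuming a package-based install where OSDs run as ceph-osd@<id>.service systemd units (the OSD id 1004 is taken from this thread; cephadm clusters use ceph-<fsid>@osd.<id>.service instead):

  systemctl reset-failed ceph-osd@1004.service
  systemctl restart ceph-osd@1004.service

Allow the daemon its "warm up" time (up to ~15 minutes per the report above) before trying another restart.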

[ceph-users] CSC Election: Governance amendments and Ceph Executive Council Nominations

2024-10-21 Thread Patrick Donnelly
The Ceph Steering Committee is voting on a number of governance amendments and nominations to the Ceph Executive Council. The election is public and viewable here: https://vote.heliosvoting.org/helios/elections/e03494ce-e04c-41d0-bb05-ec5ccc632ce4/view The election closes on October 28th, 3pm UTC

[ceph-users] Re: failed to load OSD map for epoch 2898146, got 0 bytes

2024-10-21 Thread Dan van der Ster
Hi Frank, Are you sure it's looping over the same epochs? It looks like that old osd is trying to catch up on all the osdmaps it missed while it was down. (And those old maps are probably trimmed from all the mons and osds, based on the "got 0 bytes" error). Eventually it should catch up to the cu
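One way to see whether the OSD is catching up rather than looping (a sketch using standard CLI commands) is to compare the epoch requested in its "failed to load OSD map" messages against the cluster's current osdmap epoch over time:

  ceph osd stat    # prints the current osdmap epoch, e.g. "...; epoch: eNNNNNNN"

If the requested epoch keeps increasing toward the current one, the OSD is making progress.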

[ceph-users] Re: Ceph RGW performance guidelines

2024-10-21 Thread Anthony D'Atri
> > Not surprising for HDDs. Double your deep-scrub interval. > Done! If your PG ratio is low, say <200, bumping pg_num may help as well. Oh yeah, looking up your gist from a prior message, you average around 70 PG replicas per OSD. Aim for 200. Your index pool has way too few PGs.
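A sketch of how to check and adjust this; the pool name is an assumption (the default RGW index pool is usually called default.rgw.buckets.index) and the target pg_num is only an example:

  ceph osd df                                              # the PGS column shows PG replicas per OSD
  ceph osd pool get default.rgw.buckets.index pg_num
  ceph osd pool set default.rgw.buckets.index pg_num 128   # example value only

If the pg_autoscaler is active for the pool, it may override a manually set pg_num.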

[ceph-users] Re: Ceph RGW performance guidelines

2024-10-21 Thread Harry Kominos
> Not surprising for HDDs. Double your deep-scrub interval. Done! > So you’re relying on the SSD DB device for the index pool? Have you looked at your logs / metrics for those OSDs to see if there is any spillover? > What type of SSD are you using here? And how many HDD OSDs do you have using
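Two ways to check for spillover, as a sketch (osd.12 is a placeholder id):

  ceph health detail | grep -i BLUEFS_SPILLOVER
  ceph daemon osd.12 perf dump bluefs | grep -E 'slow_used_bytes|db_used_bytes'

A non-zero slow_used_bytes means part of the RocksDB data has spilled over from the SSD DB device onto the HDD.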

[ceph-users] failed to load OSD map for epoch 2898146, got 0 bytes

2024-10-21 Thread Frank Schilder
Hi all, I have a strange problem on a cluster running the latest Octopus release. We had a couple of SSD OSDs down for a while and brought them back up today. For some reason, these OSDs don't come up and flood the log with messages like: osd.1004 2971464 failed to load OSD map for epoch 2898146, got 0 bytes Th

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread mailing-lists
Hey there, I have that problem too, although I got it after updating from 17.2.7 to 18.2.4. After I read this mail I fiddled around a bit, and Prometheus does not have ceph_osd_recovery_ops. Then I looked into the files in /var/lib/ceph/xyz/prometheus.node-name/etc/prometheus/prometheus.yml
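A quick way to confirm whether Prometheus has the metric at all (host and port are assumptions; cephadm deploys Prometheus on port 9095 by default):

  curl -s 'http://<prometheus-host>:9095/api/v1/query?query=ceph_osd_recovery_ops'

An empty "result" array means the metric isn't being scraped, which would explain the empty panel.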

[ceph-users] Re: [EXTERNAL] Re: How to Speed Up Draining OSDs?

2024-10-21 Thread Alex Hussein-Kershaw (HE/HIM)
My pool size is indeed 3. Operator error 🙂 Thanks again, Alex

[ceph-users] Re: [EXTERNAL] Re: How to Speed Up Draining OSDs?

2024-10-21 Thread Eugen Block
If your pool size is three then no, you can't get it down to two OSDs. You can check (and paste) the 'ceph osd pool ls detail' output to see the current value. (I wouldn't recommend switching to size 2 except in test clusters.) Quoting "Alex Hussein-Kershaw (HE/HIM)": Hi Eugen, Thanks for the
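For reference, a sketch of reading off the replication setting (the pool name is a placeholder):

  ceph osd pool ls detail | grep <pool-name>
  # e.g.: pool 5 '<pool-name>' replicated size 3 min_size 2 ...

The "size" value is the number of replicas, and hence the minimum number of OSDs the data has to be spread across.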

[ceph-users] Re: [EXTERNAL] Re: How to Speed Up Draining OSDs?

2024-10-21 Thread Alex Hussein-Kershaw (HE/HIM)
Hi Eugen, Thanks for the suggestion. I've repeated my attempt with the wpq scheduler (I ran "ceph config set osd osd_op_queue wpq" and restarted all the OSDs). That still seems to be either slow or stuck in a draining state - 10 mins elapsed draining for just a few MB of data. $ ceph orch osd
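The switch being described, spelled out as a sketch for a cephadm-managed cluster (daemon ids are placeholders):

  ceph config set osd osd_op_queue wpq
  ceph orch daemon restart osd.0        # repeat for each OSD
  ceph config show osd.0 osd_op_queue   # verify the running daemon picked up wpq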

[ceph-users] Re: How to Speed Up Draining OSDs?

2024-10-21 Thread Eugen Block
Hi, for a production cluster I'd recommend sticking with wpq at the moment, where you can apply the "legacy" recovery settings. If you're willing to help the devs figure out how to get to the bottom of this, I'm sure they would highly appreciate it. But I currently know too little about mcloc
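Examples of the "legacy" recovery knobs that take effect under wpq (values are purely illustrative, not recommendations):

  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 8
  ceph config set osd osd_recovery_sleep_hdd 0.05

Under mclock these options are largely ignored in favor of the mclock profiles, which is why the scheduler choice matters for drain and recovery speed.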

[ceph-users] How to Speed Up Draining OSDs?

2024-10-21 Thread Alex Hussein-Kershaw (HE/HIM)
Hi Folks, I'm trying to scale in a Ceph cluster. It's running 19.2.0 and is cephadm-managed. It's just a test system, so it has basically no data and only 3 OSDs. As part of the scale-in, I run "ceph orch host drain --zap-osd-devices" as per Host Management — Ceph Documentation
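For context, a sketch of the drain workflow (the hostname is a placeholder; --zap-osd-devices is available in recent releases, including 19.2.0):

  ceph orch host drain <hostname> --zap-osd-devices
  ceph orch osd rm status          # watch draining/removal progress
  ceph orch host rm <hostname>     # once no daemons remain on the host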

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Frédéric Nass
This could be due to a typo in the panel definition in Grafana (comparing the JSON of the working panel with the non-working one might provide more insights) or because the Prometheus Datasource used by Grafana isn't providing any metrics for ceph_osd_recovery_ops. To check the panel in Grafan

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Sanjay Mohan
Hi Frédéric, Thank you for the response. I tried disabling and re-enabling the module, and it seems that the recovery metrics are indeed being collected, but still not displayed on the new dashboard. Interestingly, I have another environment running Ceph 19.2.0 where the recovery throughput is d
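The module toggle mentioned above, for reference (this reloads the dashboard module without restarting the mgr daemon itself):

  ceph mgr module disable dashboard
  ceph mgr module enable dashboard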

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Afreen
Hi all, Thanks for raising it here. We are aware of this, are looking at it now, and will get back to you all. Afreen On Mon, Oct 21, 2024 at 3:27 PM Eugen Block wrote: > Confirmed for 19.2.0 in a lab cluster. > > Quoting Eugen Bl

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Eugen Block
Confirmed for 19.2.0 in a lab cluster. Quoting Eugen Block: I see the same as Sanjay, there's no movement in the graph "Recovery Throughput"; this is on 19.1.1. If I switch to the classic dashboard though (ceph config set mgr mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false), I see recovery

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Eugen Block
I see the same as Sanjay, there's no movement in the graph "Recovery Throughput"; this is on 19.1.1. If I switch to the classic dashboard though (ceph config set mgr mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false), I see recovery traffic in the dashboard as well. I'll upgrade to 19.2.0 and re
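The toggle referred to above, spelled out; setting it back to true returns to the new dashboard:

  ceph config set mgr mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false   # classic dashboard
  ceph config set mgr mgr/dashboard/FEATURE_TOGGLE_DASHBOARD true    # new dashboard (default)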

[ceph-users] Re: Issue with Recovery Throughput Not Visible in Ceph Dashboard After Upgrade to 19.2.0 (Squid)

2024-10-21 Thread Frédéric Nass
Hi Sanjay, I've just checked the dashboard of a v19.2.0 cluster, and the recovery throughput is displayed correctly, as shown in the screenshot here [1]. You might want to consider redeploying the dashboard. Regards, Frédéric. [1] https://docs.ceph.com/en/latest/mgr/dashboard/ - On 19 Oct
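A sketch of what "redeploying the dashboard" could look like on a cephadm cluster (interpreting it as redeploying the embedded monitoring stack is an assumption):

  ceph orch redeploy grafana
  ceph orch redeploy prometheus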

[ceph-users] Re: Influencing the osd.id when creating or replacing an osd

2024-10-21 Thread Eugen Block
Hi, after you have reweighted the OSD to 0 and waited for the rebalancing to finish, you can just stop the OSD process and "purge" the OSD instead of marking it out (since the data reshuffling has already happened). The "purge" command does a couple of things at once, like removing it from the cr
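A sketch of the full sequence being described (the OSD id 12 and the non-cephadm systemd unit name are placeholders):

  ceph osd crush reweight osd.12 0            # drain the OSD's data
  # wait for rebalancing to finish, then:
  systemctl stop ceph-osd@12.service          # or: ceph orch daemon stop osd.12
  ceph osd purge 12 --yes-i-really-mean-it    # removes it from the CRUSH map, auth keys, and osdmap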