[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Anthony D'Atri
More likely the problem would just migrate. I suggest `ceph pg repair 32.7ef`. If the situation doesn’t improve within a few minutes, try `ceph osd down 100` > On Mar 27, 2025, at 6:40 AM, Frédéric Nass > wrote: > > - A PG that cannot be reactivated with a remap operation that > does
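For reference, a minimal sketch of that sequence, using the PG and OSD ids quoted in the thread (32.7ef, osd.100):

    # ask the PG's primary to repair the inconsistency
    ceph pg repair 32.7ef

    # if nothing improves after a few minutes, mark the primary down
    # so the PG re-peers with the next OSD in the acting set
    ceph osd down 100

    # watch the PG state and overall cluster health
    ceph pg 32.7ef query | grep '"state"'
    ceph -s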

[ceph-users] Server not responding to keepalive - cephadm 24.04

2025-03-27 Thread Reid Kelley
Setting up a new cluster on fresh ubuntu 24.04 hosts using cephadm. The first 5 hosts all added without issue, but the next N hosts all throw the same error when adding through the dashboard or through ceph orch add... All hosts have ssh access, docker and all base requirements confirmed. Command:
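For comparison, the usual host-add sequence with cephadm looks roughly like this (hostname and IP are placeholders); the cluster's SSH key has to be on the target host before the orchestrator can reach it:

    # push the cluster's public SSH key to the new host
    ssh-copy-id -f -i /etc/ceph/ceph.pub root@new-host

    # register the host with the orchestrator
    ceph orch host add new-host 10.0.0.21

    # confirm it is listed and not flagged offline
    ceph orch host ls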

[ceph-users] Re: Ceph orch placement anti affinity

2025-03-27 Thread Anthony D'Atri
My understanding is that anti-affinity will be enforced unless the service spec explicitly allows more than one instance per host. > > Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts. > > I use ceph orch apply mds to spin up the MDS daemons. > > Is there a way to ensure
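As an illustration, a spec along these lines (filesystem name and hostnames hypothetical) makes the one-daemon-per-host layout explicit; per the above, `count_per_host: 1` should already be the default behaviour:

    cat > mds-cephfs1.yaml <<'EOF'
    service_type: mds
    service_id: cephfs1
    placement:
      hosts:
        - mds-host-1
        - mds-host-2
        - mds-host-3
      count_per_host: 1
    EOF
    ceph orch apply -i mds-cephfs1.yaml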

[ceph-users] Re: space size issue

2025-03-27 Thread Anthony D'Atri
Look at `ceph osd df`. Is the balancer enabled? > On Mar 27, 2025, at 8:50 AM, Mihai Ciubancan > wrote: > > Hello, > > My name is Mihai, and I have started using Ceph this month for an HPC cluster. > When it was launched into production the available space shown was 80TB; now it is > 16TB and I didn'
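The checks being suggested, roughly:

    # per-OSD utilization and PG counts; a large spread suggests imbalance
    ceph osd df

    # pool-level stored vs. available space (MAX AVAIL accounts for replication)
    ceph df

    # is the balancer enabled, and in which mode?
    ceph balancer status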

[ceph-users] Ceph orch placement anti affinity

2025-03-27 Thread Kasper Rasmussen
Let’s say I have 2 cephfs, and three hosts I want to use as MDS hosts. I use ceph orch apply mds to spin up the MDS daemons. Is there a way to ensure that I don’t get two active MDS running on the same host? I mean when using the ceph orch apply mds command, I can specify --placement, but it on

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-27 Thread Laura Flores
quincy-x approved: https://tracker.ceph.com/projects/rados/wiki/REEF#v1825-httpstrackercephcomissues70563note-1-upgradequincy-x Asking Radek and Neha about pacific-x. On Thu, Mar 27, 2025 at 9:54 AM Yuri Weinstein wrote: > Venky, Guillaume pls review and approve fs and orch/cephadm > > Still awa

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-27 Thread Tim Holloway
Thanks for your patience. host ceph06 isn't referenced in the config database. I think I've finally purged it. I also reset the dashboard API host address from ceph08 to dell02. But since prometheus isn't running on dell02 either, there's no gain there. I did clear some of that lint out via
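For the record, the dashboard's Prometheus endpoint is set with something like the following (URL and port are assumptions, 9095 being cephadm's usual Prometheus port), and it only helps if a Prometheus instance actually runs there:

    # point the dashboard at the host that really runs Prometheus
    ceph dashboard set-prometheus-api-host 'http://dell02:9095'

    # with cephadm, Prometheus itself can be placed on that host
    ceph orch apply prometheus --placement="dell02"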

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Michel Jouvin
Hello, It seems we are at the end of our stressful adventure! After the big rebalancing finished, without errors but without any significant impact on the pool access problem, we decided to reboot all our OSD servers one by one. The first good news is that it cleared all the reported issues (

[ceph-users] Re: reef 18.2.5 QE validation status

2025-03-27 Thread Yuri Weinstein
Venky, Guillaume pls review and approve fs and orch/cephadm Still awaiting arrivals: rados - Travis? Nizamudeen? Adam King approved? rgw - Adam E approved? fs - Venky approved? upgrade-clients:client-upgrade-octopus-reef-reef - Ilya please take a look. There are multiple runs upgrade/pacific

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Frédéric Nass
Michel, I can't recall any situations like that - maybe someone here does? - but I would advise that you restart all OSDs to trigger the re-peering of every PG. This should get your cluster back on track. Just make sure the crush map / crush rules / bucket weights (including OSDs weights) hav
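A sketch of that approach on a cephadm-managed cluster (OSD id hypothetical): confirm the CRUSH side first, then restart OSDs one at a time and let peering settle in between:

    # sanity-check the CRUSH view before restarting anything
    ceph osd tree
    ceph osd crush rule dump

    # rolling restart, one OSD at a time
    ceph orch daemon restart osd.17
    ceph -s   # wait until peering settles before the next OSD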

[ceph-users] space size issue

2025-03-27 Thread Mihai Ciubancan
Hello, My name is Mihai, and I have started using Ceph this month for an HPC cluster. When it was launched into production the available space shown was 80TB; now it is 16TB and I didn't do anything, while I have 12 OSDs (SSDs of 14TB): sudo ceph osd tree ID CLASS WEIGHT TYPE NAME

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Michel Jouvin
Frédéric, When I was writing the last email, my colleague launched a re-peering of the PG in activating state: the PG became active immediately but triggered a little bit of rebalancing of other PGs, not necessarily in the same pool. After this success, we decided to go for your approach, sel
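For reference, that kind of re-peering can be triggered per PG rather than per OSD, along these lines (PG id as quoted elsewhere in the thread):

    # ask a single PG to re-peer without restarting its OSDs
    ceph pg repeer 32.7ef
    ceph pg 32.7ef query | grep '"state"'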

[ceph-users] Re: Prometheus anomaly in Reef

2025-03-27 Thread Tim Holloway
It gets worse. It looks like the physical disk backing the 2 failing OSDs is failing. I destroyed the host for one of them - which caused me to flash back to the nightmare of having a deleted OSD get permanently stuck deleting, just like in Pacific. Because I cannot restart the OSD, the deletio
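When a removal is stuck because the daemon can no longer start, the cephadm removal queue can be inspected and the removal forced, roughly as follows (OSD id hypothetical):

    # see what the removal queue is waiting for
    ceph orch osd rm status

    # force removal when the daemon cannot be drained or restarted
    ceph orch osd rm 123 --force

    # if CRUSH/auth entries linger afterwards, purge them
    ceph osd purge 123 --yes-i-really-mean-it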

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Michel Jouvin
Frédéric, Thanks for your answer. I checked the number of PGs on osd.17: it is 164, very far from the hard limit (750, the default I think). So it doesn't seem to be the problem, and maybe the peering is a victim of the more general problem leading to many pools being more or less inaccessible.

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Frédéric Nass
Hi Michel, A common reason for PGs being stuck during activation is reaching the hard limit of PGs per OSD. You might want to compare the number of PGs osd.17 has (ceph osd df tree | grep -E 'osd.17 |PGS') to the hard limit set in your cluster (echo "`ceph config get osd.0 mon_max_pg_per_osd`*`
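The command above is cut off by the digest; as I read it, the check multiplies mon_max_pg_per_osd by the hard-ratio option, roughly as follows (defaults 250 x 3 = 750, matching the figure mentioned elsewhere in the thread):

    # PGs currently mapped to osd.17
    ceph osd df tree | grep -E 'osd.17 |PGS'

    # hard limit = mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio
    echo "$(ceph config get osd.0 mon_max_pg_per_osd) * $(ceph config get osd.0 osd_max_pg_per_osd_hard_ratio)" | bc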

[ceph-users] Re: Device missing from "ceph device ls"

2025-03-27 Thread Torkil Svensgaard
On 27/03/2025 10:10, Torkil Svensgaard wrote: Hi 19.2.1 " [root@franky ~]# ceph device ls | grep franky ATA_HGST_HDN726060ALE614_K1GV9P4B franky:sda osd.579 now ATA_HGST_HDS724040ALE640_PK1334PBH7PZ5P franky:sdn osd.577
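Related views that may help narrow down why a device is absent (host and daemon names taken from the output above):

    # devices as seen per host and per OSD daemon
    ceph device ls-by-host franky
    ceph device ls-by-daemon osd.577

    # what the orchestrator sees on the node, including rejected devices
    ceph orch device ls franky --wide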

[ceph-users] Re: Production cluster in bad shape after several OSD crashes

2025-03-27 Thread Michel Jouvin
Hi, I have not seen an answer yet, help would be very much appreciated as our production cluster seems to be in worse shape than initially described... After a deeper analysis, we found that more than half of the pools, despite being reported as OK, are not accessible: the 'rados ls' command is stuck
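For this kind of "pool looks OK but I/O hangs" situation, a usual first pass is something like:

    # anything not active+clean, plus slow/blocked op warnings
    ceph health detail
    ceph pg dump_stuck inactive unclean

    # which OSDs, if any, are blocking peering
    ceph osd blocked-by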

[ceph-users] Device missing from "ceph device ls"

2025-03-27 Thread Torkil Svensgaard
Hi 19.2.1 " [root@franky ~]# ceph device ls | grep franky ATA_HGST_HDN726060ALE614_K1GV9P4B franky:sda osd.579 now ATA_HGST_HDS724040ALE640_PK1334PBH7PZ5P franky:sdn osd.577 now ATA_HGST_HDS72