[ceph-users] Re: Enterprise SSD/NVME

2025-01-13 Thread Adam Prycki
Hi, in general you need an SSD with power loss protection (big capacitors). Ceph uses sync to flush data to the drive before confirming the write. SSDs with PLP can lie a bit and confirm writes faster, because PLP should guarantee that data in the cache will be written to flash. This post is about journal
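
[ed: a minimal sketch, not part of the message above, of the kind of single-threaded sync-write test commonly used to compare PLP vs. non-PLP drives. The device path /dev/sdX is a placeholder, and the test destroys data on that device.]

  # WARNING: destructive to /dev/sdX -- run against a spare drive only.
  # Drives with PLP typically show far lower latency on 4k sync writes.
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test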

[ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience

2025-01-13 Thread Joshua Baergen
Note that 'norebalance' disables the balancer but doesn't prevent backfill; you'll want to set 'nobackfill' as well. Josh On Sun, Jan 12, 2025 at 1:49 PM Anthony D'Atri wrote: > > [ ed: snag during moderation (somehow a newline was interpolated in the > Subject), so I’m sending this on behalf o
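
[ed: a sketch of the flag handling being suggested; the surrounding steps are assumptions, not from the thread.]

  ceph osd set norebalance
  ceph osd set nobackfill
  # ... apply the CRUSH map / rack changes ...
  ceph osd unset nobackfill      # allow backfill once the new layout is in place
  ceph osd unset norebalance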

[ceph-users] Re: Modify or override ceph_default_alerts.yml

2025-01-13 Thread Devin A. Bougie
Hi Eugen, No, as far as I can tell I only have one prometheus service running.
———
[root@cephman2 ~]# ceph orch ls prometheus --export
service_type: prometheus
service_name: prometheus
placement:
  count: 1
  label: _admin
[root@cephman2 ~]# ceph orch ps --daemon-type prometheus
NAME

[ceph-users] Re: Snaptriming speed degrade with pg increase

2025-01-13 Thread Szabo, Istvan (Agoda)
Hi, a quick update on this topic: offline compacting all OSDs seems to be the solution for us. After that, all the snaptrimming can finish in an hour rather than a day. From: Szabo, Istvan (Agoda) Sent: Friday, November 29, 2024 2:31:33 PM To: Bandelow, Gunnar ; Ceph U
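
[ed: a hedged sketch of offline OSD compaction; the exact unit names and data paths depend on the deployment and are assumptions here. Stop the OSD first and do one OSD at a time.]

  systemctl stop ceph-osd@<id>
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
  systemctl start ceph-osd@<id>
  # online alternative:
  ceph tell osd.<id> compact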

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-13 Thread Frank Schilder
Dear all, a quick update and some answers. We set up a dedicated host for running an MDS and debugging the problem. On this host we have 750G RAM, 4T swap and 4T log, both on fast SSDs. The plan is to monitor with "perf top" the MDS becoming the designated MDS for the problematic rank, and also pull
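
[ed: one way to watch the stray counters while such an MDS starts up; a sketch only, the daemon name is a placeholder and counter names may differ by release.]

  ceph daemon mds.<name> perf dump | grep -i stray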

[ceph-users] Re: Modify or override ceph_default_alerts.yml

2025-01-13 Thread Devin A. Bougie
Thanks, Eugen! Just in case you have any more suggestions, this still isn’t quite working for us. Perhaps one clue is that in the Alerts view of the cephadm dashboard, every alert is listed twice. We see two CephPGImbalance alerts, both set to 30% after redeploying the service. If I then foll

[ceph-users] Re: Modify or override ceph_default_alerts.yml

2025-01-13 Thread Eugen Block
Do you have two Prometheus instances? Maybe you could share: ceph orch ls prometheus --export Or alternatively: ceph orch ps --daemon-type prometheus You can use two instances for HA, but then you need to change the threshold for both, of course. Quoting "Devin A. Bougie": Thanks, Eugen!

[ceph-users] Re: Slow initial boot of OSDs in large cluster with unclean state

2025-01-13 Thread Thomas Byrne - STFC UKRI
Thanks for the input, Josh. The reason I actually started looking into this is that we're adding some SSD OSDs to this cluster, and they were basically as slow on their initial boot as HDD OSDs when the cluster hasn't trimmed OSDmaps in a while. I'd be interested to know if other people seeing this slo
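
[ed: a sketch of how to check how far behind osdmap trimming is; the field names are from memory and may vary by release.]

  ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'
  # a large gap between the two epochs means booting OSDs have a long osdmap backlog to catch up on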

[ceph-users] Cephfs mds not trimming after cluster outage

2025-01-13 Thread Adam Prycki
Hello, we are having issues with our CephFS cluster; any help would be appreciated. We are still running 18.2.0. During the holidays we had an outage caused by the rootfs filling up. OSDs started randomly dying and there was a period when not all PGs were active. This issue has already been solved and all OSDs work fin
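
[ed: a few read-only checks, not from the original message, that are commonly used when an MDS is behind on trimming.]

  ceph health detail | grep -i trim          # look for MDS_TRIM ("behind on trimming") warnings
  ceph fs status
  ceph config get mds mds_log_max_segments   # journal trimming threshold currently in effect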

[ceph-users] Re: Modify or override ceph_default_alerts.yml

2025-01-13 Thread Eugen Block
Ah, I checked on a newer test cluster (Squid) and now I see what you mean. The alert is shown per OSD in the dashboard; if you open the dropdown you see which daemons are affected. I think it works a bit differently in Pacific (that's what the customer is still running) when I last had to mod

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-13 Thread Frédéric Nass
Hi Frank, More than ever. You should open a tracker and post debug logs there so anyone can have a look. Regards, Frédéric. From: Frank Schilder Sent: Monday, January 13, 2025 17:39 To: ceph-users@ceph.io Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Sp

[ceph-users] Re: OSDs won't come back after upgrade

2025-01-13 Thread Jorge Garcia
I'm not sure, but I think it has to do with network communication between the OSDs? In any case, you can probably make it work with SELinux given the appropriate settings, but I wasn't using SELinux before, so disabling it was the easiest solution. On Sat, Jan 11, 2025 at 1:24 PM Alvaro Soto wrote
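
[ed: a minimal sketch of the SELinux change being described; whether permissive mode is sufficient for this case is an assumption.]

  getenforce                      # show the current mode
  setenforce 0                    # switch to permissive until the next reboot
  # persist across reboots:
  sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config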