[ceph-users] Squid and Ganesha and ingress options?

2025-05-05 Thread Nigel Williams
We deployed a new Cephadm Squid cluster (400 OSDs) and are happy with everything, except that this time we tried Ganesha again with the ingress option (HAProxy) and soon ran into a failed-daemon event. There is a GitHub issue for it: https://github.com/nfs-ganesha/nfs-ganesha/issues/1158 Our qu
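For reference, a minimal sketch (assuming a cephadm-managed NFS service; the service name is a placeholder) of the commands typically used to inspect the failed daemon and the ingress (haproxy/keepalived) pieces:

    # list the NFS and ingress services cephadm is managing
    ceph orch ls nfs
    ceph orch ls ingress
    # show individual daemons and their state (running/error)
    ceph orch ps --daemon-type nfs
    ceph orch ps --service-name ingress.nfs.mynfs   # placeholder service name
    # recent cephadm events often say why a daemon went into error
    ceph log last cephadm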

[ceph-users] Re: Question about PR merge

2024-04-17 Thread Nigel Williams
Hi Xiubo, Is the issue we provided logs for the same as Erich's, or is that a third, different locking issue? thanks, nigel. On Thu, 18 Apr 2024 at 12:29, Xiubo Li wrote: > > On 4/18/24 08:57, Erich Weiler wrote: > >> Have you already shared information about this issue? Please do if not. > > > > I

[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Nigel Williams
On Wed, 10 Apr 2024 at 14:01, Xiubo Li wrote: > > I assume if this fix is approved and backported it will then appear in > > like 18.2.3 or something? > > > Yeah, it will be backported after being well tested. > We believe we are being bitten by this bug too, looking forward to the fix. thanks.

[ceph-users] Re: ceph orch upgrade to 18.2.1 seems stuck on MDS?

2024-02-07 Thread Nigel Williams
On Wed, 7 Feb 2024 at 20:00, Nigel Williams wrote: > > and just MDS left to do but upgrade has been sitting for hours on this > > resolved by rebooting a single host...still not sure why this fixed it other than it had a standby MDS that would

[ceph-users] ceph orch upgrade to 18.2.1 seems stuck on MDS?

2024-02-07 Thread Nigel Williams
Kicked off ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.1 and just the MDS daemons are left to do, but the upgrade has been sitting for hours on this: root@rdx-00:~# ceph orch upgrade status { "target_image": "quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f",
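A hedged sketch of the checks that usually apply when an orchestrated upgrade appears stuck on the MDS step (nothing here is specific to this cluster):

    ceph orch upgrade status          # overall progress and any error message
    ceph -W cephadm                   # watch what the orchestrator is doing right now
    ceph orch ps --daemon-type mds    # per-daemon image versions
    ceph fs status                    # active/standby MDS layout
    # the upgrade will not restart an active MDS without a standby available;
    # failing over the mgr sometimes unsticks a stalled orchestrator
    ceph mgr fail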

[ceph-users] Re: Awful new dashboard in Reef

2023-09-07 Thread Nigel Williams
On Thu, 7 Sept 2023 at 18:05, Nicola Mori wrote: > Is it just me or maybe my impressions are shared by someone else? Is > there anything that can be done to improve the situation? > I wonder about the implementation choice for this dashboard. I find with our Reef cluster it seems to get stuck du

[ceph-users] Re: Reef - what happened to OSD spec?

2023-08-29 Thread Nigel Williams
Thanks Eugen for following up. Sorry my second response was incomplete. I can confirm that it works as expected too. My confusion was that the section from the online documentation seemed to be missing/moved, and when it initially failed I wrongly thought that the OSD-add process had changed in the

[ceph-users] Re: Reef - what happened to OSD spec?

2023-08-28 Thread Nigel Williams
On Tue, 29 Aug 2023 at 10:09, Nigel Williams wrote: > and giving it a try it fails when it bumps into the root drive (which has > an active LVM). I expect I can add a filter to avoid it. > I found the cause of this initial failure when applying the spec from the web-gui. Even though I

[ceph-users] Reef - what happened to OSD spec?

2023-08-28 Thread Nigel Williams
We upgraded to Reef from Quincy and all went smoothly (thanks, Ceph developers!). When adding OSDs, the process seems to have changed: the docs no longer mention the OSD spec, and giving it a try, it fails when it bumps into the root drive (which has an active LVM volume). I expect I can add a filter to avoid it.
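OSD service specs still exist in Reef. As a hedged sketch (the service_id, host pattern, and size filter are placeholders), a spec that only selects rotational drives above a size threshold — and so skips a small root/boot device — can be previewed with a dry run before applying:

    cat > osd_spec.yml <<'EOF'
    service_type: osd
    service_id: hdd_osds            # placeholder name
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        rotational: 1
        size: '1TB:'                # only devices at least this large
    EOF
    ceph orch apply osd -i osd_spec.yml --dry-run   # preview which devices would be used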

[ceph-users] Re: Slow recovery on Quincy

2023-05-22 Thread Nigel Williams
We're on 17.2.5 and had the default value (5.2), but changing it didn't seem to impact recovery speed: root@rdx-00:/# ceph config get osd osd_mclock_cost_per_byte_usec_hdd 5.20 root@rdx-00:/# ceph config show osd.0 osd_op_queue mclock_scheduler root@rdx-00:/# ceph config set osd osd_mclock_cos
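For context, a hedged sketch of the usual first step on Quincy: rather than tuning osd_mclock_cost_per_byte_usec_hdd directly, switch the mclock profile to the recovery-oriented one while recovery catches up:

    ceph config get osd osd_mclock_profile
    # bias the scheduler toward recovery/backfill instead of client I/O
    ceph config set osd osd_mclock_profile high_recovery_ops
    # return to the default profile once recovery has caught up
    ceph config set osd osd_mclock_profile high_client_ops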

[ceph-users] Re: Quincy full osd(s)

2022-07-25 Thread Nigel Williams
Hi Wesley, thank you for the follow up. Anthony D'Atri kindly helped me out with some guidance and advice and we believe the problem is resolved now. This was a brand new install of a Quincy cluster and I made the mistake of presuming that autoscale would adjust the PGs as required; however, it ne

[ceph-users] Quincy full osd(s)

2022-07-23 Thread Nigel Williams
With current 17.2.1 (cephadm) I am seeing an unusual HEALTH_ERR. Adding files to a new, empty cluster (replica 3, CRUSH rule is by host), OSDs became 95% full and reweighting them to any value does not cause backfill to start. If I reweight the three too-full OSDs to 0.0 I get a large number of misplaced
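A hedged sketch (pool name and pg_num are placeholders) of checking whether the autoscaler has actually raised pg_num on the data pool and nudging it if not:

    ceph osd pool autoscale-status          # current vs target PG_NUM per pool
    ceph osd df tree                        # PG count and fullness per OSD
    # flag the pool expected to hold most of the data so the autoscaler
    # sizes it up front, or set pg_num explicitly
    ceph osd pool set mypool bulk true      # placeholder pool name
    ceph osd pool set mypool pg_num 1024    # placeholder value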

[ceph-users] Re: v17.2.0 Quincy released

2022-04-20 Thread Nigel Williams
Excellent work, everyone! Regarding this: "Quincy does not support LevelDB. Please migrate your OSDs and monitors to RocksDB before upgrading to Quincy." Is there a convenient way to determine this for cephadm and non-cephadm setups? What happens if LevelDB is still active? Does it cause an immedi
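A hedged sketch of how this is commonly checked; the paths are assumptions and differ between package-based and cephadm layouts:

    # monitors record their key-value backend in the mon data directory
    cat /var/lib/ceph/mon/ceph-*/kv_backend      # package-based layout
    cat /var/lib/ceph/*/mon.*/kv_backend         # cephadm layout (fsid in the path)
    # BlueStore OSDs always use RocksDB; only FileStore OSDs could still
    # be carrying a LevelDB omap store
    ceph osd metadata | grep '"osd_objectstore"' | sort | uniq -c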

[ceph-users] Re: replace MON server keeping identity (Octopus)

2022-03-30 Thread Nigel Williams
Thank you York, that suggestion worked well. 'ceph-deploy mon destroy' on the old server followed by new server identity change, then 'ceph-deploy mon create' on this replacement worked. On Wed, 30 Mar 2022 at 19:06, York Huang wrote: > the shrink-mon.yml and add-mon.yml playbooks may give yo
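For reference, a sketch of that sequence with placeholder hostnames (this mirrors the commands named in the post, nothing more):

    ceph-deploy mon destroy old-mon-host    # remove the old monitor from the cluster
    # ...re-provision the replacement with the new identity (IP/hostname)...
    ceph-deploy mon create new-mon-host     # add the replacement monitor
    ceph mon stat                           # confirm it has joined quorum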

[ceph-users] replace MON server keeping identity (Octopus)

2022-03-29 Thread Nigel Williams
This is a ceph-deploy setup. I would welcome suggestions as to how to replace a server hosting a MON; none of the following has worked for me. This fails: 1. Set up a new server and copy the monmap/keys to it. 2. Shut down the old server (cluster out of quorum). 3. Change the new server identity (IP, host

[ceph-users] Re: ceph fs Maximum number of files supported

2021-11-22 Thread Nigel Williams
On Sat, 20 Nov 2021 at 02:26, Yan, Zheng wrote: > we have FS contain more than 40 billions small files. > That is an impressive statistic! Are you able to share the output of ceph -s / ceph df /etc to get an idea of your cluster deployment? thanks.

[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Nigel Williams
Could we see the content of the bug report, please? That RH Bugzilla entry seems to have restricted access: "You are not authorized to access bug #1996680." On Wed, 22 Sept 2021 at 03:32, Patrick Donnelly wrote: > You're probably hitting this bug: > https://bugzilla.redhat.com/show_bug.cgi?id=199

[ceph-users] Re: podman daemons in error state - where to find logs?

2021-09-01 Thread Nigel Williams
Thanks for the tip. All OSD logs on all hosts are zero length for me, though; I suspect a permission problem, but most hosts don't have a ceph user defined.

[ceph-users] cephadm 15.2.14 - mixed container registries?

2021-09-01 Thread Nigel Williams
I managed to upgrade to 15.2.14 by doing: ceph orch upgrade start --image quay.io/ceph/ceph:v15.2.14 (anything else I tried would fail) When I look in ceph orch ps output though I see quay.io for most image sources, but alertmanager, grafana, node-exporter are coming from docker.io Before doing
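A hedged sketch of repointing the monitoring-stack images at quay.io via the cephadm mgr module options; the image names and tags here are placeholders, not verified defaults:

    ceph config set mgr mgr/cephadm/container_image_grafana       quay.io/ceph/ceph-grafana:6.7.4
    ceph config set mgr mgr/cephadm/container_image_alertmanager  quay.io/prometheus/alertmanager:v0.21.0
    ceph config set mgr mgr/cephadm/container_image_node_exporter quay.io/prometheus/node-exporter:v1.0.1
    ceph config set mgr mgr/cephadm/container_image_prometheus    quay.io/prometheus/prometheus:v2.18.1
    # redeploy so the daemons are recreated from the new images
    ceph orch redeploy grafana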

[ceph-users] Re: podman daemons in error state - where to find logs?

2021-08-31 Thread Nigel Williams
To answer my own question, the logs are meant to be in /var/log/ceph//... However, on this host they were all zero length. On Tue, 31 Aug 2021 at 20:51, Nigel Williams wrote: > > Where to find more detailed logs? or do I need to adjust a log-level > firs

[ceph-users] podman daemons in error state - where to find logs?

2021-08-31 Thread Nigel Williams
Ubuntu 20.04.3, Octopus 15.2.13, cephadm + podman. After a routine reboot, all OSDs on one host did not come up. After a few iterations of cephadm deploy and fixing the missing config file, the daemons remain in the error state, but neither journalctl nor systemctl shows any log errors other than exit s
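A hedged sketch (the fsid and OSD id are placeholders) of where cephadm keeps per-daemon state and how to pull logs when /var/log/ceph is empty:

    cephadm ls                                     # per-daemon state as cephadm sees it (run on the host)
    systemctl status ceph-<fsid>@osd.12            # systemd unit for one containerized daemon
    cephadm logs --name osd.12                     # wraps journalctl for that daemon (run on the host)
    journalctl -u ceph-<fsid>@osd.12 --since "1 hour ago"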

[ceph-users] Re: v15-2-14-octopus no docker images on docker hub ceph/ceph ?

2021-08-19 Thread Nigel Williams
On Sun, 15 Aug 2021 at 00:10, Jadkins21 wrote: > Am I just being too impatient ? or did I miss something around docker > being discontinued for cephadmin ? (hope not, it's great) > not showing via podman either: root@rnk-00:~# podman pull docker.io/ceph/ceph:v15.2.14 Trying to pull docker.io/ce
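The images for that release ended up on quay.io rather than docker.io, so as a short sketch the pull/upgrade against quay is the workaround:

    podman pull quay.io/ceph/ceph:v15.2.14
    ceph orch upgrade start --image quay.io/ceph/ceph:v15.2.14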

[ceph-users] Octopus MDS hang under heavy setfattr load

2021-05-16 Thread Nigel Williams
One of my colleagues attempted to set quotas on a large number (some dozens) of users with the session below, but it caused the MDS to hang and reject client requests. Offending command was: cat recent-users | xargs -P16 -I% setfattr -n ceph.quota.max_bytes -v 8796093022208 /scratch/% Result was
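A hedged workaround sketch (the path and quota value are copied from the post; the serial loop and throttle are assumptions, not a verified fix) that applies the same quota one user at a time instead of 16 parallel setfattr streams:

    while read -r user; do
      setfattr -n ceph.quota.max_bytes -v 8796093022208 "/scratch/$user"
      sleep 0.2    # throttle so the MDS is not hit with a burst of setattr ops
    done < recent-users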

[ceph-users] Re: Bluestore performance tuning for hdd with nvme db+wal

2020-06-30 Thread Nigel Williams
On Wed, 1 Jul 2020 at 01:47, Anthony D'Atri wrote: > > However when I've looked at the IO metrics for the nvme it seems to be only > > lightly loaded, so does not appear to be the issue (at 1st sight anyway). > > How are you determining “lightly loaded”. Not iostat %util I hope. For reference,
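For context, a sketch of the iostat view usually used here (device name is a placeholder): on NVMe, %util saturates long before the device does, so the queue-depth and latency columns are the more telling ones.

    iostat -x 1 /dev/nvme0n1
    # watch aqu-sz (queue depth) and r_await/w_await (latency) rather than %util;
    # NVMe serves many requests in parallel, so 100% util does not mean saturated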

[ceph-users] Re: Nautilus OSD memory consumption?

2020-02-26 Thread Nigel Williams
On Thu, 27 Feb 2020 at 13:08, Nigel Williams wrote: > On Thu, 27 Feb 2020 at 06:27, Anthony D'Atri wrote: > > If the heap stats reported by telling the OSD `heap stats` is large, > > telling each `heap release` may be useful. I suspect a TCMALLOC > > shortcoming.

[ceph-users] Re: Nautilus OSD memory consumption?

2020-02-26 Thread Nigel Williams
On Thu, 27 Feb 2020 at 06:27, Anthony D'Atri wrote: > If the heap stats reported by telling the OSD `heap stats` is large, telling > each `heap release` may be useful. I suspect a TCMALLOC shortcoming. osd.158 tcmalloc heap stats: MALLOC: 572
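A sketch of the commands being discussed (the OSD id is the one quoted in the post):

    ceph tell osd.158 heap stats      # dump tcmalloc heap statistics
    ceph tell osd.158 heap release    # return freed-but-retained memory to the OS
    ceph tell osd.\* heap release     # or do it across all OSDs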

[ceph-users] Re: Nautilus OSD memory consumption?

2020-02-26 Thread Nigel Williams
On Wed, 26 Feb 2020 at 23:56, Mark Nelson wrote: > Have you tried dumping the mempools? ... > One reason this can happen for example is if you > have a huge number of PGs (like many thousands per OSD). We are relying on the pg autoscaler to set the PGs, and so far it seems to do the right thing.
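A hedged sketch of the checks Mark is suggesting (run on the OSD's host; the OSD id is a placeholder):

    ceph daemon osd.0 dump_mempools   # per-pool memory use inside the OSD
    ceph osd df tree                  # PGs per OSD, to rule out a huge PG count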

[ceph-users] Re: Nautilus OSD memory consumption?

2020-02-25 Thread Nigel Williams
More examples of rampant OSD memory consumption: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1326773 ceph 20 0 11.585g 0.011t 34728 S 110.3 8.6 14:26.87 ceph-osd 204622 ceph 20 0 16.414g 0.015t 34808 S 100.3 12.5 17:53.36 ceph-osd 5706 ceph

[ceph-users] Nautilus OSD memory consumption?

2020-02-25 Thread Nigel Williams
The OOM-killer is on the rampage and striking down hapless OSDs when the cluster is under heavy client IO. The memory target does not seem to be much of a limit; is this intentional? root@cnx-11:~# ceph-conf --show-config|fgrep osd_memory_target osd_memory_target = 4294967296 osd_memory_target_cg
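For what it's worth, osd_memory_target is a best-effort cache-sizing target rather than a hard limit, so it will not by itself keep the OOM killer away. A sketch (the value is only an example) of checking and lowering it to leave more headroom:

    ceph config get osd osd_memory_target
    # leave more room for the kernel and for allocations outside the caches
    ceph config set osd osd_memory_target 3221225472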

[ceph-users] Re: moving small production cluster to different datacenter

2020-01-30 Thread Nigel Williams
Did you end up having all new IPs for your MONs? I've wondered how a large KVM deployment should be handled when the instance metadata has a hard-coded list of MON IPs for the cluster. How are they changed en masse with running VMs? Or do these moves always result in at least one MON with an origin

[ceph-users] Re: High swap usage on one replication node

2019-12-08 Thread Nigel Williams
On Sun, 8 Dec 2019 at 00:53, Martin Verges wrote: > Swap is nothing you want to have in a Server as it is very slow and can cause > long downtimes. Given the commentary on this page advocating at least some swap to enable Linux to manage memory when under pressure: https://utcc.utoronto.ca/~cks