[ceph-users] Re: Modify pgp number after pg_num increased

2021-09-22 Thread Eugen Block
Hi, IIRC in a different thread you pasted your max-backfill config and it was the lowest possible value (1), right? That's why your backfill is slow. Quoting "Szabo, Istvan (Agoda)": Hi, By default in the newer versions of ceph when you increase the pg_num the cluster will start to
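
A rough sketch of how that limit can be checked and raised at runtime (the value 4 below is only illustrative, not a recommendation from the thread):

    # show the value a specific OSD is currently using
    ceph config show osd.0 osd_max_backfills
    # raise it for all OSDs via the central config
    ceph config set osd osd_max_backfills 4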

[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Eugen Block
Thanks for the summary, Dan! I'm still hesitant to upgrade our production environment from N to O, though your experience sounds reassuring. I have one question: did you also switch to cephadm and containerize all daemons? We haven't made a decision yet, but I guess at some point we'll hav

[ceph-users] Balancer vs. Autoscaler

2021-09-22 Thread Jan-Philipp Litza
Hi everyone, I had the autoscale_mode set to "on" and the autoscaler went to work and started adjusting the number of PGs in that pool. Since this implies a huge shift in data, the reweights that the balancer had carefully adjusted (in crush-compat mode) are now rubbish, and more and more OSDs bec
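
For reference, the autoscaler's pending changes can be inspected and, if needed, paused per pool while the balancer catches up (the pool name is a placeholder):

    ceph osd pool autoscale-status
    ceph osd pool set <pool> pg_autoscale_mode off   # or "warn"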

[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Dan van der Ster
Hi Eugen, All of our prod clusters are still old school rpm packages managed by our private puppet manifests. Even our newest pacific pre-prod cluster is still managed like that. We have a side project to test and move to cephadm / containers but that is still a WIP. (Our situation is complicated

[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Dan van der Ster
Hi Andras, I'm not aware of any showstoppers to move directly to pacific. Indeed we already run pacific on a new cluster we built for our users to try cephfs snapshots at scale. That cluster was created with octopus a few months ago then upgraded to pacific at 16.2.4 to take advantage of the stray

[ceph-users] Why set osd flag to noout during upgrade ?

2021-09-22 Thread Francois Legrand
Hello everybody, I have a "stupid" question. Why is it recommended in the docs to set the osd flag to noout during an upgrade/maintenance (and especially during an osd upgrade/maintenance)? In my understanding, if an osd goes down, after a while (600s by default) it's marked out and the c
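
For context, the flag in question is set and cleared like this; the 600 s window mentioned above corresponds to mon_osd_down_out_interval:

    ceph osd set noout
    # ... perform the upgrade/maintenance ...
    ceph osd unset noout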

[ceph-users] High overwrite latency

2021-09-22 Thread Erwin Ceph
Hi, We do run several Ceph clusters, but one has a strange problem. It is running Octopus 15.2.14 on 9 (HP 360 Gen 8, 64 GB, 10 Gbps) servers, 48 OSDs (all 2 TB Samsung SSDs with Bluestore). Monitoring in Grafana shows these three latency values over 7 days: ceph_osd_op_r_latency_sum: avg 1.1
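
A possible first step when cluster-wide averages like these hide a few slow devices is to look at per-OSD latency (osd.12 below is just a placeholder):

    # commit/apply latency per OSD, in milliseconds
    ceph osd perf
    # dump the op latency counters of one suspect OSD
    ceph tell osd.12 perf dump | grep -A 3 op_w_latency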

[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Eugen Block
I understand, thanks for sharing! Quoting Dan van der Ster: Hi Eugen, All of our prod clusters are still old school rpm packages managed by our private puppet manifests. Even our newest pacific pre-prod cluster is still managed like that. We have a side project to test and move to cephadm

[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Etienne Menguy
Hello, From my experience, I see three reasons: - You don’t want to recover data you already have on a down OSD; rebalancing can have a big impact on performance. - If the upgrade/maintenance goes wrong you will want to focus on that issue and not have to deal with things done by Ceph meanw

[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Dan van der Ster
Yeah you don't want to deal with backfilling while the cluster is upgrading. At best it can delay the upgrade; at worst, mixed-version backfilling has (rarely) caused issues in the past. We additionally `set noin` and disable the balancer: `ceph balancer off`. The former prevents broken osds from r
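
Taken together, the pre-upgrade sequence being described is roughly the following (a sketch, not an official checklist):

    ceph osd set noout
    ceph osd set noin
    ceph balancer off
    # ... upgrade ...
    ceph balancer on
    ceph osd unset noin
    ceph osd unset noout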

[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?

2021-09-22 Thread Kai Stian Olstad
On 21.09.2021 09:11, Kobi Ginon wrote: > for sure the balancer affects the status Of course, but setting several PGs to degraded is something else. > i doubt that your customers will be writing so many objects at the same rate as the test I only need 2 hosts running rados bench to get several P
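
The load generator referred to is rados bench; run from two hosts against a test pool it looks roughly like this (pool name and duration are placeholders):

    rados bench -p testpool 60 write --no-cleanup
    # remove the benchmark objects afterwards
    rados -p testpool cleanup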

[ceph-users] Re: Balancer vs. Autoscaler

2021-09-22 Thread Dan van der Ster
To get an idea how much work is left, take a look at `ceph osd pool ls detail`. There should be pg_num_target... The osds will merge or split PGs until pg_num matches that value. .. Dan On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote: > Hi everyone, > > I had the autoscale_mode set to "on"
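
A quick way to watch the remaining split/merge work, per that suggestion (pool name is a placeholder):

    ceph osd pool ls detail | grep <pool>
    # compare pg_num against pg_num_target (and pgp_num against pgp_num_target)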

[ceph-users] Modify pgp number after pg_num increased

2021-09-22 Thread Szabo, Istvan (Agoda)
Hi, By default in the newer versions of Ceph, when you increase the pg_num the cluster will slowly increase the pgp_num up to the value of pg_num. I've increased the EC data pool's pg_num from 32 to 128, but 1 node has also been added to the cluster and it's very slow. pool 28 'hkg.rgw.bucket
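
If the gradual pgp_num adjustment needs to move faster, the knob that throttles it is the mgr option target_max_misplaced_ratio (default 0.05); a hedged example, with 0.10 purely illustrative and <pool> a placeholder:

    ceph config set mgr target_max_misplaced_ratio 0.10
    # or set pgp_num on the pool directly
    ceph osd pool set <pool> pgp_num 128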

[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Andras Pataki
Hi Dan, This is excellent to hear - we've also been a bit hesitant to upgrade from Nautilus (which has been working so well for us). One question: did you/would you consider upgrading straight to Pacific from Nautilus? Can you share your thoughts that led you to Octopus first? Thanks, An

[ceph-users] Change max backfills

2021-09-22 Thread Pascal Weißhaupt
Hi, I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max backfills via ceph tell 'osd.*' injectargs '--osd-max-backfills 1' I get no output: root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1' osd.0: {} osd.1: {} osd.2: {} osd.3: {} osd.4: {} os
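
One way to confirm the value actually in effect, regardless of the empty injectargs output, and to set it via the central config instead (a sketch):

    ceph tell osd.0 config get osd_max_backfills
    ceph config set osd osd_max_backfills 1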

[ceph-users] Re: Change max backfills

2021-09-22 Thread Etienne Menguy
Hi, In the past you got this output if the value was not changing; try with another value. I don't know if things have changed with the latest Ceph version. - Etienne Menguy etienne.men...@croit.io > On 22 Sep 2021, at 15:34, Pascal Weißhaupt > wrote: > > Hi, > > > > I recently upgraded from Ceph 1

[ceph-users] Re: Change max backfills

2021-09-22 Thread Pascal Weißhaupt
God damn...you are absolutely right - my bad. Sorry and thanks for that... -Original Message- From: Etienne Menguy Sent: Wednesday, 22 September 2021 15:48 To: ceph-users@ceph.io Subject: [ceph-users] Re: Change max backfills Hi, In the past you had this output if value

[ceph-users] Re: Modify pgp number after pg_num increased

2021-09-22 Thread Szabo, Istvan (Agoda)
That's already been increased to 4. Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com --- -Original Message- From: Eugen Block Sent: Wednesday

[ceph-users] IO500 SC’21 Call for Submission

2021-09-22 Thread IO500 Committee
Stabilization period: Friday, 17th September - Friday, 1st October Submission deadline: Monday, 1st November 2021 AoE The IO500 [1] is now accepting and encouraging submissions for the upcoming 9th semi-annual IO500 list, in conjunction with SC'21. Once again, we are also accepting submissions

[ceph-users] "Remaining time" under-estimates by 100x....

2021-09-22 Thread Harry G. Coin
Is there a way to re-calibrate the various 'global recovery event' and related 'remaining time' estimators? For the last three days I've been assured that a 19h event will be over in under 3 hours... Previously I think Microsoft held the record for the most incorrect 'please wait' progress i

[ceph-users] Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
We'd worked on pushing a change to fix https://tracker.ceph.com/issues/50526 for a deadlock in remoto here: https://github.com/alfredodeza/remoto/pull/63 A new version, 1.2.1, was built to help with this. With the Ceph release 16.2.6 (at least), we see 1.1.4 is again part of the containers. Lookin
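
A way to check which remoto version ends up inside a given container image, assuming podman and the CentOS-based quay.io image (both the tool and the tag are placeholders here):

    podman run --rm --entrypoint bash quay.io/ceph/ceph:v16.2.6 \
        -c 'rpm -qa | grep -i remoto'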

[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
I'm wondering if this was installed using pip/pypi before, and now switched to using EPEL? That would explain it - 1.2.1 may never have been pushed to EPEL. David On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote: > > We'd worked on pushing a change to fix > https://tracker.ceph.com/issues/5052

[ceph-users] One PG keeps going inconsistent (stat mismatch)

2021-09-22 Thread Simon Ironside
Hi All, I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like: 2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_chann
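
For reference, the usual way to inspect and repair such a PG, using the PG id from the log above; for a pure stat mismatch the inconsistent-object listing is often empty and the repair just recalculates the stats:

    rados list-inconsistent-obj 1.3ff --format=json-pretty
    ceph pg repair 1.3ff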

[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Anthony D'Atri
Indeed. In a large enough cluster, even a few minutes of extra backfill/recovery per OSD adds up. Say you have 100 OSD nodes, and just 3 minutes of unnecessary backfill per node. That prolongs your upgrade by 5 hours. > Yeah you don't want to deal with backfilling while the cluster is > upgradi

[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Frank Schilder
In addition, from my experience: I often set noout, norebalance and nobackfill before doing maintenance. This greatly speeds up peering (when adding new OSDs) and reduces unnecessary load from all daemons. In particular, if there is heavy client IO going on at the same time, the ceph daemons ar

[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c ^^ if people want to test and provide feedback for a potential merge to EPEL8 stable. David On Wed, Sep 22, 2021 at 11:43 AM David Orman wrote: > > I'm wondering if this was installed using pip/pypi before, and now > switched t

[ceph-users] Re: Balancer vs. Autoscaler

2021-09-22 Thread Richard Bade
If you look at the current pg_num in that pool ls detail command that Dan mentioned, you can set the pool's pg_num to that current value, which will effectively pause the PG changes. I did this recently when decreasing the number of PGs in a pool, which took several weeks to complete. This
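
In command form, the trick described is roughly the following, where <pool> is a placeholder and <current> is whatever pg_num the ls detail output reports at that moment:

    ceph osd pool ls detail | grep <pool>
    ceph osd pool set <pool> pg_num <current>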