Hi,
IIRC, in a different thread you pasted your max-backfills config
(osd_max_backfills) and it was the lowest possible value (1), right?
That's why your backfill is slow.
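For reference, one way to check the value in effect and to raise it (the OSD id and the target value below are just examples):
# value currently in effect on one OSD
ceph config show osd.0 osd_max_backfills
# raise it cluster-wide via the config database
ceph config set osd osd_max_backfills 4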
Zitat von "Szabo, Istvan (Agoda)" :
Hi,
By default in the newer versions of ceph when you increase the
pg_num the cluster will start to
Thanks for the summary, Dan!
I'm still hesitant to upgrade our production environment from Nautilus to
Octopus, though your experience sounds reassuring. I have one question: did you
also switch to cephadm and containerize all daemons? We haven't made a
decision yet, but I guess at some point we'll hav
Hi everyone,
I had the autoscale_mode set to "on" and the autoscaler went to work and
started adjusting the number of PGs in that pool. Since this implies a
huge shift in data, the reweights that the balancer had carefully
adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
bec
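For anyone in the same situation, the autoscaler's plans can be inspected and, if necessary, it can be disabled for a single pool (pool name is just an example):
# see what the autoscaler intends to do per pool
ceph osd pool autoscale-status
# stop it from touching a particular pool
ceph osd pool set mypool pg_autoscale_mode off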
Hi Eugen,
All of our prod clusters are still old school rpm packages managed by
our private puppet manifests. Even our newest pacific pre-prod cluster
is still managed like that.
We have a side project to test and move to cephadm / containers but
that is still a WIP. (Our situation is complicated
Hi Andras,
I'm not aware of any showstoppers to move directly to pacific. Indeed
we already run pacific on a new cluster we built for our users to try
cephfs snapshots at scale. That cluster was created with octopus a few
months ago then upgraded to pacific at 16.2.4 to take advantage of the
stray
Hello everybody,
I have a "stupid" question. Why is it recommended in the docs to set the
osd flag to noout during an upgrade/maintainance (and especially during
an osd upgrade/maintainance) ?
In my understanding, if an osd goes down, after a while (600s by
default) it's marked out and the c
Hi,
We do run several Ceph clusters, but one has a strange problem.
It is running Octopus 15.2.14 on 9 servers (HP 360 Gen 8, 64 GB RAM, 10 Gbps)
with 48 OSDs (all 2 TB Samsung SSDs with BlueStore). Monitoring in Grafana
shows these three latency values over 7 days:
ceph_osd_op_r_latency_sum: avg 1.1
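Side note: ceph_osd_op_r_latency_sum on its own is a raw counter; dashboards usually divide the rate of the _sum series by the rate of the matching _count series to get an average latency. The same counters can be read on an OSD host from the admin socket (OSD id is just an example, jq assumed to be installed):
# avgcount and sum (plus avgtime on recent releases) for read ops
ceph daemon osd.0 perf dump | jq '.osd.op_r_latency'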
I understand, thanks for sharing!
Quoting Dan van der Ster:
Hi Eugen,
All of our prod clusters are still old school rpm packages managed by
our private puppet manifests. Even our newest pacific pre-prod cluster
is still managed like that.
We have a side project to test and move to cephadm
Hello,
From my experience, I see three reasons:
- You don't want to recover data that you already have on a down OSD;
rebalancing can have a big impact on performance
- If the upgrade/maintenance goes wrong, you will want to focus on that issue and
not have to deal with things done by Ceph meanw
Yeah, you don't want to deal with backfilling while the cluster is
upgrading. At best it delays the upgrade; at worst, mixed-version
backfilling has (rarely) caused issues in the past.
We additionally `set noin` and disable the balancer: `ceph balancer off`.
The former prevents broken OSDs from r
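In other words, roughly this before the upgrade and the reverse afterwards (only a sketch of the pattern, not an exact procedure):
ceph osd set noin
ceph balancer off
# ... run the upgrade ...
ceph osd unset noin
ceph balancer on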
On 21.09.2021 09:11, Kobi Ginon wrote:
> for sure the balancer affects the status
Of course, but setting several PGs to degraded is something else.
> i doubt that your customers will be writing so many objects at the same
> rate as the test.
I only need 2 hosts running rados bench to get several P
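A typical rados bench invocation for this kind of write test might look like this (pool name and parameters are just an example):
# 60 seconds of 4 MB object writes with 16 concurrent ops, keeping the objects for later read tests
rados bench -p testpool 60 write -t 16 --no-cleanup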
To get an idea of how much work is left, take a look at `ceph osd pool ls
detail`. There should be a pg_num_target value... The OSDs will merge or split
PGs until pg_num matches that value.
.. Dan
On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote:
> Hi everyone,
>
> I had the autoscale_mode set to "on"
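A minimal way to watch the progress Dan describes (pool name is just an example; the exact output format varies by release):
# compare pg_num / pgp_num against pg_num_target to see the remaining split/merge work
ceph osd pool ls detail | grep mypool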
Hi,
By default, in newer versions of Ceph, when you increase pg_num the
cluster will slowly increase pgp_num up to the value of pg_num.
I've increased the EC data pool from 32 to 128, but one node has also been
added to the cluster and it's very slow.
pool 28 'hkg.rgw.bucket
Hi Dan,
This is excellent to hear - we've also been a bit hesitant to upgrade
from Nautilus (which has been working so well for us). One question:
did you/would you consider upgrading straight to Pacific from Nautilus?
Can you share the thinking that led you to Octopus first?
Thanks,
An
Hi,
I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max
backfills via
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
I get no output:
root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
osd.0: {}
osd.1: {}
osd.2: {}
osd.3: {}
osd.4: {}
os
Hi,
In the past you would get this output if the value was not actually changing;
try with another value.
I don't know if things have changed in the latest Ceph versions.
-
Etienne Menguy
etienne.men...@croit.io
> On 22 Sep 2021, at 15:34, Pascal Weißhaupt
> wrote:
>
> Hi,
>
>
>
> I recently upgraded from Ceph 1
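One way to double-check whether the value actually took effect is to ask a single OSD directly on its host (OSD id is just an example):
# prints the osd_max_backfills value the daemon is currently running with
ceph daemon osd.0 config get osd_max_backfills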
God damn...you are absolutely right - my bad.
Sorry and thanks for that...
-Original Message-
From: Etienne Menguy
Sent: Wednesday, 22 September 2021 15:48
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Change max backfills
Hi,
In the past you had this output if value
That's already been increased to 4.
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---
-Original Message-
From: Eugen Block
Sent: Wednesday
Stabilization period: Friday, 17th September - Friday, 1st October
Submission deadline: Monday, 1st November 2021 AoE
The IO500 [1] is now accepting and encouraging submissions for the
upcoming 9th semi-annual IO500 list, in conjunction with SC'21. Once
again, we are also accepting submissions
Is there a way to re-calibrate the various 'global recovery event' and
related 'remaining time' estimators?
For the last three days I've been assured that a 19h event will be over
in under 3 hours...
Previously I think Microsoft held the record for the most incorrect
'please wait' progress i
We had worked on pushing a change to fix
https://tracker.ceph.com/issues/50526, a deadlock in remoto, here:
https://github.com/alfredodeza/remoto/pull/63
A new version, 1.2.1, was built to help with this. With the Ceph
release 16.2.6 (at least), we see that 1.1.4 is again part of the
containers. Lookin
I'm wondering if this was installed using pip/PyPI before, and has now
switched to using EPEL? That would explain it: 1.2.1 may never have
been pushed to EPEL.
David
On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote:
>
> We'd worked on pushing a change to fix
> https://tracker.ceph.com/issues/5052
Hi All,
I have a recurring single PG that keeps going inconsistent. A scrub is
enough to pick up the problem. The primary OSD log shows something like:
2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] :
1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_chann
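For a PG like 1.3ff the usual next steps are along these lines (a sketch; inspect the output before repairing anything):
# list the objects/shards the scrub flagged as inconsistent
rados list-inconsistent-obj 1.3ff --format=json-pretty
# then, once the damage is understood, ask the primary to repair the PG
ceph pg repair 1.3ff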
Indeed. In a large enough cluster, even a few minutes of extra
backfill/recovery per OSD adds up. Say you have 100 OSD nodes and just 3
minutes of unnecessary backfill per node: that's 300 minutes, which prolongs
your upgrade by 5 hours.
> Yeah you don't want to deal with backfilling while the cluster is
> upgradi
In addition, from my experience:
I often set noout, norebalance and nobackfill before doing maintenance. This
greatly speeds up peering (when adding new OSDs) and reduces unnecessary load
from all daemons. In particular, if there is heavy client IO going on at the
same time, the ceph daemons ar
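Concretely, that is something like the following around the maintenance window (only a sketch of the pattern described above):
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
# ... perform the maintenance ...
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout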
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c
^^ if people want to test and provide feedback for a potential merge
to EPEL8 stable.
David
On Wed, Sep 22, 2021 at 11:43 AM David Orman wrote:
>
> I'm wondering if this was installed using pip/pypi before, and now
> switched t
If you look at the current pg_num in that `ceph osd pool ls detail` output
Dan mentioned, you can set the pool's pg_num to whatever that value currently
is, which will effectively pause the PG changes. I did this recently
when decreasing the number of PGs in a pool, which took several weeks
to complete. This
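As a concrete sketch (pool name and value are just examples):
# note the pool's current pg_num in the output
ceph osd pool ls detail | grep mypool
# then pin pg_num to the value it has reached to pause further splitting/merging
ceph osd pool set mypool pg_num 64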