[ceph-users] OSD bootstrap time
Hi everyone,

recently I'm noticing that starting OSDs for the first time takes ages (more than an hour) before they are even picked up by the monitors as "up" and start backfilling. I'm not entirely sure whether this is a new phenomenon or whether it has always been that way. Either way, I'd like to understand why.

When I execute `ceph daemon osd.X status`, it reports "state: preboot" and I can see "newest_map" increase slowly. Apparently, a new OSD doesn't just fetch the latest OSD map and get to work, but instead fetches hundreds of thousands of OSD maps from the mon, burning CPU while parsing them.

I wasn't able to find any good documentation on the OSDMap, in particular why its historical versions need to be kept and why the OSD seemingly needs so many of them. Can anybody point me in the right direction? Or is something wrong with my cluster?

Best regards,
Jan-Philipp Litza
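For reference, a minimal sketch of how to watch that catch-up from the OSD host, assuming the admin socket is reachable and `jq` is installed (osd.12 is just a placeholder):

    # Compare the OSD's newest known map epoch with the cluster's current one.
    CLUSTER_EPOCH=$(ceph osd dump -f json | jq .epoch)
    while true; do
        OSD_EPOCH=$(ceph daemon osd.12 status | jq .newest_map)
        echo "osd.12 has map ${OSD_EPOCH} of ${CLUSTER_EPOCH}"
        sleep 10
    done

Once newest_map reaches the cluster epoch, the OSD should leave preboot and get marked up.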
[ceph-users] Re: OSD bootstrap time
Hi Rich,

> I've noticed this a couple of times on Nautilus after doing some large
> backfill operations. It seems the osd map doesn't clear properly after
> the cluster returns to Health OK and builds up on the mons. I do a
> "du" on the mon folder e.g. du -shx /var/lib/ceph/mon/ and this shows
> several GB of data.

It does, almost 8 GB for <300 OSDs, which has increased several-fold over the last weeks (since we started upgrading Nautilus->Pacific). However, I didn't think much of it after reading in the hardware recommendations [1] that at least 60 GB should be provisioned per ceph-mon.

> I give all my mgrs and mons a restart and after a few minutes I can
> see this osd map data getting purged from the mons. After a while it
> should be back to a few hundred MB (depending on cluster size).
> This may not be the problem in your case, but an easy thing to try.
> Note, if your cluster is being held in Warning or Error by something
> this can also explain the osd maps not clearing. Make sure you get the
> cluster back to health OK first.

Thanks for the suggestion, will try that once we reach HEALTH_OK.

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations
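A sketch of the check-and-restart routine Rich describes, assuming a plain systemd deployment with units named ceph-mon@<hostname> (cephadm/containerized setups name their units differently, and mon IDs may be letters like a, b, c in older clusters):

    # Check how large each mon store has grown (run on a mon host).
    du -shx /var/lib/ceph/mon/*

    # Restart this host's mon; do it host by host, waiting for quorum in between.
    systemctl restart ceph-mon@$(hostname -s)

    # Optionally ask the mon to compact its store afterwards.
    ceph tell mon.$(hostname -s) compact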
[ceph-users] Re: OSD bootstrap time
Hi Konstantin,

I mean freshly deployed OSDs. Restarted OSDs don't exhibit that behavior.

Best regards,
Jan-Philipp
[ceph-users] Re: stretched cluster or not, with mon in 3 DC and osds on 2 DC
Hi,

since I just read that documentation page [1] on Friday, I can't tell you anything that isn't on that page. But that particular problem of which monitor gets elected should be solvable simply by using the connectivity election mode [2], shouldn't it?

Apart from the latency to the mon, a stretch cluster is mainly about the failover behavior of the OSDs: when DC1 or DC2 fails, a cluster without stretch mode will try to replicate all the data within the surviving DC to reach size=4 again. With stretch mode, it will happily live with size=2 until the other DC comes back online.

So if it's reasonable to assume that a failed DC - god forbid - comes back online reasonably soon, and that the cluster can live with size=2 during that phase, then a stretch cluster is probably the better choice.

Also, as the documentation states, there are edge cases where, even with an appropriate CRUSH rule, size=4 min_size=2 doesn't necessarily mean you have a live copy of every PG in each of the two DCs.

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
[2]: https://docs.ceph.com/en/latest/rados/operations/change-mon-elections/
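For reference, a sketch of the commands involved, following [1] and [2]; mon names (a, b, c), DC names and the CRUSH rule name are placeholders, and the stretch rule itself has to be created first as shown in [1]:

    # Switch the monitors to the connectivity election strategy [2].
    ceph mon set election_strategy connectivity

    # Tell each monitor where it lives; the mon in the third DC is the tiebreaker.
    ceph mon set_location a datacenter=dc1
    ceph mon set_location b datacenter=dc2
    ceph mon set_location c datacenter=dc3

    # With a CRUSH rule "stretch_rule" that places two copies per DC,
    # enable stretch mode with mon c as the tiebreaker [1].
    ceph mon enable_stretch_mode c stretch_rule datacenter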
[ceph-users] Re: OSD bootstrap time
Hi again,

turns out the long bootstrap time was my own fault. I had some down&out OSDs for quite a long time, which prevented the monitors from pruning the OSD maps. Makes sense when I think about it, but I didn't before.

Rich's hint to get the cluster to HEALTH_OK first pointed me in the right direction, as did the docs on full OSDMap version pruning [1], which mention the constraints in OSDMonitor::get_trim_to(). So I destroyed the OSDs (they don't hold any data anyway) and the mons' DBs shrank by almost 8 GB to only ~160 MB.

Thanks for helping figure this out! I promise not to have lingering down&out OSDs anymore. ;-)

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/dev/mon-osdmap-prune/
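For anyone in the same spot, a minimal sketch of getting rid of such lingering OSDs (OSD id 12 is a placeholder; only do this for OSDs that really hold no needed data):

    # Verify the OSD can be removed without risking data.
    ceph osd safe-to-destroy 12

    # Remove it completely (CRUSH entry, auth key, OSD id). Once the cluster
    # is back to HEALTH_OK, the mons can trim old OSD maps again.
    ceph osd purge 12 --yes-i-really-mean-it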
[ceph-users] Re: Spurious Read Errors: 0x6706be76
Hi Jay,

I'm having the same problem; the setting doesn't affect the warning at all. I'm currently muting the warning every week or so (because it doesn't even seem to be present consistently, and every time it disappears for a moment, the mute is cancelled) with

    ceph health mute BLUESTORE_SPURIOUS_READ_ERRORS

Best regards,
Jan-Philipp
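If the mute keeps getting cancelled whenever the warning briefly clears, the sticky variant might help; a sketch (the one-week TTL is an arbitrary example):

    # --sticky keeps the mute in place even if the alert clears and comes back.
    ceph health mute BLUESTORE_SPURIOUS_READ_ERRORS 1w --sticky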
[ceph-users] Re: samba cephfs
That package probably contains the vfs_ceph module for Samba. However, further down, the same page says:

> The above share configuration uses the Linux kernel CephFS client, which is
> recommended for performance reasons.
> As an alternative, the Samba vfs_ceph module can also be used to communicate
> with the Ceph cluster.

So when you use a kernel mount, you shouldn't need the package at all.
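A rough sketch of the kernel-mount variant, with the mon address, client name, secret file and paths all being placeholders:

    # Mount CephFS with the kernel client.
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=samba,secretfile=/etc/ceph/samba.secret

    # Then export the mounted path like any local directory, without
    # "vfs objects = ceph" in smb.conf:
    #   [share]
    #       path = /mnt/cephfs/share
    #       read only = no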
[ceph-users] Balancer vs. Autoscaler
Hi everyone,

I had the autoscale_mode set to "on" and the autoscaler went to work and started adjusting the number of PGs in one of our pools. Since this implies a huge shift in data, the reweights that the balancer had carefully adjusted (in crush-compat mode) are now rubbish, and more and more OSDs are becoming nearfull (we sadly have very differently sized OSDs).

Now apparently both manager modules, balancer and pg_autoscaler, have the same threshold for operation, namely target_max_misplaced_ratio. So the balancer won't become active as long as the pg_autoscaler is still adjusting the number of PGs.

I already set the autoscale_mode to "warn" on all pools, but apparently the autoscaler is determined to finish what it started.

Is there any way to pause the autoscaler so the balancer has a chance of fixing the reweights? Because even in manual mode (ceph balancer optimize), the balancer won't compute a plan when the misplaced ratio is higher than target_max_misplaced_ratio.

I know about "ceph osd reweight-*", but those adjust the reweights (visible in "ceph osd tree"), whereas the balancer adjusts the "compat weight-set", which I don't know how to convert back to the old-style reweights.

Best regards,
Jan-Philipp
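For context, a few commands that show what the two mgr modules are up to and the shared threshold (nothing here changes any state):

    # What the autoscaler still wants to do per pool.
    ceph osd pool autoscale-status

    # Whether the balancer considers itself blocked, and the shared threshold.
    ceph balancer status
    ceph config get mgr target_max_misplaced_ratio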
[ceph-users] Re: Balancer vs. Autoscaler
I'll have to do some reading on what "pgp" means, but you are correct: the pg_num is already equal to pg_num_target, and only pgp_num is increasing (halfway there - at least that's something). Thanks for the suggestions, though not really applicable here!

Richard Bade wrote:
> If you look at the current pg_num in that pool ls detail command that
> Dan mentioned you can set the pool pg_num to what that value currently
> is, which will effectively pause the pg changes. I did this recently
> when decreasing the number of pg's in a pool, which took several weeks
> to complete. This let me get some other maintenance done before
> setting the pg_num back to the target num again.
> This works well for reduction, but I'm not sure if it works well for
> increase as I think the pg_num may reach the target much faster and
> then just the pgp_num changes till they match.
>
> Rich
>
> On Wed, 22 Sept 2021 at 23:06, Dan van der Ster wrote:
>>
>> To get an idea how much work is left, take a look at `ceph osd pool ls
>> detail`. There should be pg_num_target... The osds will merge or split PGs
>> until pg_num matches that value.
>>
>> .. Dan
>>
>> On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote:
>>
>>> Hi everyone,
>>>
>>> I had the autoscale_mode set to "on" and the autoscaler went to work and
>>> started adjusting the number of PGs in that pool. Since this implies a
>>> huge shift in data, the reweights that the balancer had carefully
>>> adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
>>> become nearful (we sadly have very different sized OSDs).
>>>
>>> Now apparently both manager modules, balancer and pg_autoscaler, have
>>> the same threshold for operation, namely target_max_misplaced_ratio. So
>>> the balancer won't become active as long as the pg_autoscaler is still
>>> adjusting the number of PGs.
>>>
>>> I already set the autoscale_mode to "warn" on all pools, but apparently
>>> the autoscaler is determined to finish what it started.
>>>
>>> Is there any way to pause the autoscaler so the balancer has a chance of
>>> fixing the reweights? Because even in manual mode (ceph balancer
>>> optimize), the balancer won't compute a plan when the misplaced ratio is
>>> higher than target_max_misplaced_ratio.
>>>
>>> I know about "ceph osd reweight-*", but they adjust the reweights
>>> (visible in "ceph osd tree"), whereas the balancer adjusts the "compat
>>> weight-set", which I don't know how to convert back to the old-style
>>> reweights.
>>>
>>> Best regards,
>>> Jan-Philipp

--
Jan-Philipp Litza

PLUTEX GmbH
Hermann-Ritter-Str. 108
28197 Bremen

Hotline: 0800 100 400 800
Telefon: 0800 100 400 821
Telefax: 0800 100 400 888
E-Mail: supp...@plutex.de
Internet: http://www.plutex.de

USt-IdNr.: DE 815030856
Handelsregister: Amtsgericht Bremen, HRB 25144
Geschäftsführer: Torben Belz, Hendrik Lilienthal
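A sketch of the "pause by pinning" trick Rich describes (pool name and values are placeholders; whether an in-flight pgp_num increase can be halted this way may depend on the release):

    # See the current pg_num / pgp_num and their targets for the pool.
    ceph osd pool ls detail | grep mypool

    # Pin them to the values they have right now to stop further changes,
    # then set them back to the intended target later.
    ceph osd pool set mypool pg_num 512
    ceph osd pool set mypool pgp_num 512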
[ceph-users] Re: Questions about tweaking ceph rebalancing activities
You are basically listing all the reasons one shouldn't have too much misplacement at once. ;-)

Your best bet probably is pgremapper [1], which I recently learned about on this list. With `cancel-backfill` you could stop any running backfill, and with `undo-upmaps` you could then specifically start backfilling for those OSDs you want to destroy. The idea behind pgremapper seems to be that the balancer will remove the upmaps over time, but since I'm still using the reweight-based balancer, I can't tell you if it really works that way. But since your misplacement is down to 0 as long as the upmaps are in place, the balancer will definitely do its work of mitigating nearfull OSDs.

And AFAIK, setting the *weight* of a new OSD to 0 should prevent it from causing any rebalancing. However, this is different from reweighting it to 0 (third vs. sixth column in `ceph osd tree`)! Also, I don't see any advantage of setting the weight to 0 over simply not creating the OSD yet.

Best of luck!

[1]: https://github.com/digitalocean/pgremapper

ceph-us...@hovr.anonaddy.com wrote:
> Hello all,
>
> I am in the process of adding and removing a number of OSDs in my cluster
> and I'm running in to some issues where it would be good to be able to
> control the system a bit better. I've tried the documentation and google-fu
> but have come up short.
>
> This is the background/scenario: I have a cluster that is/was working fine,
> had HEALTH_OK. I've added a number of new OSDs to the cluster, starting a lot
> of rebalancing. I also want to remove a number of OSDs from the cluster. Some
> of these OSDs have been marked out. The cluster has been rebalancing for more
> than two weeks and in state HEALTH_WARN.
>
> Inter-related issue 1
> While the cluster is rebalancing, I would like to prioritize migrating PGs
> from the OSDs that have been marked out. Even though they are marked as out,
> I can't stop them (down) and remove them (destroy/purge), since they still
> have remaining PGs. For instance, I've had about eight OSDs with between 3
> and 7 PGs remaining (ceph osd safe-to-destroy ) for over a week. As
> long as these handful of PGs are there, I can't remove those OSDs. I have set
> osd_max_backfills, osd_recovery_max_active, osd_recovery_single_start and
> osd_recovery_sleep on the particular OSDs with no apparent effect, i.e. the
> PGs are still remaining.
>
> Is there a way to prioritize particular OSDs/PGs for rebalancing?
>
> Inter-related issue 2
> An alternative would be to just destroy the almost empty OSDs anyway,
> creating recovery activity instead of rebalancing. It doesn't seem like the
> recovery activity is prioritized over the rebalancing activity.
>
> Is there a way to ensure recovery activities are prioritized over rebalancing
> activities?
>
> Inter-related issue 3
> I spun up another OSD, marked it as up and out. This caused many additional
> PGs to become misplaced. Stopping and destroying the new, empty OSD again
> changed the number of misplaced PGs (returning to the previous
> amount/percentage).
>
> Can I prevent this by reweighting the OSDs to 0 in addition to marking them
> as out, or are there any other ways of preventing an OSD marked out from
> impacting the balancing?
>
> Inter-related issue 4
> During rebalancing, several smaller OSDs have become near full. Then one
> became full (>95%). This changed the cluster from HEALTH_WARN to HEALTH_ERR,
> stopping client activities. Reweighting the OSD and the near full OSDs did
> not change the cluster status. In essence, as far as I have understood it,
> all the data is there and available, the cluster is in the process of a
> massive rebalancing, PGs on the full OSD were misplaced and supposed to be
> moved elsewhere (in any case after the manual reweighting), so there should
> be no reason for the cluster to go to ERR. Also as a consequence of the
> cluster rebalancing for a long time, the balancer module is prevented from
> reweighting OSDs which could have prevented the ERR state (if the reweighting
> had had an impact). My solution, which had to be performed by manual
> intervention, was to mark the full OSD as out. The cluster changed back to
> HEALTH_WARN, client operations resumed and the rebalancing could continue in
> the background.
>
> Is there another way to handle a situation like this (an OSD becomes full,
> while having misplaced PGs on it, blocking the cluster)?
>
> Apologies for so many questions in the same email! They are all part of the
> same management activity for me.
>
> Many thanks!
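A sketch of the pgremapper workflow mentioned above, based on its README [1]; the exact flags and argument formats should be checked against `pgremapper --help`, and the OSD IDs are placeholders:

    # Freeze the current data movement by upmapping misplaced PGs back to
    # wherever they currently sit.
    ceph osd set norebalance
    pgremapper cancel-backfill --yes
    ceph osd unset norebalance

    # Then remove only the upmaps that involve the OSDs you want to drain,
    # so backfill away from them starts first.
    pgremapper undo-upmaps 12 13 14 --yes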
[ceph-users] "Pending Backport" without "Backports" field
Hi everyone,

hope this is the right place to raise this issue. I stumbled upon a tracker issue [1] that has been stuck in state "Pending Backport" for 11 months without even a single backport issue being created - unusually long in my (limited) experience.

Upon investigation, I found that, according to the Tracker workflow [2], an issue that is pending backport should have its Backport field filled in to be processed by the Backports team. This particular ticket doesn't have that field set, so presumably that's why nobody created backport tickets. A quick search turns up 24 other tickets [3] that are pending backport but don't specify where the backports should go.

Is there someone who could "sweep up" such tickets regularly? Or am I misunderstanding the process?

Thanks for the great work,
Jan-Philipp

[1]: https://tracker.ceph.com/issues/45457
[2]: https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst#tracker-workflow
[3]: https://tracker.ceph.com/projects/ceph/issues?utf8=%E2%9C%93&set_filter=1&f[]=status_id&op[status_id]=%3D&v[status_id][]=14&f[]=cf_2&op[cf_2]=!*

PS: Apparently all ceph mailing lists silently drop mails from non-subscribers? Was this always the case?
[ceph-users] Re: Moving rbd-images across pools?
Hey Angelo,

what you're asking for is "Live Migration". https://docs.ceph.com/en/latest/rbd/rbd-live-migration/ says:

> The live-migration copy process can safely run in the background while the
> new target image is in use. There is currently a requirement to temporarily
> stop using the source image before preparing a migration when not using the
> import-only mode of operation. This helps to ensure that the client using
> the image is updated to point to the new target image.

Best regards,
Jan-Philipp
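A minimal sketch of the workflow from that page (pool and image names are placeholders; clients must stop using the source image before "prepare" unless import-only mode is used):

    # Link the source image to a new target image in the destination pool.
    rbd migration prepare oldpool/myimage newpool/myimage

    # Copy the data in the background; clients can already use the target image.
    rbd migration execute newpool/myimage

    # Once the copy is finished, remove the source and finalize the migration.
    rbd migration commit newpool/myimage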