[ceph-users] OSD bootstrap time
Hi everyone,

recently I'm noticing that starting OSDs for the first time takes ages (more than an hour) before they are even picked up by the monitors as "up" and start backfilling. I'm not entirely sure whether this is a new phenomenon or whether it has always been that way. Either way, I'd like to understand why.

When I execute `ceph daemon osd.X status`, it reports "state: preboot" and I can see "newest_map" increase slowly. Apparently, a new OSD doesn't just fetch the latest OSD map and get to work, but instead fetches hundreds of thousands of OSD maps from the mon, burning CPU while parsing them.

I wasn't able to find any good documentation on the OSDMap, in particular why its historical versions need to be kept and why the OSD seemingly needs so many of them. Can anybody point me in the right direction? Or is something wrong with my cluster?

Best regards,
Jan-Philipp Litza
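For reference, a minimal sketch of how to watch that catch-up from the OSD host, assuming the admin socket is reachable and `jq` is installed (osd.12 is just a placeholder):

    # Compare the OSD's newest known map epoch with the cluster's current one.
    CLUSTER_EPOCH=$(ceph osd dump -f json | jq .epoch)
    while true; do
        OSD_EPOCH=$(ceph daemon osd.12 status | jq .newest_map)
        echo "osd.12 has map ${OSD_EPOCH} of ${CLUSTER_EPOCH}"
        sleep 10
    done

Once newest_map reaches the cluster epoch, the OSD should leave preboot and get marked up.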
[ceph-users] Re: OSD bootstrap time
Hi Rich,

> I've noticed this a couple of times on Nautilus after doing some large
> backfill operations. It seems the osd map doesn't clear properly after
> the cluster returns to Health OK and builds up on the mons. I do a
> "du" on the mon folder e.g. du -shx /var/lib/ceph/mon/ and this shows
> several GB of data.

It does, almost 8 GB for <300 OSDs, which has increased several-fold over the last weeks (since we started upgrading Nautilus->Pacific). However, I didn't think much of it after reading in the hardware recommendations [1] that at least 60 GB should be provisioned per ceph-mon.

> I give all my mgrs and mons a restart and after a few minutes I can
> see this osd map data getting purged from the mons. After a while it
> should be back to a few hundred MB (depending on cluster size).
> This may not be the problem in your case, but an easy thing to try.
> Note, if your cluster is being held in Warning or Error by something
> this can also explain the osd maps not clearing. Make sure you get the
> cluster back to health OK first.

Thanks for the suggestion, will try that once we reach HEALTH_OK.

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations
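A sketch of the check-and-restart routine Rich describes, assuming a plain systemd deployment with units named ceph-mon@<hostname> (cephadm/containerized setups name their units differently, and mon IDs may be letters like a, b, c in older clusters):

    # Check how large each mon store has grown (run on a mon host).
    du -shx /var/lib/ceph/mon/*

    # Restart this host's mon; do it host by host, waiting for quorum in between.
    systemctl restart ceph-mon@$(hostname -s)

    # Optionally ask the mon to compact its store afterwards.
    ceph tell mon.$(hostname -s) compact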
[ceph-users] Re: OSD bootstrap time
Hi Konstantin,

I mean freshly deployed OSDs. Restarted OSDs don't exhibit that behavior.

Best regards,
Jan-Philipp
[ceph-users] Re: stretched cluster or not, with mon in 3 DC and osds on 2 DC
Hi,

since I just read that documentation page [1] on Friday, I can't tell you anything that isn't on that page. But that particular problem of which monitor gets elected should be solvable simply by using the connectivity election mode [2], shouldn't it?

Apart from the latency to the mon, a stretch cluster is mainly about the failover behavior of the OSDs: when DC1 or DC2 fails, a cluster without stretch mode will try to replicate all the data within the surviving DC to reach size=4 again. With stretch mode, it will happily live with size=2 until the other DC comes back online.

So if it's reasonable to assume that a failed DC - god forbid - comes back online reasonably soon, and that the cluster can live with size=2 during that phase, then a stretch cluster is probably the better choice.

Also, as the documentation states, there are edge cases where, even with an appropriate CRUSH rule, size=4 min_size=2 doesn't necessarily mean you have a live copy of every PG in each of the two DCs.

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
[2]: https://docs.ceph.com/en/latest/rados/operations/change-mon-elections/
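For reference, a sketch of the commands involved, following [1] and [2]; mon names (a, b, c), DC names and the CRUSH rule name are placeholders, and the stretch rule itself has to be created first as shown in [1]:

    # Switch the monitors to the connectivity election strategy [2].
    ceph mon set election_strategy connectivity

    # Tell each monitor where it lives; the mon in the third DC is the tiebreaker.
    ceph mon set_location a datacenter=dc1
    ceph mon set_location b datacenter=dc2
    ceph mon set_location c datacenter=dc3

    # With a CRUSH rule "stretch_rule" that places two copies per DC,
    # enable stretch mode with mon c as the tiebreaker [1].
    ceph mon enable_stretch_mode c stretch_rule datacenter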
[ceph-users] Re: OSD bootstrap time
Hi again,

turns out the long bootstrap time was my own fault. I had some down&out OSDs for quite a long time, which prevented the monitors from pruning the OSD maps. Makes sense when I think about it, but I didn't before.

Rich's hint to get the cluster to HEALTH_OK first pointed me in the right direction, as did the docs on full OSDMap version pruning [1], which mention the constraints in OSDMonitor::get_trim_to(). So I destroyed the OSDs (they don't hold any data anyway) and the mons' DBs shrank by almost 8 GB to only ~160 MB.

Thanks for helping figure this out! I promise not to have lingering down&out OSDs anymore. ;-)

Best regards,
Jan-Philipp

[1]: https://docs.ceph.com/en/latest/dev/mon-osdmap-prune/
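For anyone in the same spot, a minimal sketch of getting rid of such lingering OSDs (OSD id 12 is a placeholder; only do this for OSDs that really hold no needed data):

    # Verify the OSD can be removed without risking data.
    ceph osd safe-to-destroy 12

    # Remove it completely (CRUSH entry, auth key, OSD id). Once the cluster
    # is back to HEALTH_OK, the mons can trim old OSD maps again.
    ceph osd purge 12 --yes-i-really-mean-it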
[ceph-users] Re: Spurious Read Errors: 0x6706be76
Hi Jay,

I'm having the same problem; the setting doesn't affect the warning at all. I'm currently muting the warning every week or so (because it doesn't even seem to be present consistently, and every time it disappears for a moment, the mute is cancelled) with

    ceph health mute BLUESTORE_SPURIOUS_READ_ERRORS

Best regards,
Jan-Philipp
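If the mute keeps getting cancelled whenever the warning briefly clears, the sticky variant might help; a sketch (the one-week TTL is an arbitrary example):

    # --sticky keeps the mute in place even if the alert clears and comes back.
    ceph health mute BLUESTORE_SPURIOUS_READ_ERRORS 1w --sticky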
[ceph-users] Re: samba cephfs
That package probably contains the vfs_ceph module for Samba. However, further down, the same page says:

> The above share configuration uses the Linux kernel CephFS client, which is
> recommended for performance reasons.
> As an alternative, the Samba vfs_ceph module can also be used to communicate
> with the Ceph cluster.

So when you use a kernel mount, you shouldn't need the package at all.
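A rough sketch of the kernel-mount variant, with the mon address, client name, secret file and paths all being placeholders:

    # Mount CephFS with the kernel client.
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=samba,secretfile=/etc/ceph/samba.secret

    # Then export the mounted path like any local directory, without
    # "vfs objects = ceph" in smb.conf:
    #   [share]
    #       path = /mnt/cephfs/share
    #       read only = no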
[ceph-users] Balancer vs. Autoscaler
Hi everyone,

I had the autoscale_mode set to "on" and the autoscaler went to work and started adjusting the number of PGs in one of our pools. Since this implies a huge shift in data, the reweights that the balancer had carefully adjusted (in crush-compat mode) are now rubbish, and more and more OSDs are becoming nearfull (we sadly have very differently sized OSDs).

Now apparently both manager modules, balancer and pg_autoscaler, have the same threshold for operation, namely target_max_misplaced_ratio. So the balancer won't become active as long as the pg_autoscaler is still adjusting the number of PGs.

I already set the autoscale_mode to "warn" on all pools, but apparently the autoscaler is determined to finish what it started.

Is there any way to pause the autoscaler so the balancer has a chance of fixing the reweights? Because even in manual mode (ceph balancer optimize), the balancer won't compute a plan when the misplaced ratio is higher than target_max_misplaced_ratio.

I know about "ceph osd reweight-*", but those adjust the reweights (visible in "ceph osd tree"), whereas the balancer adjusts the "compat weight-set", which I don't know how to convert back to the old-style reweights.

Best regards,
Jan-Philipp
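For context, a few commands that show what the two mgr modules are up to and the shared threshold (nothing here changes any state):

    # What the autoscaler still wants to do per pool.
    ceph osd pool autoscale-status

    # Whether the balancer considers itself blocked, and the shared threshold.
    ceph balancer status
    ceph config get mgr target_max_misplaced_ratio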
[ceph-users] Re: Balancer vs. Autoscaler
I'll have to do some reading on what "pgp" means, but you are correct: the pg_num is already equal to pg_num_target, and only pgp_num is increasing (halfway there - at least that's something). Thanks for the suggestions, though not really applicable here!

Richard Bade wrote:
> If you look at the current pg_num in that pool ls detail command that
> Dan mentioned you can set the pool pg_num to what that value currently
> is, which will effectively pause the pg changes. I did this recently
> when decreasing the number of pg's in a pool, which took several weeks
> to complete. This let me get some other maintenance done before
> setting the pg_num back to the target num again.
> This works well for reduction, but I'm not sure if it works well for
> increase as I think the pg_num may reach the target much faster and
> then just the pgp_num changes till they match.
>
> Rich
>
> On Wed, 22 Sept 2021 at 23:06, Dan van der Ster wrote:
>>
>> To get an idea how much work is left, take a look at `ceph osd pool ls
>> detail`. There should be pg_num_target... The osds will merge or split PGs
>> until pg_num matches that value.
>>
>> .. Dan
>>
>> On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote:
>>
>>> Hi everyone,
>>>
>>> I had the autoscale_mode set to "on" and the autoscaler went to work and
>>> started adjusting the number of PGs in that pool. Since this implies a
>>> huge shift in data, the reweights that the balancer had carefully
>>> adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
>>> become nearful (we sadly have very different sized OSDs).
>>>
>>> Now apparently both manager modules, balancer and pg_autoscaler, have
>>> the same threshold for operation, namely target_max_misplaced_ratio. So
>>> the balancer won't become active as long as the pg_autoscaler is still
>>> adjusting the number of PGs.
>>>
>>> I already set the autoscale_mode to "warn" on all pools, but apparently
>>> the autoscaler is determined to finish what it started.
>>>
>>> Is there any way to pause the autoscaler so the balancer has a chance of
>>> fixing the reweights? Because even in manual mode (ceph balancer
>>> optimize), the balancer won't compute a plan when the misplaced ratio is
>>> higher than target_max_misplaced_ratio.
>>>
>>> I know about "ceph osd reweight-*", but they adjust the reweights
>>> (visible in "ceph osd tree"), whereas the balancer adjusts the "compat
>>> weight-set", which I don't know how to convert back to the old-style
>>> reweights.
>>>
>>> Best regards,
>>> Jan-Philipp

--
Jan-Philipp Litza

PLUTEX GmbH
Hermann-Ritter-Str. 108
28197 Bremen

Hotline: 0800 100 400 800
Telefon: 0800 100 400 821
Telefax: 0800 100 400 888
E-Mail: supp...@plutex.de
Internet: http://www.plutex.de

USt-IdNr.: DE 815030856
Handelsregister: Amtsgericht Bremen, HRB 25144
Geschäftsführer: Torben Belz, Hendrik Lilienthal
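A sketch of the "pause by pinning" trick Rich describes (pool name and values are placeholders; whether an in-flight pgp_num increase can be halted this way may depend on the release):

    # See the current pg_num / pgp_num and their targets for the pool.
    ceph osd pool ls detail | grep mypool

    # Pin them to the values they have right now to stop further changes,
    # then set them back to the intended target later.
    ceph osd pool set mypool pg_num 512
    ceph osd pool set mypool pgp_num 512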
[ceph-users] Re: Questions about tweaking ceph rebalancing activities
You are basically listing all the reasons one shouldn't have too much misplacement at once. ;-)

Your best bet probably is pgremapper [1], which I recently learned about on this list. With `cancel-backfill` you could stop any running backfill, and with `undo-upmaps` you could then specifically start backfilling for those OSDs you want to destroy. The idea behind pgremapper seems to be that the balancer will remove the upmaps over time, but since I'm still using the reweight-based balancer, I can't tell you if it really works that way. But since your misplacement is down to 0 as long as the upmaps are in place, the balancer will definitely do its work of mitigating nearfull OSDs.

And AFAIK, setting the *weight* of a new OSD to 0 should prevent it from causing any rebalancing. However, this is different from reweighting it to 0 (third vs. sixth column in `ceph osd tree`)! Also, I don't see any advantage of setting the weight to 0 over simply not creating the OSD yet.

Best of luck!

[1]: https://github.com/digitalocean/pgremapper

ceph-us...@hovr.anonaddy.com wrote:
> Hello all,
>
> I am in the process of adding and removing a number of OSDs in my cluster
> and I'm running in to some issues where it would be good to be able to
> control the system a bit better. I've tried the documentation and google-fu
> but have come up short.
>
> This is the background/scenario: I have a cluster that is/was working fine,
> had HEALTH_OK. I've added a number of new OSDs to the cluster, starting a lot
> of rebalancing. I also want to remove a number of OSDs from the cluster. Some
> of these OSDs have been marked out. The cluster has been rebalancing for more
> than two weeks and in state HEALTH_WARN.
>
> Inter-related issue 1
> While the cluster is rebalancing, I would like to prioritize migrating PGs
> from the OSDs that have been marked out. Even though they are marked as out,
> I can't stop them (down) and remove them (destroy/purge), since they still
> have remaining PGs. For instance, I've had about eight OSDs with between 3
> and 7 PGs remaining (ceph osd safe-to-destroy ) for over a week. As
> long as these handful of PGs are there, I can't remove those OSDs. I have set
> osd_max_backfills, osd_recovery_max_active, osd_recovery_single_start and
> osd_recovery_sleep on the particular OSDs with no apparent effect, i.e. the
> PGs are still remaining.
>
> Is there a way to prioritize particular OSDs/PGs for rebalancing?
>
> Inter-related issue 2
> An alternative would be to just destroy the almost empty OSDs anyway,
> creating recovery activity instead of rebalancing. It doesn't seem like the
> recovery activity is prioritized over the rebalancing activity.
>
> Is there a way to ensure recovery activities are prioritized over rebalancing
> activities?
>
> Inter-related issue 3
> I spun up another OSD, marked it as up and out. This caused many additional
> PGs to become misplaced. Stopping and destroying the new, empty OSD again
> changed the number of misplaced PGs (returning to the previous
> amount/percentage).
>
> Can I prevent this by reweighting the OSDs to 0 in addition to marking them
> as out, or are there any other ways of preventing an OSD marked out from
> impacting the balancing?
>
> Inter-related issue 4
> During rebalancing, several smaller OSDs have become near full. Then one
> became full (>95%). This changed the cluster from HEALTH_WARN to HEALTH_ERR,
> stopping client activities. Reweighting the OSD and the near full OSDs did
> not change the cluster status. In essence, as far as I have understood it,
> all the data is there and available, the cluster is in the process of a
> massive rebalancing, PGs on the full OSD were misplaced and supposed to be
> moved elsewhere (in any case after the manual reweighting), so there should
> be no reason for the cluster to go to ERR. Also as a consequence of the
> cluster rebalancing for a long time, the balancer module is prevented from
> reweighting OSDs which could have prevented the ERR state (if the reweighting
> had had an impact). My solution, which had to be performed by manual
> intervention, was to mark the full OSD as out. The cluster changed back to
> HEALTH_WARN, client operations resumed and the rebalancing could continue in
> the background.
>
> Is there another way to handle a situation like this (an OSD becomes full,
> while having misplaced PGs on it, blocking the cluster)?
>
> Apologies for so many questions in the same email! They are all part of the
> same management activity for me.
>
> Many thanks!
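A sketch of the pgremapper workflow mentioned above, based on its README [1]; the exact flags and argument formats should be checked against `pgremapper --help`, and the OSD IDs are placeholders:

    # Freeze the current data movement by upmapping misplaced PGs back to
    # wherever they currently sit.
    ceph osd set norebalance
    pgremapper cancel-backfill --yes
    ceph osd unset norebalance

    # Then remove only the upmaps that involve the OSDs you want to drain,
    # so backfill away from them starts first.
    pgremapper undo-upmaps 12 13 14 --yes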
[ceph-users] "Pending Backport" without "Backports" field
Hi everyone,

hope this is the right place to raise this issue. I stumbled upon a tracker issue [1] that has been stuck in state "Pending Backport" for 11 months without even a single backport issue being created - unusually long in my (limited) experience.

Upon investigation, I found that, according to the Tracker workflow [2], an issue that is pending backport should have its Backport field filled in to be processed by the Backports team. This particular ticket doesn't have that field set, so presumably that's why nobody created backport tickets. A quick search turns up 24 other tickets [3] that are pending backport but don't specify where the backports should go.

Is there someone who could "sweep up" such tickets regularly? Or am I misunderstanding the process?

Thanks for the great work,
Jan-Philipp

[1]: https://tracker.ceph.com/issues/45457
[2]: https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst#tracker-workflow
[3]: https://tracker.ceph.com/projects/ceph/issues?utf8=%E2%9C%93&set_filter=1&f[]=status_id&op[status_id]=%3D&v[status_id][]=14&f[]=cf_2&op[cf_2]=!*

PS: Apparently all ceph mailing lists silently drop mails from non-subscribers? Was this always the case?
[ceph-users] Re: Moving rbd-images across pools?
Hey Angelo,

what you're asking for is "Live Migration". https://docs.ceph.com/en/latest/rbd/rbd-live-migration/ says:

> The live-migration copy process can safely run in the background while the
> new target image is in use. There is currently a requirement to temporarily
> stop using the source image before preparing a migration when not using the
> import-only mode of operation. This helps to ensure that the client using
> the image is updated to point to the new target image.

Best regards,
Jan-Philipp
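A minimal sketch of the workflow from that page (pool and image names are placeholders; clients must stop using the source image before "prepare" unless import-only mode is used):

    # Link the source image to a new target image in the destination pool.
    rbd migration prepare oldpool/myimage newpool/myimage

    # Copy the data in the background; clients can already use the target image.
    rbd migration execute newpool/myimage

    # Once the copy is finished, remove the source and finalize the migration.
    rbd migration commit newpool/myimage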