[ceph-users] Re: cephadm host maintenance
Hello Steven,

Arguably, it should, but right now nothing is implemented to do so, and you'd have to manually run "ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw" before it would allow you to put the host in maintenance. It's non-trivial from a technical point of view to have cephadm do the switch automatically, because the cephadm instance is running on that active mgr: it would have to store somewhere that we wanted this host in maintenance, fail over the mgr itself, and then have the new cephadm instance pick up that request and act on it. Possible, but not something anyone has had a chance to implement.

FWIW, I do believe there are also plans to eventually add a playbook for a rolling reboot, or something of the sort, to https://github.com/ceph/cephadm-ansible. But for now, some sort of intervention to cause the failover before running the maintenance enter command is necessary.

Regards,
- Adam King

On Wed, Jul 13, 2022 at 11:02 AM Steven Goodliff <steven.goodl...@globalrelay.net> wrote:
> Hi,
>
> I'm trying to reboot a ceph cluster one instance at a time from an Ansible playbook which basically runs
>
>     cephadm shell ceph orch host maintenance enter
>
> and then reboots the instance and exits maintenance, but I get
>
>     ALERT: Cannot stop active Mgr daemon, Please switch active Mgrs with 'ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw'
>
> on one instance. Should cephadm handle the switch?
>
> thanks
> Steven Goodliff
> Global Relay
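A minimal sketch of the manual workaround described above, for use from a playbook or shell before entering maintenance. The hostname check via `ceph mgr stat` and the argument-less `ceph mgr fail` (supported in recent releases) are illustrative assumptions; adapt them to your environment, and wrap the commands in `cephadm shell` if the CLI isn't installed directly on the host.

    HOST=node2-cobj2-atdev1-nvan
    # If this host carries the active mgr, fail it over first so maintenance can proceed
    if ceph mgr stat | grep -q "$HOST"; then
        ceph mgr fail        # a standby mgr takes over and cephadm restarts there
        sleep 30             # give the new mgr/cephadm instance a moment to come up
    fi
    ceph orch host maintenance enter "$HOST"
    # ... reboot the host and wait for it to return ...
    ceph orch host maintenance exit "$HOST"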
[ceph-users] Re: cephadm host maintenance
This brings up a good follow-on: rebooting in general for OS patching. I have not been leveraging the maintenance mode function, as I found it was really no different from just setting noout and doing the reboot. I find that if the box is the active manager, the failover happens quickly, painlessly and automatically, and all the OSDs just show as missing and come back once the box is back from the reboot. Am I causing issues I may not be aware of? How is everyone handling patching reboots?

The only place I'm careful is the active MDS nodes, since that failover does cause a period of no I/O for the mounted clients. I generally fail those manually so I don't have to wait for the MDS to figure out an instance is gone and spin up a standby. Any tips or techniques until there is a more holistic approach?

Thanks!

On Wed, Jul 13, 2022 at 9:49 AM Adam King wrote:
> Hello Steven,
>
> Arguably, it should, but right now nothing is implemented to do so and you'd have to manually run the "ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw" before it would allow you to put the host in maintenance. [...]
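For reference, a minimal sketch of the noout-plus-manual-MDS-failover approach described at the top of this message. The filesystem name and rank are placeholders, and the exact failover behaviour depends on how many standby MDS daemons are configured; treat this as an illustration rather than a recommended procedure.

    # Before rebooting an OSD host: keep the cluster from marking its OSDs out
    ceph osd set noout

    # If the host also runs an active MDS, hand the rank to a standby first
    ceph fs status               # shows which daemon currently holds each rank
    ceph mds fail cephfs:0       # fail rank 0 of filesystem "cephfs"; a standby takes over

    # ... patch and reboot the host, wait for its OSDs to rejoin ...

    ceph osd unset noout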
[ceph-users] Radosgw issues after upgrade to 14.2.21
Hello,

We recently upgraded a 3-node cluster running Luminous 12.2.13 (Ceph repos) on Debian 9 to Nautilus v14.2.21 (Debian stable repo) on Debian 11. For the most part everything seems to be fine, with the exception of access to the buckets defined inside of RadosGW. Since the upgrade, users are getting 403 Access Denied when trying to list their objects and/or put new objects into a bucket.

We've attempted to re-apply the IAM policies defined on the bucket for the users, but that fails, even after taking ownership of the bucket with a newly created account's credentials via radosgw-admin. We've also added caps (buckets *, users *, policy *) to that newly created "admin" account, but that didn't help either in re-applying the IAM policies. What are we missing?

Richard Andrews
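A few radosgw-admin checks that are often useful when chasing unexpected 403s after an upgrade; the bucket and user names below are placeholders rather than anything from the report above, so adjust them before use.

    # Confirm who RGW thinks owns the bucket and what the user record looks like
    radosgw-admin bucket stats --bucket=mybucket
    radosgw-admin user info --uid=admin

    # Re-link the bucket (and optionally its objects) to the intended owner
    radosgw-admin bucket link --bucket=mybucket --uid=admin
    radosgw-admin bucket chown --bucket=mybucket --uid=admin

    # Grant the admin account the caps mentioned above
    radosgw-admin caps add --uid=admin --caps="buckets=*;users=*;policy=*"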
[ceph-users] rbd iostat requires pool specified
Hoping this may be trivial to point me towards, but I typically keep a background screen running `rbd perf image iostat` that shows all of the rbd devices with io, and how busy each disk may be at any given moment. Recently, after upgrading everything to the latest Octopus release (15.2.16), it no longer allows omitting the pool, which means I can't blend all rbd pools together into a single view.

How it used to appear:

> NAME              WR     RD   WR_BYTES   RD_BYTES     WR_LAT     RD_LAT
> rbd-ssd/app1     322/s   0/s  5.6 MiB/s      0 B/s    2.28 ms    0.00 ns
> rbd-ssd/app2     223/s   5/s  2.1 MiB/s  147 KiB/s    3.56 ms    1.12 ms
> rbd-hybrid/app3   76/s   0/s   11 MiB/s      0 B/s   16.61 ms    0.00 ns
> rbd-hybrid/app4   11/s   0/s  395 KiB/s      0 B/s   51.29 ms    0.00 ns
> rbd-hybrid/app5    3/s   0/s   74 KiB/s      0 B/s  151.54 ms    0.00 ns
> rbd-hybrid/app6    0/s   0/s   42 KiB/s      0 B/s   13.90 ms    0.00 ns
> rbd-hybrid/app7    0/s   0/s  2.4 KiB/s      0 B/s    1.70 ms    0.00 ns
>
> NAME              WR     RD   WR_BYTES   RD_BYTES     WR_LAT     RD_LAT
> rbd-ssd/app1     483/s   0/s  7.3 MiB/s      0 B/s    2.17 ms    0.00 ns
> rbd-ssd/app2     279/s   5/s  2.5 MiB/s   69 KiB/s    3.82 ms  516.30 us
> rbd-hybrid/app3  147/s   0/s   10 MiB/s      0 B/s    8.59 ms    0.00 ns
> rbd-hybrid/app6   10/s   0/s  425 KiB/s      0 B/s   75.79 ms    0.00 ns
> rbd-hybrid/app8    0/s   0/s  2.4 KiB/s      0 B/s    1.85 ms    0.00 ns

And how it behaves after the upgrade:

> $ uname -r && rbd --version && rbd perf image iostat
> 5.4.0-107-generic
> ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> rbd: mgr command failed: (2) No such file or directory: [errno 2] RADOS object not found (Pool 'rbd' not found)

This is Ubuntu 20.04, using packages rather than cephadm. I do not have a pool named `rbd`, so that error is technically correct, but I have a handful of pools with the rbd application set:

> $ for pool in rbd-{ssd,hybrid,ec82} ; do ceph osd pool application get $pool ; done
> {
>     "rbd": {}
> }
> {
>     "rbd": {}
> }
> {
>     "rbd": {}
> }

Looking at the help output, it doesn't seem to imply that the pool-spec is optional, and it won't take wildcard globs like `rbd*` for the pool name:

> $ rbd help perf image iostat
> usage: rbd perf image iostat [--pool <pool>] [--namespace <namespace>]
>                              [--iterations <iterations>] [--sort-by <sort-by>]
>                              [--format <format>] [--pretty-format]
>                              <pool-spec>
>
> Display image IO statistics.
>
> Positional arguments
>   <pool-spec>                pool specification
>                              (example: <pool-name>[/<namespace>]
>
> Optional arguments
>   -p [ --pool ] arg          pool name
>   --namespace arg            namespace name
>   --iterations arg           iterations of metric collection [> 0]
>   --sort-by arg (=write_ops) sort-by IO metric (write-ops, read-ops,
>                              write-bytes, read-bytes, write-latency,
>                              read-latency) [default: write-ops]
>   --format arg               output format (plain, json, or xml) [default: plain]
>   --pretty-format            pretty formatting (json and xml)

Setting a pool name to one of my rbd pools, either as the pool-spec or with -p/--pool, works, but obviously only for that pool and not for *all* rbd pools as it functioned previously, in what appears to have been 15.2.13. I didn't see a PR mentioned in the 15.2.14-16 release notes that seemed to mention changes to rbd that would affect this, but I could have glossed over something. Appreciate any pointers.

Thanks,
Reed
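One possible workaround, sketched here rather than taken from any official tooling: loop over the pools that have the rbd application enabled and start one iostat per pool. The grep against the JSON output mirrors the `ceph osd pool application get` output shown above; run each instance in its own screen/tmux window or background them in one shell.

    for pool in $(ceph osd pool ls); do
        if ceph osd pool application get "$pool" | grep -q '"rbd"'; then
            rbd perf image iostat --pool "$pool" &    # one view per rbd-enabled pool
        fi
    done
    wait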
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi!

My apologies for butting in. Please confirm that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't require OSDs to be stopped or rebuilt?

Best regards,
Zakhar

On Tue, 12 Jul 2022 at 14:46, Dan van der Ster wrote:
> Hi Igor,
>
> Thank you for the reply and information.
> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd 65537` correctly defers writes in my clusters.
>
> Best regards,
> Dan
>
> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov wrote:
> > Hi Dan,
> >
> > I can confirm this is a regression introduced by https://github.com/ceph/ceph/pull/42725.
> >
> > Indeed strict comparison is a key point in your specific case, but generally it looks like this piece of code needs more redesign to better handle fragmented allocations (and issue a deferred write for every short enough fragment independently).
> >
> > So I'm looking for a way to improve that at the moment. I will fall back to the trivial comparison fix if I fail to find a better solution.
> >
> > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed, but I'd prefer not to raise it as high as 128K, to avoid too many writes being deferred (and hence overburdening the DB).
> >
> > IMO setting the parameter to 64K+1 should be fine.
> >
> > Thanks,
> > Igor
> >
> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> > > Hi Igor and others,
> > >
> > > (apologies for html, but i want to share a plot ;) )
> > >
> > > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something is very wrong with deferred writes in pacific. Here is an example cluster, upgraded today:
> > >
> > > [latency plot omitted]
> > >
> > > The OSDs are 12TB HDDs, formatted in nautilus with the default bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
> > >
> > > I found that the performance issue is because 4kB writes are no longer deferred from those pre-pacific hdds to flash in pacific with the default config! Here are example bench writes from both releases: https://pastebin.com/raw/m0yL1H9Z
> > >
> > > I worked out that the issue is fixed if I set bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default; note the default was 32k in octopus).
> > >
> > > I think this is related to the fixes in https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than" condition prevents deferred writes from ever happening.
> > >
> > > So I think this would impact anyone upgrading clusters with hdd/ssd mixed osds ... surely we must not be the only clusters impacted by this?!
> > >
> > > Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB, or is there in fact a bug here?
> > >
> > > Best Regards,
> > > Dan
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Yes, that is correct. No need to restart the OSDs.

.. Dan

On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko wrote:
> Hi!
>
> My apologies for butting in. Please confirm that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't require OSDs to be stopped or rebuilt?
>
> Best regards,
> Zakhar
> [...]
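As a concrete illustration: the 65537 (64K+1) value and the single-threaded rados bench latency probe both come from earlier messages in this thread, while the `ceph config get`/`ceph config show` verification steps and the "test" pool name are just one assumed way to confirm the change took effect at runtime.

    # Apply the workaround to all OSDs at runtime; no restart or rebuild needed
    ceph config set osd bluestore_prefer_deferred_size_hdd 65537

    # Check the stored value and what a running OSD actually uses
    ceph config get osd bluestore_prefer_deferred_size_hdd
    ceph config show osd.0 bluestore_prefer_deferred_size_hdd

    # Re-run the latency probe from the original report to confirm small writes are deferred again
    rados bench -p test 10 write -b 4096 -t 1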
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Many thanks, Dan. Much appreciated!

/Z

On Thu, 14 Jul 2022 at 08:43, Dan van der Ster wrote:
> Yes, that is correct. No need to restart the OSDs.
>
> .. Dan
> [...]
[ceph-users] Re: CephFS snapshots with samba shadowcopy
Hi,

I am providing CephFS snapshots via Samba with the shadow_copy2 VFS object. I am running CentOS 7 with smbd 4.10.16, for which ceph_snapshots is not available AFAIK.

Snapshots are created by a cronjob above the root of my shares with:

    export TZ=GMT
    mkdir /cephfs/path/.snap/`date +@GMT-%Y.%m.%d-%H.%M.%S`

i.e. the exported shares are subfolders of the folder in which I create the snapshots.

Samba configuration is:

    [global]
    ...
    shadow:snapdir = .snap
    shadow:snapdirseverywhere = yes
    shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_some-inode-number
    ...

    [sharename]
    ...
    path = /cephfs/path_to_main_root/share
    vfs object = shadow_copy2
    ...

    [other_share_with_different_root]
    ...
    path = /cephfs/path_to_different_root/other_share
    vfs object = shadow_copy2
    shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_other-inode-number

The inode numbers in the configuration are of course the inode numbers of the directory containing the snapshots.

Cheers
Sebastian

On 13.07.22 02:08, Bailey Allison wrote:
> Hi All,
>
> Curious if anyone is making use of samba shadowcopy with CephFS snapshots using the vfs object ceph_snapshots? I've had wildly different results: on an Ubuntu 20.04 LTS samba server the snaps just do not appear at all within shadowcopy, and on a Rocky Linux samba server the snaps do appear within shadowcopy but contain absolutely no files when opened.
>
> Both the Ubuntu and Rocky samba servers are sharing out a kernel cephfs mount via samba; the ceph version is 17.2.1, and the samba version is 4.13.7 on Ubuntu 20.04 and 4.15.5 on Rocky Linux. I have also tried using a samba fuse mount with vfs_ceph, with the same results.
>
> More so just curious to see if anyone on the list has had success with making use of the ceph_snapshots vfs object and if they can share how it has worked for them. Included below is the share config for both Ubuntu and Rocky if anyone is curious:
>
> Ubuntu 20.04 LTS
>
>     [public]
>     force group = nogroup
>     force user = nobody
>     guest ok = Yes
>     path = /mnt/cephfs/public
>     read only = No
>     vfs objects = ceph_snapshots
>
> Rocky Linux
>
>     [public]
>     force group = nogroup
>     force user = nobody
>     guest ok = Yes
>     path = /mnt/cephfs/public
>     read only = No
>     vfs objects = ceph_snapshots
>
> Regards,
> Bailey

--
Dr. Sebastian Knust      | Bielefeld University
IT Administrator         | Faculty of Physics
Office: D2-110           | Universitätsstr. 25
Phone: +49 521 106 5234  | 33615 Bielefeld
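For completeness, a sketch of how such a snapshot job could be scheduled from /etc/crontab; the path and hourly schedule are assumptions rather than Sebastian's actual cron entry, and the % signs must be escaped because cron treats an unescaped % as a line separator.

    # /etc/crontab: create an hourly CephFS snapshot named with the @GMT prefix
    # that shadow_copy2's shadow:format expects
    0 * * * * root TZ=GMT mkdir /cephfs/path/.snap/$(date +@GMT-\%Y.\%m.\%d-\%H.\%M.\%S)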
[ceph-users] MGR permissions question
Hi,

we have discovered this solution for CSI plugin permissions: https://github.com/ceph/ceph-csi/issues/2687#issuecomment-1014360244

We are not sure of the implications of adding the mgr permissions to the (non-admin) user, and the documentation seems to be sparse on this topic. Is it OK to give a limited user blanket mgr permissions, or can we restrict them?

Thanks

Best
Robert
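As a hedged illustration of what a more restricted cap set could look like: the client name and pool below are placeholders, and the pool-scoped `profile rbd` form for mgr caps is only available in recent releases, so check your version's documentation before relying on it.

    # Inspect the caps the CSI user currently has
    ceph auth get client.csi-rbd-node

    # Example: rbd profiles for mon/osd plus a pool-scoped rbd profile for mgr,
    # instead of a blanket mgr 'allow rw'
    ceph auth caps client.csi-rbd-node \
        mon 'profile rbd' \
        osd 'profile rbd pool=kubernetes' \
        mgr 'profile rbd pool=kubernetes'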
[ceph-users] Re: size=1 min_size=0 any way to set?
Just go straight to setting size=1 and min_size=1. Setting min_size to 0 does not make any sense.

huxia...@horebdata.cn

From: Szabo, Istvan (Agoda)
Date: 2022-07-13 11:38
To: ceph-users@ceph.io
Subject: [ceph-users] size=1 min_size=0 any way to set?

Hi,

Is there a way to set this? Yes, I know it means immediate data loss, but the data is not important and can be reproduced easily, so I would like to set it temporarily. However, ceph doesn't allow it:

    Error EINVAL: pool min_size must be between 1 and size, which is set to 1

Anybody know any way?

Thank you
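For reference, a sketch of how that is typically applied; the pool name is a placeholder, and, to the best of my knowledge, since Octopus shrinking a pool to size 1 additionally requires enabling mon_allow_pool_size_one and passing --yes-i-really-mean-it.

    # Allow pools with a single replica (disabled by default)
    ceph config set global mon_allow_pool_size_one true

    # Keep a single copy of the data and allow I/O with that single copy available
    ceph osd pool set mypool size 1 --yes-i-really-mean-it
    ceph osd pool set mypool min_size 1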
[ceph-users] Re: size=1 min_size=0 any way to set?
As far as I know, one can read and write with it.

huxia...@horebdata.cn

From: Szabo, Istvan (Agoda)
Date: 2022-07-13 11:49
To: huxia...@horebdata.cn
CC: ceph-users
Subject: RE: [ceph-users] size=1 min_size=0 any way to set?

But that one makes the pool read-only, I guess, right?

Istvan Szabo
Senior Infrastructure Engineer
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
[...]
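If in doubt, a quick way to verify that a size=1/min_size=1 pool accepts both reads and writes; the pool and object names are placeholders.

    echo "hello" > /tmp/testfile
    rados -p mypool put testobj /tmp/testfile   # write an object
    rados -p mypool get testobj -               # read it back to stdout
    rados -p mypool rm testobj                  # clean up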
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Is this something that makes sense to do the 'quick' fix on for the next pacific release, to minimize impact to users until the improved iteration can be implemented?

On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov wrote:
> Hi Dan,
>
> I can confirm this is a regression introduced by https://github.com/ceph/ceph/pull/42725.
>
> Indeed strict comparison is a key point in your specific case, but generally it looks like this piece of code needs more redesign to better handle fragmented allocations (and issue a deferred write for every short enough fragment independently).
>
> So I'm looking for a way to improve that at the moment. I will fall back to the trivial comparison fix if I fail to find a better solution.
> [...]
[ceph-users] cephadm host maintenance
Hi,

I'm trying to reboot a ceph cluster one instance at a time from an Ansible playbook which basically runs

    cephadm shell ceph orch host maintenance enter

and then reboots the instance and exits maintenance, but I get

    ALERT: Cannot stop active Mgr daemon, Please switch active Mgrs with 'ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw'

on one instance. Should cephadm handle the switch?

thanks
Steven Goodliff
Global Relay
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Maybe. My plan is to attempt a general fix and, if that doesn't work out within a short time frame, publish a 'quick' one.

On 7/13/2022 4:58 PM, David Orman wrote:
> Is this something that makes sense to do the 'quick' fix on for the next pacific release, to minimize impact to users until the improved iteration can be implemented?
> [...]