[ceph-users] Re: has anyone enabled bdev_enable_discard?
I came across an enterprise NVMe used for the BlueFS DB whose performance dropped sharply after a few months in service (I won't mention the brand here, but it was not among these three: Intel, Samsung, Micron). It is clear that enabling bdev_enable_discard impacted performance, but this option also saved the platform after a few days of discarding.

IMHO the most important thing is to validate the behaviour once the entire flash media has been written to. But this option has the merit of existing.

It seems to me that the ideal would be not to have several bdev_*discard options, and for this task to be asynchronous, with the (D)iscard instructions issued during a calmer period of activity (I see no impact if the instructions are lost during an OSD reboot).

On Fri, 1 Mar 2024 at 19:17, Igor Fedotov wrote:
> I played with this feature a while ago and recall it had a visible
> negative impact on user operations due to the need to submit tons of
> discard operations - effectively each data overwrite operation triggers
> one or more discard operation submissions to disk.
>
> And I doubt this has been widely used, if at all.
>
> Nevertheless, recently we've got a PR to rework some aspects of thread
> management for this stuff, see https://github.com/ceph/ceph/pull/55469
>
> The author claimed they needed this feature for their cluster, so you
> might want to ask them about their user experience.
>
> W.r.t. documentation - there are actually just two options:
>
> - bdev_enable_discard - enables issuing discard to disk
>
> - bdev_async_discard - instructs whether discard requests are issued
> synchronously (along with disk extent release) or asynchronously (using
> a background thread).
>
> Thanks,
>
> Igor
>
> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> > Is there any update on this? Did someone test the option, and does
> > anyone have performance values from before and after?
> > Is there any good documentation regarding this option?
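As a point of reference for the two options Igor lists, they can be set through the centralized config; a minimal sketch only (cluster-wide here purely for illustration, and depending on the release the OSDs may need a restart to pick the change up):

# Sketch only: enable discard, and make it asynchronous, for all OSDs
$ ceph config set osd bdev_enable_discard true
$ ceph config set osd bdev_async_discard true
# Check what a given OSD actually resolved the option to
$ ceph config get osd.0 bdev_enable_discard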
[ceph-users] Re: has anyone enabled bdev_enable_discard?
We've had a specific set of drives for which we've had to enable bdev_enable_discard and bdev_async_discard in order to maintain acceptable performance on block clusters. I wrote the patch that Igor mentioned in order to try to send more parallel discards to the devices, but these particular drives seem to process them serially (based on the observed discard counts and latency at the device), which is unfortunate. We're also testing new firmware that should help alleviate some of the initial concerns about discards not keeping up, which prompted the patch in the first place.

Most of our drives do not need discards enabled (and definitely not without async) to maintain performance, unless we're doing a full-disk fio test or something similar where we're trying to find the drive's cliff profile. We've used OSD device classes to target the options at specific OSDs via the centralized config, which helps when we add new hosts that may carry different drives, so that the options aren't applied globally.

Based on our experience, I wouldn't enable it unless you're seeing some sort of cliff-like behaviour as your OSDs run low on free space or become heavily fragmented. I would also deem bdev_async_discard = 1 a requirement so that discards don't block user IO. Keep an eye on the discards being sent to the devices and on the discard latency as well (via node_exporter, for example).

Matt

On 2024-03-02 06:18, David C. wrote:
> [snip]
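A sketch of the device-class targeting Matt describes, using the class mask accepted by the centralized config (the class name "nvme" is only an example; use whatever class the affected drives actually report):

# Apply the options only to OSDs whose CRUSH device class is "nvme" (example class)
$ ceph config set osd/class:nvme bdev_enable_discard true
$ ceph config set osd/class:nvme bdev_async_discard true
# Rough monitoring idea via node_exporter's diskstats metrics (availability depends
# on kernel and exporter versions):
#   rate(node_disk_discards_completed_total[5m])   - discards completed by the device
#   rate(node_disk_discard_time_seconds_total[5m]) - time spent discarding (latency proxy)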
[ceph-users] Re: has anyone enabled bdev_enable_discard?
Could we not consider setting up a "bluefstrim" that could be orchestrated?

That would avoid having a continuous stream of (D)iscard instructions hitting the disks during activity.

A weekly (or probably even monthly) bluefstrim would likely be enough for the platforms that really need it.

On Sat, 2 Mar 2024 at 12:58, Matt Vandermeulen wrote:
> [snip]
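To make the idea concrete, the orchestration could be as small as a cron entry per host run during a quiet window. Note that the command below is purely hypothetical - nothing named bluefstrim exists in Ceph today; it only sketches the shape of David's proposal:

# HYPOTHETICAL: "ceph osd bluefstrim" does not exist; it stands in for the proposed
# periodic trim of free BlueStore/BlueFS extents, run here on Sundays at 03:00
0 3 * * 0  root  /usr/bin/ceph osd bluefstrim '*'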
[ceph-users] Re: has anyone enabled bdev_enable_discard?
Periodic discard was actually attempted in the past: https://github.com/ceph/ceph/pull/20723

A proper implementation would probably need appropriate scheduling/throttling that can be tuned so as to balance against client I/O impact.

Josh

On Sat, Mar 2, 2024 at 6:20 AM David C. wrote:
> [snip]
[ceph-users] Question about erasure coding on cephfs
Hi Y'all,

We have a new Ceph cluster online that looks like this:

md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs

The CephFS storage is using erasure coding at 4:2, and the CRUSH failure domain is set to "osd". (I know that's not optimal, but let me get to that in a minute.)

We have a current regular single NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks). We want to wipe the NFS server and integrate it into the above Ceph cluster as "store-03". When we do that, we would have three OSD servers, and we would then switch the CRUSH failure domain to "host".

My question is this: given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03, such that if a single OSD server went down, the other two would be enough to keep the system online? That is, with 4:2 erasure coding, would 2 shards go on store-01, 2 shards on store-02, and 2 shards on store-03? Is that how I should understand it?

Thanks for any insight!

-erich
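For anyone following the shard math, the profile behind the data pool spells out k, m, and the failure domain; a sketch (the profile name and output are illustrative and abridged):

# Inspect the EC profile backing the CephFS data pool (profile name is an example)
$ ceph osd erasure-code-profile get ec-4-2
crush-failure-domain=osd
k=4
m=2
plugin=jerasure
...
# k+m = 6 shards per object; with crush-failure-domain=host the default EC rule wants
# 6 distinct hosts, which a 3-host cluster cannot provide without a custom rule.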
[ceph-users] Re: Question about erasure coding on cephfs
> On Mar 2, 2024, at 10:37 AM, Erich Weiler wrote:
>
> [snip]
>
> My question is this: Given that we have 4:2 erasure coding, would the data
> rebalance evenly across the three OSD servers after we add store-03 such that
> if a single OSD server went down, the other two would be enough to keep the
> system online? Like, with 4:2 erasure coding, would 2 shards go on store-01,
> then 2 shards on store-02, and then 2 shards on store-03? Is that how I
> understand it?

Nope. If the failure domain is *host*, without a carefully-crafted special CRUSH rule, CRUSH will want to spread the 6 shards over 6 failure domains, and you will only have 3. I don't remember for sure whether the PGs would be stuck remapped or stuck unable to activate, but either way you would have a very bad day.

Say you craft a CRUSH rule that places two shards on each host. One host goes down, and you have at most K shards up. IIRC the PGs will be `inactive`, but you won't lose existing data.

Here are multiple reasons why, for small deployments, I favor 1U servers:

* Having enough servers so that if one is down, service can proceed
* Having enough failure domains to do EC - or at least replication - safely
* Is your networking a bottleneck?

Assuming these are QLC SSDs, do you have their min_alloc_size set to match the IU? Ideally you would mix in a couple of TLC OSDs on each server - in this case including the control plane - for the CephFS metadata pool. I'm curious which SSDs you're using; please write me privately, as I have history with QLC.
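For reference, the "carefully-crafted special CRUSH rule" mentioned above usually takes roughly this shape - a sketch only, with an arbitrary rule name and id, and with the caveat already given that losing one host still leaves the PGs with only K shards:

rule cephfs_ec42_two_per_host {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

It would typically be added by extracting the CRUSH map (ceph osd getcrushmap), decompiling it with crushtool -d, inserting the rule, recompiling with crushtool -c, loading it back with ceph osd setcrushmap, and then pointing the EC pool's crush_rule at it.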
[ceph-users] Re: ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)
> On 23.02.24 16:18, Christian Rohmann wrote:
> > I just noticed issues with ceph-crash using the Debian/Ubuntu
> > packages (package: ceph-base):
> >
> > While the /var/lib/ceph/crash/posted folder is created by the package
> > install, it's not properly chowned to ceph:ceph by the postinst script.
...
> > You might want to check if you might be affected as well.
> > Failing to post crashes to the local cluster results in them not being
> > reported back via telemetry.
>
> Sorry to bluntly bump this again, but did nobody else notice this on
> your clusters?
> Call me egoistic, but the more clusters return crash reports, the more
> stable my Ceph likely becomes ;-)

I do observe that the ownership does not match ceph:ceph on Debian with v17.2.7:

$ sudo ls -l /var/lib/ceph/crash | grep posted
drwxr-xr-x 2 root root 4096 Feb 10 19:23 posted

The issue seems to be that the postinst script does not chown recursively and only chowns the subdirectories directly under /var/lib/ceph:
https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/debian/ceph-base.postinst#L40

The rpm spec also handles the subdirectories under /var/lib/ceph, but it explicitly lists everything out instead of using globs, and it does include posted:
https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/ceph.spec.in#L1643

Tyler
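Until the packaging handles it, the usual workaround is a one-off ownership fix on the affected hosts; a sketch (the service name assumes a package-based, non-cephadm install):

# Fix ownership of the crash directory and restart the crash-posting agent
$ sudo chown -R ceph:ceph /var/lib/ceph/crash
$ sudo systemctl restart ceph-crash.service
# Pending reports should then appear in:
$ sudo ceph crash ls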