[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread David C.
I came across an enterprise NVMe drive used for the BlueFS DB whose
performance dropped sharply a few months after delivery (I won't name the
brand here, but it was not one of these three: Intel, Samsung, Micron).
Enabling bdev_enable_discard clearly impacted performance, but the option
also saved the platform after a few days of discarding.

IMHO the most important thing is to validate the behavior once the entire
flash media has been written to.
Still, this option has the merit of existing.

It seems to me that, ideally, there would not be several bdev_*discard
options; the task should be asynchronous, with the (D)iscard instructions
issued during a quieter period of activity (I see no impact if the
instructions are lost across an OSD reboot).
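
For reference, a minimal sketch of what enabling the existing options looks
like (assuming a BlueStore OSD on a release that has both options; whether
they take effect at runtime or need an OSD restart may vary by release):

  # Sketch only: issue discards to the BlueStore block device, from a
  # background thread rather than inline with extent release.
  ceph config set osd bdev_enable_discard true
  ceph config set osd bdev_async_discard true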


On Fri, 1 Mar 2024 at 19:17, Igor Fedotov wrote:

> I played with this feature a while ago and recall it had visible
> negative impact on user operations due to the need to submit tons of
> discard operations - effectively each data overwrite operation triggers
> one or more discard operation submission to disk.
>
> And I doubt this has been widely used if any.
>
> Nevertheless recently we've got a PR to rework some aspects of thread
> management for this stuff, see https://github.com/ceph/ceph/pull/55469
>
> The author claimed they needed this feature for their cluster so you
> might want to ask him about their user experience.
>
>
> W.r.t documentation - actually there are just two options
>
> - bdev_enable_discard - enables issuing discard to disk
>
> - bdev_async_discard - instructs whether discard requests are issued
> synchronously (along with disk extents release) or asynchronously (using
> a background thread).
>
> Thanks,
>
> Igor
>
> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> > Is there any update on this? Did someone test the option and has
> > performance values before and after?
> > Is there any good documentation regarding this option?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread Matt Vandermeulen
We have a specific set of drives for which we've had to enable 
bdev_enable_discard and bdev_async_discard in order to maintain 
acceptable performance on block clusters. I wrote the patch that Igor 
mentioned to try to send more parallel discards to the devices, but 
these particular drives seem to process them serially (based on the 
observed discard counts and latency at the device), which is 
unfortunate. We're also testing new firmware that should help alleviate 
some of the initial concerns about discards not keeping up, which is 
what prompted the patch in the first place.


Most of our drives do not need discards enabled (and definitely not 
synchronous discards) to maintain performance, unless we're doing a 
full-disk fio test or something similar where we're trying to find the 
drive's cliff profile. We've used OSD device classes to target the 
options at specific OSDs via the centralized config, which helps when we 
add new hosts that may have different drives, so the options aren't 
applied globally.
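
As a sketch of that approach (assuming the affected drives carry the 
"nvme" device class; substitute whatever class they actually use):

  # Target the discard options at one device class instead of setting
  # them globally.
  ceph config set osd/class:nvme bdev_enable_discard true
  ceph config set osd/class:nvme bdev_async_discard true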


Based on our experience, I wouldn't enable it unless you're seeing some 
sort of cliff-like behaviour as your OSDs run low on free space or become 
heavily fragmented. I would also treat bdev_async_discard = 1 as a 
requirement, so that discards don't block user IO. Keep an eye on the 
discards being sent to the devices and on the discard latency as well 
(via node_exporter, for example).
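
Something like the following works for that (a sketch; the exact column 
and metric names depend on your sysstat and node_exporter versions):

  # Recent iostat shows per-device discard rate and latency
  # (the d/s and d_await columns):
  iostat -x 1

  # node_exporter's diskstats collector exposes the same counters, e.g.:
  #   rate(node_disk_discards_completed_total[5m])
  #   rate(node_disk_discard_time_seconds_total[5m])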


Matt


On 2024-03-02 06:18, David C. wrote:

I came across an enterprise NVMe used for BlueFS DB whose performance
dropped sharply after a few months of delivery (I won't mention the brand
here but it was not among these 3: Intel, Samsung, Micron).
It is clear that enabling bdev_enable_discard impacted performance, but
this option also saved the platform after a few days of discard.

IMHO the most important thing is to validate the behavior when there has
been a write to the entire flash media.
But this option has the merit of existing.

it seems to me that the ideal would be not to have several options on
bdev_*discard, and that this task should be asynchronous and with the
(D)iscard instructions during a calmer period of activity (I do not see any
impact if the instructions are lost during an OSD reboot)


On Fri, 1 Mar 2024 at 19:17, Igor Fedotov wrote:



I played with this feature a while ago and recall it had visible
negative impact on user operations due to the need to submit tons of
discard operations - effectively each data overwrite operation triggers
one or more discard operation submission to disk.

And I doubt this has been widely used if any.

Nevertheless recently we've got a PR to rework some aspects of thread
management for this stuff, see https://github.com/ceph/ceph/pull/55469

The author claimed they needed this feature for their cluster so you
might want to ask him about their user experience.


W.r.t documentation - actually there are just two options

- bdev_enable_discard - enables issuing discard to disk

- bdev_async_discard - instructs whether discard requests are issued
synchronously (along with disk extents release) or asynchronously (using
a background thread).

Thanks,

Igor

On 01/03/2024 13:06, jst...@proxforge.de wrote:
> Is there any update on this? Did someone test the option and has
> performance values before and after?
> Is there any good documentation regarding this option?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread David C.
Could we not consider setting up a “bluefstrim” that could be orchestrated?

This would avoid having a continuous stream of (D)iscard instructions going
to the disks during activity.

A weekly (or even monthly) bluefstrim would probably be enough for the
platforms that really need it.
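
For comparison, this is how ordinary filesystems already handle it: a
periodic trim driven by a timer instead of inline discards. It does not
apply to BlueStore's raw devices; it is just the model I have in mind for a
hypothetical bluefstrim:

  # fstrim.timer batches discards weekly for mounted filesystems; an
  # orchestrated bluefstrim would presumably be scheduled similarly.
  systemctl enable --now fstrim.timer
  systemctl list-timers fstrim.timer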


On Sat, 2 Mar 2024 at 12:58, Matt Vandermeulen wrote:

> We've had a specific set of drives that we've had to enable
> bdev_enable_discard and bdev_async_discard for in order to maintain
> acceptable performance on block clusters. I wrote the patch that Igor
> mentioned in order to try and send more parallel discards to the
> devices, but these ones in particular seem to process them in serial
> (based on observed discard counts and latency going to the device),
> which is unfortunate. We're also testing new firmware that suggests it
> should help alleviate some of the initial concerns we had about discards
> not keeping up which prompted the patch in the first place.
>
> Most of our drives do not need discards enabled (and definitely not
> without async) in order to maintain performance unless we're doing a
> full disk fio test or something like that where we're trying to find its
> cliff profile. We've used OSD classes to help target the options being
> applied to specific OSDs via centralized conf which helps when we would
> add new hosts that may have different drives so that the options weren't
> applied globally.
>
> Based on our experience, I wouldn't enable it unless you're seeing some
> sort of cliff-like behaviour as your OSDs run low on free space, or are
> heavily fragmented. I would also deem bdev_async_enabled = 1 to be a
> requirement so that it doesn't block user IO. Keep an eye on your
> discards being sent to devices and the discard latency, as well (via
> node_exporter, for example).
>
> Matt
>
>
> On 2024-03-02 06:18, David C. wrote:
> > I came across an enterprise NVMe used for BlueFS DB whose performance
> > dropped sharply after a few months of delivery (I won't mention the
> > brand
> > here but it was not among these 3: Intel, Samsung, Micron).
> > It is clear that enabling bdev_enable_discard impacted performance, but
> > this option also saved the platform after a few days of discard.
> >
> > IMHO the most important thing is to validate the behavior when there
> > has
> > been a write to the entire flash media.
> > But this option has the merit of existing.
> >
> > it seems to me that the ideal would be not to have several options on
> > bdev_*discard, and that this task should be asynchronous and with the
> > (D)iscard instructions during a calmer period of activity (I do not see
> > any
> > impact if the instructions are lost during an OSD reboot)
> >
> >
> > On Fri, 1 Mar 2024 at 19:17, Igor Fedotov wrote:
> >
> >> I played with this feature a while ago and recall it had visible
> >> negative impact on user operations due to the need to submit tons of
> >> discard operations - effectively each data overwrite operation
> >> triggers
> >> one or more discard operation submission to disk.
> >>
> >> And I doubt this has been widely used if any.
> >>
> >> Nevertheless recently we've got a PR to rework some aspects of thread
> >> management for this stuff, see https://github.com/ceph/ceph/pull/55469
> >>
> >> The author claimed they needed this feature for their cluster so you
> >> might want to ask him about their user experience.
> >>
> >>
> >> W.r.t documentation - actually there are just two options
> >>
> >> - bdev_enable_discard - enables issuing discard to disk
> >>
> >> - bdev_async_discard - instructs whether discard requests are issued
> >> synchronously (along with disk extents release) or asynchronously
> >> (using
> >> a background thread).
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> >> > Is there any update on this? Did someone test the option and has
> >> > performance values before and after?
> >> > Is there any good documentation regarding this option?
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread Joshua Baergen
Periodic discard was actually attempted in the past:
https://github.com/ceph/ceph/pull/20723

A proper implementation would probably need appropriate
scheduling/throttling that can be tuned so as to balance against
client I/O impact.
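
In the meantime, on a release that includes the PR Igor linked, the number
of background discard threads should become tunable. I believe the option
is the one below, but that is an assumption based on the PR rather than
something I have verified, so check your release's documentation first:

  # Assumed option name from https://github.com/ceph/ceph/pull/55469;
  # confirm with `ceph config help bdev_async_discard_threads`.
  ceph config set osd bdev_async_discard_threads 2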

Josh

On Sat, Mar 2, 2024 at 6:20 AM David C.  wrote:
>
> Could we not consider setting up a “bluefstrim” which could be orchestrated
> ?
>
> This would avoid having a continuous stream of (D)iscard instructions on
> the disks during activity.
>
> A weekly (probably monthly) bluefstrim could probably be enough for
> platforms that really need it.
>
>
> On Sat, 2 Mar 2024 at 12:58, Matt Vandermeulen wrote:
>
> > We've had a specific set of drives that we've had to enable
> > bdev_enable_discard and bdev_async_discard for in order to maintain
> > acceptable performance on block clusters. I wrote the patch that Igor
> > mentioned in order to try and send more parallel discards to the
> > devices, but these ones in particular seem to process them in serial
> > (based on observed discard counts and latency going to the device),
> > which is unfortunate. We're also testing new firmware that suggests it
> > should help alleviate some of the initial concerns we had about discards
> > not keeping up which prompted the patch in the first place.
> >
> > Most of our drives do not need discards enabled (and definitely not
> > without async) in order to maintain performance unless we're doing a
> > full disk fio test or something like that where we're trying to find its
> > cliff profile. We've used OSD classes to help target the options being
> > applied to specific OSDs via centralized conf which helps when we would
> > add new hosts that may have different drives so that the options weren't
> > applied globally.
> >
> > Based on our experience, I wouldn't enable it unless you're seeing some
> > sort of cliff-like behaviour as your OSDs run low on free space, or are
> > heavily fragmented. I would also deem bdev_async_enabled = 1 to be a
> > requirement so that it doesn't block user IO. Keep an eye on your
> > discards being sent to devices and the discard latency, as well (via
> > node_exporter, for example).
> >
> > Matt
> >
> >
> > On 2024-03-02 06:18, David C. wrote:
> > > I came across an enterprise NVMe used for BlueFS DB whose performance
> > > dropped sharply after a few months of delivery (I won't mention the
> > > brand
> > > here but it was not among these 3: Intel, Samsung, Micron).
> > > It is clear that enabling bdev_enable_discard impacted performance, but
> > > this option also saved the platform after a few days of discard.
> > >
> > > IMHO the most important thing is to validate the behavior when there
> > > has
> > > been a write to the entire flash media.
> > > But this option has the merit of existing.
> > >
> > > it seems to me that the ideal would be not to have several options on
> > > bdev_*discard, and that this task should be asynchronous and with the
> > > (D)iscard instructions during a calmer period of activity (I do not see
> > > any
> > > impact if the instructions are lost during an OSD reboot)
> > >
> > >
> > > On Fri, 1 Mar 2024 at 19:17, Igor Fedotov wrote:
> > >
> > >> I played with this feature a while ago and recall it had visible
> > >> negative impact on user operations due to the need to submit tons of
> > >> discard operations - effectively each data overwrite operation
> > >> triggers
> > >> one or more discard operation submission to disk.
> > >>
> > >> And I doubt this has been widely used if any.
> > >>
> > >> Nevertheless recently we've got a PR to rework some aspects of thread
> > >> management for this stuff, see https://github.com/ceph/ceph/pull/55469
> > >>
> > >> The author claimed they needed this feature for their cluster so you
> > >> might want to ask him about their user experience.
> > >>
> > >>
> > >> W.r.t documentation - actually there are just two options
> > >>
> > >> - bdev_enable_discard - enables issuing discard to disk
> > >>
> > >> - bdev_async_discard - instructs whether discard requests are issued
> > >> synchronously (along with disk extents release) or asynchronously
> > >> (using
> > >> a background thread).
> > >>
> > >> Thanks,
> > >>
> > >> Igor
> > >>
> > >> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> > >> > Is there any update on this? Did someone test the option and has
> > >> > performance values before and after?
> > >> > Is there any good documentation regarding this option?
> > >> > ___
> > >> > ceph-users mailing list -- ceph-users@ceph.io
> > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >> ___
> > >> ceph-users mailing list -- ceph-users@ceph.io
> > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > >>
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@c

[ceph-users] Question about erasure coding on cephfs

2024-03-02 Thread Erich Weiler

Hi Y'all,

We have a new ceph cluster online that looks like this:

md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs

The CephFS storage is using erasure coding at 4:2.  The CRUSH failure 
domain is set to "osd".


(I know that's not optimal but let me get to that in a minute)

We currently have a single standalone NFS server (nfs-01) with the same 
storage as the OSD servers above (twenty 30TB NVMe disks).  We want to 
wipe the NFS server and integrate it into the above Ceph cluster as 
"store-03".  At that point we would have three OSD servers, and we would 
then switch the CRUSH failure domain to "host".


My question is this:  Given that we have 4:2 erasure coding, would the 
data rebalance evenly across the three OSD servers after we add store-03 
such that if a single OSD server went down, the other two would be 
enough to keep the system online?  Like, with 4:2 erasure coding, would 
2 shards go on store-01, then 2 shards on store-02, and then 2 shards on 
store-03?  Is that how I understand it?
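
For what it's worth, I assume the actual shard placement can be checked 
directly once the pool exists (the pool name below is just a placeholder):

  ceph osd pool get cephfs.data crush_rule
  ceph pg ls-by-pool cephfs.data | head   # UP/ACTING show which OSDs hold each shard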


Thanks for any insight!

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about erasure coding on cephfs

2024-03-02 Thread Anthony D'Atri


> On Mar 2, 2024, at 10:37 AM, Erich Weiler  wrote:
> 
> Hi Y'all,
> 
> We have a new ceph cluster online that looks like this:
> 
> md-01 : monitor, manager, mds
> md-02 : monitor, manager, mds
> md-03 : monitor, manager
> store-01 : twenty 30TB NVMe OSDs
> store-02 : twenty 30TB NVMe OSDs
> 
> The cephfs storage is using erasure coding at 4:2.  The crush domain is set 
> to "osd".
> 
> (I know that's not optimal but let me get to that in a minute)
> 
> We have a current regular single NFS server (nfs-01) with the same storage as 
> the OSD servers above (twenty 30TB NVME disks).  We want to wipe the NFS 
> server and integrate it into the above ceph cluster as "store-03".  When we 
> do that, we would then have three OSD servers.  We would then switch the 
> crush domain to "host".
> 
> My question is this:  Given that we have 4:2 erasure coding, would the data 
> rebalance evenly across the three OSD servers after we add store-03 such that 
> if a single OSD server went down, the other two would be enough to keep the 
> system online?  Like, with 4:2 erasure coding, would 2 shards go on store-01, 
> then 2 shards on store-02, and then 2 shards on store-03?  Is that how I 
> understand it?

Nope.  If the failure domain is *host*, without a carefully-crafted special 
CRUSH rule, CRUSH will want to spread the 6 shards over 6 failure domains, and 
you will only have 3.  I don’t remember for sure if the PGs would be stuck 
remapped or stuck unable to activate, but either way you would have a very bad 
day.

Say you craft a CRUSH rule that places two shards on each host.  One host goes 
down, and you have at most K shards up.  IIRC the PGs will be `inactive` but 
you won’t lose existing data. 
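
For reference, a sketch of such a rule for 4:2 across exactly three hosts 
(rule name, id, and the `default` root are illustrative; validate against a 
copy of your CRUSH map with crushtool before applying anything):

  rule ec42_two_per_host {
      id 99
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 3 type host
      step chooseleaf indep 2 type osd
      step emit
  }

That buys host-level placement with only three hosts, at the cost described 
above: lose one host and each PG is left with exactly K shards.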

Here are several reasons why, for small deployments, I favor 1U servers:

* Having enough servers so that if one is down, service can proceed
* Having enough failure domains to do EC — or at least replication — safely
* Is your networking a bottleneck?


Assuming these are QLC SSDs, do you have their min_alloc_size set to match the 
IU?  Ideally you would mix in a couple of TLC OSDs on each server — in this 
case including the control plane — for the CephFS metadata pool.
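
If not, a sketch (the 64K value is purely illustrative; match the drive's 
actual IU, and note that min_alloc_size is baked in at OSD creation time, 
so it has to be set before the OSDs are deployed):

  # Illustrative value only; existing OSDs would need to be redeployed
  # for a change to take effect.
  ceph config set osd bluestore_min_alloc_size_ssd 65536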

I’m curious which SSDs you’re using, please write me privately as I have 
history with QLC.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)

2024-03-02 Thread Tyler Stachecki
> On 23.02.24 16:18, Christian Rohmann wrote:
> > I just noticed issues with ceph-crash using the Debian /Ubuntu
> > packages (package: ceph-base):
> >
> > While the /var/lib/ceph/crash/posted folder is created by the package
> > install,
> > it's not properly chowned to ceph:ceph by the postinst script.
...
> > You might want to check if you might be affected as well.
> > Failing to post crashes to the local cluster results in them not being
> > reported back via telemetry.
>
> Sorry to bluntly bump this again, but did nobody else notice this on
> your clusters?
> Call me egoistic, but the more clusters return crash reports the more
> stable my Ceph likely becomes ;-)

I do observe the ownership does not match ceph:ceph on Debian with v17.2.7.
$ sudo ls -l /var/lib/ceph/crash | grep posted
drwxr-xr-x 2 root root 4096 Feb 10 19:23 posted

The issue seems to be that the postinst script does not recursively
chown and only chowns subdirectories directly under /var/lib/ceph:
https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/debian/ceph-base.postinst#L40

The rpm spec also handles the subdirectories under /var/lib/ceph, but it
explicitly lists each path instead of using globs, and it does include
posted:
https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/ceph.spec.in#L1643
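
Until the packaging is fixed, the workaround on affected hosts is
presumably just a manual chown (unit name may differ in containerized
deployments):

  chown ceph:ceph /var/lib/ceph/crash/posted
  systemctl restart ceph-crash.service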

Tyler
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io