Hi David, 

I'm surprised your message hasn't gotten any response yet. I guess it depends on how 
many files your OSDs end up storing on the filesystem, which depends essentially on 
the use case. 

We're having similar issues with a 144-OSD cluster running 2 pools. Each one 
holds 100 M objects. One is replicated x3 (256 PGs) and the other is EC k=5, 
m=4 (512 PGs). 
That's 300 M + 900 M = 1.2 B files stored on XFS filesystems. 

We're observing that our PG subfolders only hold around 120 files each when 
they should hold around 320 (we're using the default split / merge values). 
All objects were created when the cluster was running Hammer. We're now running 
Jewel (RHCS 2.0, actually). 

We ran some tests on a Jewel backup infrastructure. Split happens at around 320 
files per directory, as expected. 
We have no idea why we're not seeing 320 files per PG subfolder on our 
production cluster pools. 
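For reference, the 320 figure comes from the filestore split threshold formula, 
assuming I have the Jewel defaults right (filestore_split_multiple = 2, 
filestore_merge_threshold = 10): 

    split threshold = filestore_split_multiple * abs(filestore_merge_threshold) * 16 
                    = 2 * 10 * 16 
                    = 320 files per subfolder 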

Everything we've read suggests raising the filestore_merge_threshold and 
filestore_split_multiple values to 40 / 8 (see the example snippet after the links below): 

https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
https://bugzilla.redhat.com/show_bug.cgi?id=1219974 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html 
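
For what it's worth, with the recommended values that would look something like 
this in ceph.conf (just a sketch, the 40 / 8 figures coming from the links above; 
the OSDs need to pick the new values up, via restart or injectargs): 

    [osd]
        filestore_merge_threshold = 40
        filestore_split_multiple = 8

With those values a subfolder would split at 8 * 40 * 16 = 5120 files, as David 
describes below. 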

We now need to merge directories (while you apparently need to split them :-)). 

We will do so by increasing filestore_merge_threshold in steps of 10, up to maybe 
120, before lowering it back to 40. 
Between each step we'll run 'rados bench' (in cleanup mode) on both pools to 
generate enough delete operations to trigger merge operations on each PG. 
By running the 'rados bench' at night, our clients shouldn't be impacted much by 
blocked requests. 
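
In practice, each step would look something like this (just a sketch; the pool 
names are placeholders, and whether filestore_merge_threshold really takes effect 
at runtime via injectargs is worth double checking, otherwise set it in ceph.conf 
and restart the OSDs): 

    # raise the merge threshold by one 10-unit step (20, 30, ... up to ~120)
    ceph tell osd.* injectargs '--filestore_merge_threshold 20'

    # rados bench cleans up the objects it wrote by default,
    # which generates the deletes that trigger the merges
    rados bench -p rep_pool 600 write
    rados bench -p ec_pool 600 write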

Running this on your cluster would also provoke splits when rados bench writes to 
the pools. 

Also, note that you can set the merge and split values for a specific OSD in 
ceph.conf (under an [osd.123] section), so you can see how that OSD reorganizes its 
PG trees when running a 'rados bench'. 
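
For example, the same two lines as above but scoped to a single OSD (a sketch, 
osd.123 standing in for whichever OSD you want to observe): 

    [osd.123]
        filestore_merge_threshold = 40
        filestore_split_multiple = 8

You can then watch the directory layout evolve under that OSD's data directory 
(typically /var/lib/ceph/osd/ceph-123/current/) while the bench runs. 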

Regarding the OSDs flapping, does this happen while scrubbing? You may be hitting 
the Jewel scrubbing bug Sage reported about 3 weeks ago (look for 'stalls caused by 
scrub on jewel'). 
It's fixed in 10.2.4 and is waiting on QA to make it into RHCS >= 2.0. 

We are impacted by this bug because we have a lot of objects (200k) per PG with, I 
think, bad split / merge values. Lowering vfs_cache_pressure to 1 might also help 
avoid the flapping. 
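
If you want to try that, it's just a sysctl (plus a line in /etc/sysctl.conf to 
make it persistent): 

    # keep inode/dentry caches around as long as possible
    sysctl -w vm.vfs_cache_pressure=1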

Regards, 

Frederic Nass, 
Université de Lorraine. 

----- On 27 Sep 16, at 0:42, David Turner <david.tur...@storagecraft.com> 
wrote: 

> We are running on Hammer 0.94.7 and have had very bad experiences with PG
> folders splitting a sub-directory further. OSDs being marked out, hundreds of
> blocked requests, etc. We have modified our settings and watched the behavior
> match the ceph documentation for splitting, but right now the subfolders are
> splitting outside of what the documentation says they should.

> filestore_split_multiple * abs(filestore_merge_threshold) * 16

> Our filestore_merge_threshold is set to 40. When we had our
> filestore_split_multiple set to 8, we were splitting subfolders when a
> subfolder had (8 * 40 * 16 = ) 5120 objects in the directory. In a different
> cluster we had to push that back again with elevated settings and the
> subfolders split when they had (16 * 40 * 16 = ) 10240 objects.

> We have another cluster that we're working with that is splitting at a value
> that seems to be a hardcoded maximum. The settings are (32 * 40 * 16 = ) 20480
> objects before it should split, but it seems to be splitting subfolders at
> 12800 objects.

> Normally I would expect this number to be a power of 2, but we recently found
> another hardcoded maximum where the object map only allows RBDs with a maximum of
> 256,000,000 objects in them. The 12800 matches that pattern of a power of 2
> followed by a run of zeros being used as a hardcoded maximum.

> Has anyone else encountered what seems to be a hardcoded maximum here? Are we
> missing a setting elsewhere that is capping us, or diminishing our value? Much
> more to the point, though, is there any way to mitigate how painful it is to
> split subfolders in PGs? So far it seems like the only way we can do it is to
> push up the setting to later drop it back down during a week that we plan to
> have our cluster plagued with blocked requests all while cranking our
> osd_heartbeat_grace so that we don't have flapping osds.

> A little more about our setup is that we have 32x 4TB HGST drives with 4x 
> 200GB
> Intel DC3710 journals (8 drives per journal), dual hyper-threaded octa-core
> Xeon (32 virtual cores), 192GB memory, 10Gb redundant network... per storage
> node.


>       David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943

>       If you are not the intended recipient of this message or received it
>       erroneously, please notify the sender and delete it, together with any
>       attachments, and be advised that any dissemination or copying of this 
> message
>       is prohibited.
