Coincidentally, we've been suffering from split-induced slow requests on
one of our clusters for the past week.

I wanted to add that it isn't at all obvious when slow requests are being
caused by filestore splitting. (Even with the filestore/osd debug levels
raised to 10, and probably even at 20, all you see is that an object write
is taking >30s, which seems totally absurd.) Only after a lot of head
scratching did I notice this thread and realize it could be the splitting --
sure enough, our PGs were crossing the 5120 object threshold one by one, at
a rate of around 5-10 PGs per hour.
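
For anyone hitting the same thing, this is roughly how we bumped the debug
levels at runtime while investigating (a minimal sketch -- the verbosity
you actually need is a judgment call):

   # temporarily raise filestore/osd debug verbosity on all OSDs
   ceph tell osd.* injectargs '--debug_filestore 10 --debug_osd 10'
   # and drop it back down once you've caught a slow op
   ceph tell osd.* injectargs '--debug_filestore 1 --debug_osd 0'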

I've just sent this PR for comments:

   https://github.com/ceph/ceph/pull/12421

IMHO, this (or something similar) would help operators a bunch in
identifying when this is happening.

Thanks!

Dan



On Fri, Dec 9, 2016 at 7:27 PM, David Turner <david.tur...@storagecraft.com>
wrote:

> Our 32k PGs each have about 25-30k objects (25-30GB per PG).  When we
> first contracted with Redhat support, they recommended that we set our
> threshold at about 4000 files per directory before splitting into
> subfolders.  When we split into subfolders with that setting, an
> osd_heartbeat_grace (how long an OSD can go without responding before it
> is reported down to the MONs) of 60 was needed to keep OSDs from flapping
> during subfolder splitting.
>
> With the plan to go back and lower the setting again later, we would
> increase that setting to make it through a holiday weekend or a time when
> we needed higher performance.  When we went to lower it, it was too painful
> to get through, and now we're at what looks like a hardcoded maximum of
> 12,800 objects per subfolder before a split is forced.  At the number of
> objects we have now, we need an osd_heartbeat_grace of 240 to avoid
> flapping OSDs during subfolder splitting.
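>
> In case it helps anyone, raising that grace at runtime (rather than
> restarting OSDs) can be done with injectargs; a rough sketch, with the
> value being whatever your splits require:
>
>    # relax the heartbeat grace on OSDs and MONs while splits are running
>    ceph tell osd.* injectargs '--osd_heartbeat_grace 240'
>    ceph tell mon.* injectargs '--osd_heartbeat_grace 240'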
>
> Unless you NEED to merge your subfolders, you can set your filestore merge
> threshold to a negative number and it will never merge.  The equation that
> decides when to split takes the absolute value of the merge threshold, so
> flipping it to a negative value disables merging without changing the
> splitting behavior.
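>
> As a concrete sketch of that (the values here are just the ones discussed
> in this thread), the ceph.conf change would look something like:
>
>    [osd]
>    filestore merge threshold = -40
>    filestore split multiple = 8
>
> The split point stays at 8 * abs(-40) * 16 = 5120 objects, but merging is
> disabled by the negative sign.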
>
> The OSD flapping is unrelated to the 10.2.3 bug.  We're currently on
> 0.94.7 and have had this problem since Firefly.  The flapping is due to the
> OSD being so busy splitting the subfolder that it isn't responding to
> other requests; that's why raising osd_heartbeat_grace gets us through the
> splitting.
>
> 1) We do not have SELinux installed on our Ubuntu servers.
>
> 2) We monitor and manage our fragmentation and haven't seen much of an
> issue since we increased our alloc_size in the mount options for XFS.
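>
> For reference, the XFS mount option in question is allocsize; a rough
> example of an OSD data mount (device, path, and the 4M value are only
> illustrative) would be:
>
>    /dev/sdb1  /var/lib/ceph/osd/ceph-12  xfs  rw,noatime,inode64,allocsize=4M  0 0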
>
> "5) pre-splitting PGs is I think the right answer."  Pre-splitting PGs is
> counter-intuitive.  It's a good theory, but an ineffective practice.  When
> a PG backfills to a new OSD it builds the directory structure according to
> the current settings of how deep the folder structure should be.  So if you
> lose a drive or add storage, all of the PGs that move are no longer
> pre-split to where you think they are.  We have seen multiple times where
> PGs are different depths on different OSDs.  It is not a PG state as to how
> deep it's folder structure is, but a local state per copy of the PG on each
> OSD.
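>
> You can check this yourself by comparing how deep the same PG's directory
> tree goes on different OSDs; a quick sketch (the OSD path and PG id are
> only placeholders):
>
>    # print the deepest directory level of PG 3.1f on this OSD
>    find /var/lib/ceph/osd/ceph-12/current/3.1f_head -type d \
>        | awk -F/ '{print NF}' | sort -n | tail -1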
>
>
> Ultimately we're looking to Bluestore to be our Knight in Shining Armor to
> come and save us from all of this, but in the meantime, I have a couple
> ideas for how to keep our clusters usable.
>
> We add storage regularly without our cluster becoming completely unusable.
> I took that idea and am testing it with some OSDs: weight the OSDs to 0,
> backfill all of the data off, restart them with new split/merge thresholds,
> and backfill data back onto them.  This rebuilds the PGs on those OSDs
> with the current settings and gets us away from the 12,800-object setting
> we're stuck at now.  The next round will weight the next set of drives to 0
> while we start to backfill onto the previous drives with the new settings.
>  I have some very efficient weighting techniques that keep the cluster
> balanced while doing this, but it did take 2 days to finish backfilling off
> of the 32 drives.  Cluster performance was fairly poor during this, and I
> can only do 3 of our 30 nodes at a time... which is a long time to run in
> a degraded state.
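>
> The weighting itself is nothing exotic -- roughly this, per batch of OSDs
> (the ids and weights are only illustrative):
>
>    # drain an OSD so its PGs are rebuilt elsewhere
>    ceph osd crush reweight osd.12 0
>    # wait for backfill to finish, restart the OSD with the new split/merge
>    # thresholds in ceph.conf, then weight it back up
>    ceph osd crush reweight osd.12 3.64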
>
> The modification to the ceph-objectstore-tool in 10.2.4 and 0.94.10 looks
> very promising to help us manage this.  Doing the splits offline would work
> out quite well for us.  We're testing our QA environment with 10.2.3 and
> are putting some of that testing on hold until 10.2.4 is fixed.
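>
> As we understand the release notes, the offline split would be done per
> OSD with the daemon stopped, along these lines (syntax as we read it --
> verify against your actual version before relying on it):
>
>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
>        --op apply-layout-settings --pool rbd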
>
> ------------------------------
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer |
> StorageCraft Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> ------------------------------
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> ------------------------------
>
> ________________________________________
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mark
> Nelson [mnel...@redhat.com]
> Sent: Thursday, December 08, 2016 10:25 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] filestore_split_multiple hardcoded maximum?
>
>
> I don't want to retype it all, but you guys might be interested in the
> discussion under section 3 of this post here:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
>
> basically the gist of it is:
>
> 1) Make sure SELinux isn't doing security xattr lookups for link/unlink
> operations (this makes splitting incredibly painful!).  You may need to
> disable SELinux.
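>
> For reference, fully disabling it generally means editing
> /etc/selinux/config and rebooting; merely switching to permissive may not
> avoid the xattr lookups.  A sketch:
>
>    sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
>    reboot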
>
> 2) xfs sticks files in a given directory in the same AG (i.e. portion of
> the disk), but a subdirectory may end up in a different AG than its
> parent directory.  As the split depth grows, so does fragmentation, due
> to files from the parent directories moving into new subdirectories
> that have a different AG.
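>
> A quick way to gauge how bad the fragmentation has gotten on a given OSD
> (read-only; the device name is just an example):
>
>    xfs_db -r -c frag /dev/sdb1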
>
> 3) Increasing the split threshold is an option, but more files in a single
> directory will cause readdir to slow down.  The effect is fairly minimal
> even at ~10k files relative to the other costs involved.
>
> 4) The bigger issue is that high split thresholds require more work to
> happen for every split, but this is somewhat offset because splits tend to
> happen over a larger time range, due to the inherent randomness in pg
> data distribution being amplified.  Still, when compounded with point 1
> above, large splits can be debilitating when they do happen.
>
> 5) pre-splitting PGs is I think the right answer.  It should greatly
> delay the onset of directory fragmentation, avoid a lot of early
> linking/relinking, and in some cases (like RBD) potentially avoid any
> additional splits altogether.  The cost is increased inode cache misses
> when there aren't many objects in the cluster yet.  This could make
> benchmarks on fresh clusters slower, but yield better behavior as the
> cluster grows.
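>
> For pools you already know will be large, the pre-split can be requested
> at pool creation time; something like the following (the pool name, PG
> count, ruleset, and object estimate are placeholders, and iirc the
> filestore merge threshold has to be negative for the pre-split to
> actually happen):
>
>    ceph osd pool create mypool 4096 4096 replicated \
>        replicated_ruleset 100000000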
>
> Mark
>
> On 12/08/2016 05:23 AM, Frédéric Nass wrote:
> > Hi David,
> >
> > I'm surprised your message hasn't gotten any echo yet. I guess it depends
> > on how many files your OSDs end up storing on the filesystem, which
> > depends essentially on the use case.
> >
> > We're having similar issues with a 144-OSD cluster running 2 pools. Each
> > one holds 100 M objects. One is replicated x3 (256 PGs) and the other is
> > EC k=5, m=4 (512 PGs).
> > That's 300 M + 900 M = 1.2 B files stored on XFS filesystems.
> >
> > We're observing that our PG subfolders only hold around 120 files each
> > when they should hold around 320 (we're using the default split / merge
> > values).
> > All objects were created when the cluster was running Hammer. We're now
> > running Jewel (RHCS 2.0 actually).
> >
> > We ran some tests on a Jewel backup infrastructure. Split happens at
> > around 320 files per directory, as expected.
> > We have no idea why we're not seeing 320 files per PG subfolder on our
> > production cluster pools.
> >
> > Everything we read suggests raising the filestore_merge_threshold and
> > filestore_split_multiple values to 40 / 8:
> >
> > https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
> > https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
> >
> > We now need to merge directories (while you apparently need to split :-))
> >
> > We will do so by increasing the filestore_merge_threshold in steps of 10
> > units, up to maybe 120, before lowering it back to 40.
> > Between each step we'll run 'rados bench' (in cleanup mode) on both
> > pools to generate enough delete operations to trigger merge operations
> > on each PG.
> > By running the 'rados bench' at night, our clients won't be much impacted
> > by blocked requests.
> >
> > Running this on your cluster would also provoke splits when rados bench
> > writes to the pools.
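> >
> > Something along these lines per pool (duration, block size and thread
> > count are only examples):
> >
> >    # write enough objects to cross the split threshold, then delete
> >    # them again to trigger merges
> >    rados -p mypool bench 3600 write -b 4096 -t 16 --no-cleanup
> >    rados -p mypool cleanup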
> >
> > Also, note that you can set the merge and split values for a specific OSD
> > in ceph.conf ([osd.123]) so you can see how that OSD reorganizes the PG
> > tree when running a 'rados bench'.
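> >
> > For example (OSD id and values are only illustrative):
> >
> >    [osd.123]
> >    filestore merge threshold = 40
> >    filestore split multiple = 8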
> >
> > Regarding the OSDs flapping, does this happen during scrubbing? You may
> > be hitting the Jewel scrubbing bug Sage reported about 3 weeks ago (look
> > for 'stalls caused by scrub on jewel').
> > It's fixed in 10.2.4 and is waiting on QA to make it into RHCS >= 2.0.
> >
> > We are impacted by this bug because we have a lot of objects (200k) per
> > PG with, I think, bad split / merge values. Lowering vfs_cache_pressure
> > to 1 might also help avoid the flapping.
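> >
> > That's just the usual sysctl, e.g.:
> >
> >    sysctl vm.vfs_cache_pressure=1
> >    # or persistently, in /etc/sysctl.conf:
> >    #   vm.vfs_cache_pressure = 1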
> >
> > Regards,
> >
> > Frederic Nass,
> > Université de Lorraine.
> >
> > ----- On 27 Sep 16, at 0:42, David Turner <david.tur...@storagecraft.com>
> > wrote:
> >
> >     We are running on Hammer 0.94.7 and have had very bad experiences
> >     with PG folders splitting a sub-directory further: OSDs being
> >     marked out, hundreds of blocked requests, etc.  We have modified our
> >     settings and watched the behavior match the ceph documentation for
> >     splitting, but right now the subfolders are splitting at a point
> >     other than what the documentation says they should.
> >
> >     filestore_split_multiple * abs(filestore_merge_threshold) * 16
> >
> >     Our filestore_merge_threshold is set to 40.  When we had our
> >     filestore_split_multiple set to 8, we were splitting subfolders when
> >     a subfolder had (8 * 40 * 16 = ) 5120 objects in the directory.  In
> >     a different cluster we had to push that back again with elevated
> >     settings and the subfolders split when they had (16 * 40 * 16 = )
> >     10240 objects.
> >
> >     We have another cluster that we're working with that is splitting at
> >     a value that seems to be a hardcoded maximum.  The settings are (32
> >     * 40 * 16 = ) 20480 objects before it should split, but it seems to
> >     be splitting subfolders at 12800 objects.
> >
> >     Normally I would expect this number to be a power of 2, but we
> >     recently found another hardcoded maximum where the object map only
> >     allows RBDs with a maximum of 256,000,000 objects in them.  The
> >     12800 matches that pattern of a power of 2 followed by a string of
> >     zeros as a hardcoded maximum.
> >
> >     Has anyone else encountered what seems to be a hardcoded maximum
> >     here?  Are we missing a setting elsewhere that is capping us, or
> >     diminishing our value?  Much more to the point, though, is there any
> >     way to mitigate how painful it is to split subfolders in PGs?  So
> >     far it seems like the only way we can do it is to push the setting
> >     up and later drop it back down during a week when we plan to have
> >     our cluster plagued with blocked requests, all while cranking up
> >     osd_heartbeat_grace so that we don't have flapping OSDs.
> >
> >     A little more about our setup is that we have 32x 4TB HGST drives
> >     with 4x 200GB Intel DC3710 journals (8 drives per journal), dual
> >     hyper-threaded octa-core Xeon (32 virtual cores), 192GB memory, 10Gb
> >     redundant network... per storage node.
> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
