Coincidentally, we've been suffering from split-induced slow requests on one of our clusters for the past week.
I wanted to add that it isn't at all obvious when slow requests are being
caused by filestore splitting. (Even with the filestore/osd log levels raised
to 10, or even 20, all you see is that an object write is taking >30s, which
seems totally absurd.) Only after a lot of head scratching did I notice this
thread and realize it could be the splitting -- and sure enough, our PGs were
crossing the 5120-object threshold one by one, at a rate of around 5-10 PGs
per hour.

I've just sent this PR for comments: https://github.com/ceph/ceph/pull/12421

IMHO, this (or something similar) would help operators a lot in identifying
when this is happening.

Thanks!

Dan

On Fri, Dec 9, 2016 at 7:27 PM, David Turner <david.tur...@storagecraft.com> wrote:
> Our 32k PGs each have about 25-30k objects (25-30GB per PG). When we first
> contracted with Redhat support, they recommended we aim for about 4000
> files per directory before splitting into subfolders. With that setting,
> an osd_heartbeat_grace (how long an OSD can be unreachable before it is
> reported down to the MONs) of 60 was needed to keep OSDs from flapping
> during subfolder splitting.
>
> We would raise that setting to get through a holiday weekend or a period
> when we needed higher performance, planning to lower it again afterwards.
> When we went to lower it, the merging was too painful to get through, and
> now we're at what looks like a hardcoded maximum of 12,800 objects per
> subfolder before a split is forced. At our current object counts we have
> to use an osd_heartbeat_grace of 240 to avoid flapping OSDs during
> subfolder splitting.
>
> Unless you NEED to merge your subfolders, you can set your filestore merge
> threshold to a negative number and it will never merge. The equation that
> decides when to split further takes the absolute value of the merge
> threshold, so inverting it to a negative number disables merging without
> changing the splitting behavior.
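(A note for anyone who finds this thread in the archives: the knobs David is
describing live in ceph.conf. A rough sketch with purely illustrative values,
not a recommendation:

    [osd]
        # a leaf directory splits once it reaches
        # filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects
        filestore_split_multiple = 8
        # a negative value disables merging; the split formula still uses abs()
        filestore_merge_threshold = -40
        # seconds an OSD may be unresponsive before peers report it down;
        # raising this helps ride out long splits
        osd_heartbeat_grace = 60

With these example values a leaf directory would split at 8 * 40 * 16 = 5120
objects, which matches the threshold our PGs were crossing.)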
> The OSD flapping is unrelated to the 10.2.3 bug. We're currently on 0.94.7
> and have had this problem since Firefly. The flapping happens because the
> OSD is so busy splitting the subfolder that it isn't responding to other
> requests; that's why raising osd_heartbeat_grace gets us through the
> splitting.
>
> 1) We do not have SELinux installed on our Ubuntu servers.
>
> 2) We monitor and manage our fragmentation and haven't seen much of an
> issue since we increased the alloc_size in our XFS mount options.
>
> "5) pre-splitting PGs is I think the right answer." Pre-splitting PGs is
> counter-intuitive. It's a good theory, but an ineffective practice. When a
> PG backfills to a new OSD it builds the directory structure according to
> the current settings for how deep the folder structure should be. So if
> you lose a drive or add storage, all of the PGs that move are no longer
> pre-split the way you think they are. We have seen multiple times that PGs
> sit at different depths on different OSDs. How deep a PG's folder
> structure goes isn't PG-wide state, but local state of each copy of the PG
> on each OSD.
>
> Ultimately we're looking to Bluestore to be our knight in shining armor
> and save us from all of this, but in the meantime I have a couple of ideas
> for how to keep our clusters usable.
>
> We add storage regularly without our cluster being completely unusable. I
> took that idea and am testing it on some OSDs: weight the OSDs to 0,
> backfill all of the data off, restart them with new split/merge
> thresholds, and backfill the data back onto them. This builds the PGs on
> those OSDs with the current settings and gets us away from the
> 12,800-object setting we're stuck at now. The next round will weight the
> next set of drives to 0 while we start backfilling onto the previous
> drives with the new settings. I have some very efficient weighting
> techniques that keep the cluster balanced while doing this, but it did
> take 2 days to finish backfilling off of the 32 drives. Cluster
> performance was fairly poor during this, and I can only do 3 of our 30
> nodes at a time... which is a long time to run in a degraded state.
>
> The modification to ceph-objectstore-tool in 10.2.4 and 0.94.10 looks very
> promising to help us manage this. Doing the splits offline would work out
> quite well for us. We're testing our QA environment with 10.2.3 and are
> putting some of that testing on hold until 10.2.4 is fixed.
>
> ________________________________________
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mark Nelson [mnel...@redhat.com]
> Sent: Thursday, December 08, 2016 10:25 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] filestore_split_multiple hardcoded maximum?
>
> I don't want to retype it all, but you guys might be interested in the
> discussion under section 3 of this post:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
>
> Basically the gist of it is:
>
> 1) Make sure SELinux isn't doing security xattr lookups for link/unlink
> operations (this makes splitting incredibly painful!). You may need to
> disable SELinux.
>
> 2) xfs sticks the files in a given directory into the same AG (i.e.
> portion of the disk), but a subdirectory may end up in a different AG than
> its parent directory. As the split depth grows, so does fragmentation,
> because files from the parent directories move into new subdirectories
> that live in a different AG.
>
> 3) Increasing the split threshold is an option, but more files in a single
> directory will cause readdir to slow down. The effect is fairly minimal
> even at ~10k files relative to the other costs involved.
>
> 4) The bigger issue is that high split thresholds require more work for
> every split, though this is somewhat offset because splits tend to happen
> over a larger time range, as the inherent randomness in pg data
> distribution is amplified. Still, when compounded with point 1 above,
> large splits can be debilitating when they happen.
>
> 5) Pre-splitting PGs is, I think, the right answer. It should greatly
> delay the onset of directory fragmentation, avoid a lot of early
> linking/relinking, and in some cases (like RBD) potentially avoid any
> additional splits altogether. The cost is increased inode cache misses
> when there aren't many objects in the cluster yet. This could make
> benchmarks on fresh clusters slower, but yield better behavior as the
> cluster grows.
>
> Mark
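(A note on point 5: as far as I know, pre-splitting is done by passing an
expected_num_objects hint at pool creation time, and it only takes effect
when the filestore merge threshold is negative, if I remember the docs
correctly. Roughly, with the pool name, PG counts, ruleset name and object
count all being placeholders:

    # pre-create the PG directory structure for an expected ~1 billion objects
    ceph osd pool create mypool 4096 4096 replicated replicated_ruleset 1000000000

That only helps pools created this way; as David points out above, backfill
rebuilds each PG copy's directory tree with whatever settings the receiving
OSD has at the time.)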
> On 12/08/2016 05:23 AM, Frédéric Nass wrote:
> > Hi David,
> >
> > I'm surprised your message hasn't gotten any echo yet. I guess it
> > depends on how many files your OSDs get to store on the filesystem,
> > which depends essentially on the use case.
> >
> > We're having similar issues with a 144-osd cluster running 2 pools. Each
> > one holds 100 M objects. One is replication x3 (256 PGs) and the other
> > is EC k=5, m=4 (512 PGs). That's 300 M + 900 M = 1.2 B files stored on
> > XFS filesystems.
> >
> > We're observing that our PG subfolders only hold around 120 files each
> > when they should hold around 320 (we're using the default split / merge
> > values). All objects were created while the cluster was running Hammer.
> > We're now running Jewel (RHCS 2.0 actually).
> >
> > We ran some tests on a Jewel backup infrastructure. Splitting happens at
> > around 320 files per directory, as expected. We have no idea why we're
> > not seeing 320 files per PG subfolder on our production cluster pools.
> >
> > Everything we read suggests raising the filestore_merge_threshold and
> > filestore_split_multiple values to 40 / 8:
> >
> > https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
> > https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
> >
> > So we now need to merge directories (while you apparently need to split :-)
> >
> > We will do so by raising the filestore_merge_threshold in steps of 10,
> > up to maybe 120, before lowering it back to 40. Between steps we'll run
> > 'rados bench' (in cleanup mode) on both pools to generate enough delete
> > operations to trigger merges on each PG. By running the 'rados bench' at
> > night, our clients won't be much impacted by blocked requests.
> >
> > Running this on your cluster would also provoke splits when rados bench
> > writes to the pools.
> >
> > Also, note that you can set merge and split values for a specific OSD in
> > ceph.conf ([osd.123]), so you can see how that OSD reorganizes its PG
> > trees when running a 'rados bench'.
> >
> > Regarding the OSD flapping: does this happen during scrubbing? You may
> > be hitting the Jewel scrubbing bug Sage reported about 3 weeks ago (look
> > for 'stalls caused by scrub on jewel'). It's fixed in 10.2.4 and is
> > waiting on QA to make it into RHCS >= 2.0.
> >
> > We are impacted by this bug because we have a lot of objects (200k) per
> > PG with, I think, bad split / merge values. Lowering vfs_cache_pressure
> > to 1 might also help to avoid the flapping.
> >
> > Regards,
> >
> > Frederic Nass,
> > Université de Lorraine.
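(For reference, the per-OSD override Frédéric mentions would look something
like the following in ceph.conf; the OSD id and values here are only
placeholders:

    [osd.123]
        # try the new merge/split values on a single OSD first
        filestore_merge_threshold = 40
        filestore_split_multiple = 8

and a run like this should generate the writes and, via its automatic
cleanup -- unless --no-cleanup is passed, as far as I recall -- the deletes
that trigger splits and merges:

    rados -p testpool bench 600 write -t 16
)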
> > ----- On 27 Sep 16, at 0:42, David Turner <david.tur...@storagecraft.com>
> > wrote:
> >
> > We are running on Hammer 0.94.7 and have had very bad experiences with
> > PG folders splitting a sub-directory further: OSDs being marked out,
> > hundreds of blocked requests, etc. We have modified our settings and
> > watched the behavior match the ceph documentation for splitting, but
> > right now the subfolders are splitting outside of what the documentation
> > says they should.
> >
> > filestore_split_multiple * abs(filestore_merge_threshold) * 16
> >
> > Our filestore_merge_threshold is set to 40. When we had our
> > filestore_split_multiple set to 8, we were splitting subfolders when a
> > subfolder had (8 * 40 * 16 = ) 5120 objects in the directory. In a
> > different cluster we had to push that back again with elevated settings,
> > and the subfolders split when they had (16 * 40 * 16 = ) 10240 objects.
> >
> > We have another cluster that we're working with that is splitting at a
> > value that seems to be a hardcoded maximum. The settings are (32 * 40 *
> > 16 = ) 20480 objects before it should split, but it seems to be
> > splitting subfolders at 12800 objects.
> >
> > Normally I would expect this number to be a power of 2, but we recently
> > found another hardcoded maximum where the object map only allows RBDs
> > with a maximum of 256,000,000 objects in them. 12,800 fits the same
> > pattern -- a power of 2 followed by a string of zeros -- which looks
> > like another hardcoded maximum.
> >
> > Has anyone else encountered what seems to be a hardcoded maximum here?
> > Are we missing a setting elsewhere that is capping us or diminishing our
> > value? Much more to the point, though, is there any way to mitigate how
> > painful it is to split subfolders in PGs? So far it seems like the only
> > way is to push the setting up and later drop it back down during a week
> > when we plan to have our cluster plagued with blocked requests, all
> > while cranking up osd_heartbeat_grace so that we don't have flapping
> > OSDs.
> >
> > A little more about our setup: we have 32x 4TB HGST drives with 4x 200GB
> > Intel DC3710 journals (8 drives per journal), dual hyper-threaded
> > octa-core Xeons (32 virtual cores), 192GB memory, and redundant 10Gb
> > networking per storage node.
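(Until something like the logging PR above lands, one crude way to see how
close a PG's subfolders are to a split is to count files directly in the
filestore tree on an OSD. A rough sketch -- the OSD id and PG id are
placeholders, and the path layout is from memory:

    # per-directory object counts for one PG on a filestore OSD (read-only)
    PG_DIR=/var/lib/ceph/osd/ceph-12/current/3.1f_head
    find "$PG_DIR" -type d | while read -r d; do
        printf '%6d  %s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
    done | sort -rn | head

Directories approaching the split threshold discussed above show up at the
top of the list.)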
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com