I could probably put together a wip branch if you have a test cluster you could try it out on. -Sam
On Thu, Jan 19, 2017 at 2:27 PM, David Turner <david.tur...@storagecraft.com> wrote:
> To be clear, we are willing to change to a snap_trim_sleep of 0 and try to manage it with the other available settings... but it is sounding like that won't really work for us since our main op thread(s) will just be saturated with snap trimming almost all day. We currently only have ~6 hours/day where our snap trim queues are empty.
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of David Turner [david.tur...@storagecraft.com]
> Sent: Thursday, January 19, 2017 3:25 PM
> To: Samuel Just; Nick Fisk
> Cc: ceph-users
> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> We are a couple of weeks away from upgrading to Jewel in our production clusters (after months of testing in our QA environments), but this might prevent us from making the migration from Hammer. We delete ~8,000 snapshots/day between 3 clusters and our snap_trim_q gets up to about 60 million in each of those clusters. We have to use an osd_snap_trim_sleep of 0.25 to prevent our clusters from falling on their faces during our big load, and 0.1 the rest of the day to catch up on the snap trim queue.
>
> Is our setup possible to use on Jewel?
>
> ________________________________________
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Samuel Just [sj...@redhat.com]
> Sent: Thursday, January 19, 2017 2:45 PM
> To: Nick Fisk
> Cc: ceph-users
> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Yeah, I think you're probably right. The answer is probably to add an explicit rate-limiting element to the way the snaptrim events are scheduled.
> -Sam
>
> On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> > I will give those both a go and report back, but the more I think about this the less I'm convinced that it's going to help.
> >
> > I think the problem is a general IO imbalance: there is probably something like 100+ times more trimming IO than client IO, and so even if client IO gets promoted to the front of the queue by Ceph, once it hits the Linux IO layer it's fighting for itself.
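
For reference, the two settings Nick says he will give a go here are the ones Sam suggests further down the thread (osd_snap_trim_cost = 16777216 and osd_pg_max_concurrent_snap_trims = 1). A minimal sketch of applying them at runtime, assuming a Jewel-era cluster -- osd.0 below is only an example id, and the values are simply the ones discussed in this thread, not a recommendation:

    # push the new values to every OSD without restarting anything
    ceph tell osd.* injectargs '--osd_snap_trim_cost 16777216 --osd_pg_max_concurrent_snap_trims 1'

    # spot-check on one OSD that the change took effect
    ceph daemon osd.0 config get osd_snap_trim_cost
    ceph daemon osd.0 config get osd_pg_max_concurrent_snap_trims

The same options can be made persistent under the [osd] section of ceph.conf so they survive OSD restarts.
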
> > I guess this approach works with scrubbing as each read IO has to wait to be read before the next one is submitted, so the queue can be managed on the OSD. With trimming, writes can buffer up below what the OSD controls.
> >
> > I don't know if the snap trimming goes nuts because the journals are acking each request and the spinning disks can't keep up, or if it's something else. Does WBThrottle get involved with snap trimming?
> >
> > But from an underlying disk perspective, there is definitely more than 2 snaps per OSD at a time going on, even if the OSD itself is not processing more than 2 at a time. I think there either needs to be another knob so that Ceph can throttle back snaps, not just de-prioritise them, or there needs to be a whole new kernel interface where an application can priority-tag individual IOs for CFQ to handle, instead of the current limitation of priority per thread. I realise this is probably very, very hard or impossible, but it would allow Ceph to control IO queues right down to the disk.
> >
> >> -----Original Message-----
> >> From: Samuel Just [mailto:sj...@redhat.com]
> >> Sent: 19 January 2017 18:58
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: Dan van der Ster <d...@vanderster.com>; ceph-users <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
> >>
> >> Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the default value, equal to a 16MB IO) and osd_pg_max_concurrent_snap_trims to 1 (from 2)?
> >> -Sam
> >>
> >> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> > Hi Sam,
> >> >
> >> > Thanks for the confirmation on both which thread the trimming happens in and for confirming my suspicion that sleeping is now a bad idea.
> >> >
> >> > The problem I see is that even with setting the priority for trimming down low, it still seems to completely swamp the cluster. The trims seem to get submitted asynchronously, which seems to leave all my disks sitting at queue depths of 50+ for several minutes until the snapshot is removed, often also causing several OSDs to get marked out and start flapping. I'm using WPQ but haven't changed the cutoff variable yet, as I know you are working on fixing a bug with that.
> >> >
> >> > Nick
> >> >
> >> >> -----Original Message-----
> >> >> From: Samuel Just [mailto:sj...@redhat.com]
> >> >> Sent: 19 January 2017 15:47
> >> >> To: Dan van der Ster <d...@vanderster.com>
> >> >> Cc: Nick Fisk <n...@fisk.me.uk>; ceph-users <ceph-users@lists.ceph.com>
> >> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
> >> >>
> >> >> Snaptrimming is now in the main op threadpool along with scrub, recovery, and client IO. I don't think it's a good idea to use any of the _sleep configs anymore -- the intention is that by setting the priority low, they won't actually be scheduled much.
> >> >> -Sam
> >> >>
> >> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <d...@vanderster.com> wrote:
> >> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> >> >> Hi Dan,
> >> >> >>
> >> >> >> I carried out some more testing after doubling the op threads; it may have had a small benefit as potentially some threads are available, but latency still sits more or less around the configured snap sleep time.
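
As an aside on the queue and threadpool knobs mentioned above (WPQ, the cutoff variable, op threads): the values an OSD is actually running with can be read straight off its admin socket. A sketch using the Jewel-era option names, with osd.0 purely as an example id -- double-check the names against your release:

    # which op queue implementation and cutoff the OSD is using
    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off

    # op threadpool sizing (total op threads = shards x threads per shard)
    ceph daemon osd.0 config get osd_op_num_shards
    ceph daemon osd.0 config get osd_op_num_threads_per_shard
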
> >> >> >> Even more threads might help, but I suspect you are just lowering the chance of IOs getting stuck behind the sleep, rather than actually solving the problem.
> >> >> >>
> >> >> >> I'm guessing that when the snap trimming was in the disk thread you wouldn't have noticed these sleeps, but now that it's in the op thread it will just sit there holding up all IO and be a lot more noticeable. It might be that this option shouldn't be used with Jewel+?
> >> >> >
> >> >> > That's a good thought -- so we need confirmation which thread is doing the snap trimming. I honestly can't figure it out from the code -- hopefully a dev could explain how it works.
> >> >> >
> >> >> > Otherwise, I don't have much practical experience with snap trimming in jewel yet -- our RBD cluster is still running 0.94.9.
> >> >> >
> >> >> > Cheers, Dan
> >> >> >
> >> >> >>> -----Original Message-----
> >> >> >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
> >> >> >>> Sent: 13 January 2017 20:38
> >> >> >>> To: 'Dan van der Ster' <d...@vanderster.com>
> >> >> >>> Cc: 'ceph-users' <ceph-users@lists.ceph.com>
> >> >> >>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
> >> >> >>>
> >> >> >>> We're on Jewel and you're right, I'm pretty sure the snap stuff is also now handled in the op thread.
> >> >> >>>
> >> >> >>> The dump historic ops socket command showed a 10s delay at the "Reached PG" stage; from Greg's response [1], it would suggest that it isn't the OSD itself that is blocking, but the PG that is currently sleeping whilst trimming. I think in the former case it would have a high time on the "Started" part of the op? Anyway, I will carry out some more testing with higher osd op threads and see if that makes any difference. Thanks for the suggestion.
> >> >> >>>
> >> >> >>> Nick
> >> >> >>>
> >> >> >>> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
> >> >> >>>
> >> >> >>> > -----Original Message-----
> >> >> >>> > From: Dan van der Ster [mailto:d...@vanderster.com]
> >> >> >>> > Sent: 13 January 2017 10:28
> >> >> >>> > To: Nick Fisk <n...@fisk.me.uk>
> >> >> >>> > Cc: ceph-users <ceph-users@lists.ceph.com>
> >> >> >>> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
> >> >> >>> >
> >> >> >>> > Hammer or jewel? I've forgotten which thread pool is handling the snap trim nowadays -- is it the op thread yet? If so, perhaps all the op threads are stuck sleeping? Just a wild guess. (Maybe increasing # op threads would help?)
> >> >> >>> >
> >> >> >>> > -- Dan
> >> >> >>> >
> >> >> >>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> >> >>> > > Hi,
> >> >> >>> > >
> >> >> >>> > > I had been testing some higher values with the osd_snap_trim_sleep variable to try and reduce the impact of removing RBD snapshots on our cluster, and I have come across what I believe to be a possible unintended consequence. The value of the sleep seems to keep the lock on the PG open so that no other IO can use the PG whilst the snap removal operation is sleeping.
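
The "dump historic ops" check Nick refers to above goes through the OSD admin socket. A rough sketch of reproducing it (osd.0 is only an example id, and the exact event names in the output differ slightly between releases, so look for a large gap in the per-op event timeline around the reached_pg and started events):

    # slowest recent ops on this OSD, with a timestamped event history for each
    ceph daemon osd.0 dump_historic_ops

    # the in-flight view is also useful while a large snapshot is being trimmed
    ceph daemon osd.0 dump_ops_in_flight
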
> >> >> >>> > > > >> >> >>> > > I had set the variable to 10s to completely minimise the > >> >> >>> > > impact as I had some multi TB snapshots to remove and noticed > >> >> >>> > > that suddenly all IO to the cluster had a latency of roughly > >> >> >>> > > 10s as well, all the > >> >> >>> > dumped ops show waiting on PG for 10s as well. > >> >> >>> > > > >> >> >>> > > Is the osd_snap_trim_sleep variable only ever meant to be > >> >> >>> > > used up to say a max of 0.1s and this is a known side effect, > >> >> >>> > > or should the lock on the PG be removed so that normal IO can > >> >> >>> > > continue during the > >> >> >>> > sleeps? > >> >> >>> > > > >> >> >>> > > Nick > >> >> >>> > > > >> >> >>> > > _______________________________________________ > >> >> >>> > > ceph-users mailing list > >> >> >>> > > ceph-users@lists.ceph.com > >> >> >>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>> > >> >> >>> _______________________________________________ > >> >> >>> ceph-users mailing list > >> >> >>> ceph-users@lists.ceph.com > >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >> > >> >> > _______________________________________________ > >> >> > ceph-users mailing list > >> >> > ceph-users@lists.ceph.com > >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > > > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com