Adding the right dev list.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, May 20, 2020 at 12:40 AM Robert LeBlanc <rob...@leblancnet.us> wrote:
>
> We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed
> that op behavior has changed. This is an HDD cluster (NVMe journals and NVMe
> CephFS metadata pool) with about 800 OSDs. On Jewel, running WPQ with the
> high cut-off, it was rock solid. Recoveries barely dented the client ops,
> and when the client ops on the cluster went down the backfills would run as
> fast as the cluster could go. I could have max_backfills set to 10 and the
> cluster performed admirably.
>
> After upgrading to Nautilus the cluster struggles with any kind of recovery,
> and if there is any significant client write load the cluster can get into a
> death spiral. Even heavy client write bandwidth (3-4 GB/s) on its own can
> trigger heartbeat check failures, blocked IO, and even unresponsive OSDs.
>
> As the person who wrote the WPQ code initially, I know that it was fair and
> proportional to the op priority, and in Jewel it worked. It's not working in
> Nautilus. I've tweaked a lot of things trying to troubleshoot the issue, and
> setting the recovery priority to 1 or zero barely makes any difference. My
> best guess is that the op priority is getting lost before reaching the WPQ
> scheduler, which is therefore not prioritizing and dispatching ops
> correctly. It's almost as if all ops are being treated the same and there is
> no priority at all.
>
> Unfortunately, I do not have the time to set up the dev/testing environment
> to track this down and we will be moving away from Ceph. But I really like
> Ceph and want to see it succeed. I strongly suggest that someone look into
> this because I think it will resolve a lot of problems people have had on the
> mailing list. I'm not sure if a bug was introduced with the other queues that
> touches more of the op path, or if something in the op path restructuring
> changed how things work (I know that was being discussed around the time that
> Jewel was released). But my guess is that it is somewhere between the op
> being created and being received into the queue.
>
> I really hope that this helps in the search for this regression. I spent a
> lot of time studying the issue to come up with WPQ and saw it work great when
> I switched this cluster from PRIO to WPQ. I've also spent countless hours
> studying how it's changed in Nautilus.
>
> Thank you,
> Robert LeBlanc
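
For readers who have not looked at the scheduler, the "fair and proportional
to the op priority" behaviour described in the quoted message can be
illustrated with a small toy model. This is only a sketch under stated
assumptions, not Ceph's implementation: the class name, the cut-off value of
196, and the example priorities (63 for client ops, 3 for recovery ops) are
chosen here for illustration, and the model leaves out everything else the
real scheduler does. Ops at or above the cut-off are served strictly first;
everything else is picked at random with probability proportional to its
priority.

```python
import random
from collections import defaultdict, deque


class WeightedPriorityQueueSketch:
    """Toy model of a weighted priority queue (hypothetical, not Ceph code)."""

    def __init__(self, cutoff, seed=None):
        self.cutoff = cutoff                 # assumed strict/weighted boundary
        self.strict = defaultdict(deque)     # priority -> FIFO of ops
        self.weighted = defaultdict(deque)   # priority -> FIFO of ops
        self.rng = random.Random(seed)

    def enqueue(self, priority, op):
        if priority <= 0:
            raise ValueError("priority must be positive in this sketch")
        target = self.strict if priority >= self.cutoff else self.weighted
        target[priority].append(op)

    def dequeue(self):
        # Strict ops always go first, highest priority first.
        if self.strict:
            return self._pop(self.strict, max(self.strict))
        if not self.weighted:
            raise IndexError("dequeue from empty queue")
        # Pick a non-empty priority bucket with probability proportional
        # to its priority value, then serve that bucket in FIFO order.
        prios = list(self.weighted)
        chosen = self.rng.choices(prios, weights=prios, k=1)[0]
        return self._pop(self.weighted, chosen)

    @staticmethod
    def _pop(buckets, prio):
        op = buckets[prio].popleft()
        if not buckets[prio]:
            del buckets[prio]  # drop empty buckets so they are never chosen
        return op


def simulate(client_prio=63, recovery_prio=3, draws=10000, seed=0):
    """Count which kind of op is served while both kinds are backlogged."""
    queue = WeightedPriorityQueueSketch(cutoff=196, seed=seed)
    for _ in range(draws):
        queue.enqueue(client_prio, "client")
        queue.enqueue(recovery_prio, "recovery")
    counts = {"client": 0, "recovery": 0}
    for _ in range(draws):
        counts[queue.dequeue()] += 1
    return counts


if __name__ == "__main__":
    print(simulate())
```

With both kinds of op backlogged, the client share in this model should sit
near 63/66, and lowering the recovery priority should shrink the recovery
share further. If priorities were lost before reaching the queue, as the
quoted message suspects, every op would land in the same bucket and changing
the recovery priority would have no visible effect.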