[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-17 Thread Manuel Lausch
Hi Dan, I opened the ticket last Friday: https://tracker.ceph.com/issues/46978 Manuel On Fri, 14 Aug 2020 17:49:55 +0200 Dan van der Ster wrote: > I think the best course of action would be to open a tracker ticket > with details about your environment and your observations, then the > dev

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Dan van der Ster
I think the best course of action would be to open a tracker ticket with details about your environment and your observations, then the devs could try to see if something was overlooked with this change. -- dan On Fri, Aug 14, 2020 at 5:48 PM Manuel Lausch wrote: > > Hi, > > I thought the "fail"

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Manuel Lausch
Hi, I thought the "fail" needs to be propagated as well. Am I wrong? Who can have a look at whether a markdown message in "fast shutdown" mode is a possibility? I do not have the expertise to say if this would break something else. But if this is possible I would vote for this. Thanks Manuel On Fri, 1

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Dan van der Ster
Hi, I suppose the idea is that it's quicker to fail via the connection refused setting than by waiting for an osdmap to be propagated across the cluster. It looks simple enough in OSD.cc to also send the down message to the mon even with fast shutdown enabled. But I don't have any clue if that wo

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Manuel Lausch
Hi Dan, thank you for the link. I read it as well as the linked conversation in the rook project. I don't understand why the fast shutdown should be better than the "normal" shutdown, in which the OSD announces its shutdown directly. Are there cases where the shutdown of the OSD takes longer until its

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Dan van der Ster
There's a bit of discussion on this at the original PR: https://github.com/ceph/ceph/pull/31677 Sage claims the IO interruption should be smaller with osd_fast_shutdown than without. -- dan On Fri, Aug 14, 2020 at 10:08 AM Manuel Lausch wrote: > > Hi Dan, > > stopping a single OSD took mostly 1

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-14 Thread Manuel Lausch
Hi Dan, stopping a single OSD mostly took 1 to 2 seconds between the stop and the first report in ceph.log. Stopping a whole node, in this case 24 OSDs, took 5 to 7 seconds in most cases. After the report, peering begins, but this is quite fast. Since I have the fast shutdown disabled. Th

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-13 Thread Dan van der Ster
OK I just wanted to confirm you hadn't extended the osd_heartbeat_grace or similar. On your large cluster, what is the time from stopping an osd (with fast shutdown enabled) to: cluster [DBG] osd.317 reported immediately failed by osd.202 -- dan On Thu, Aug 13, 2020 at 4:38 PM Manuel Lausch
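The delay Dan asks about can be read off the cluster log. Below is a minimal Python sketch that looks for the "reported immediately failed" message quoted above and computes the gap to a known stop time. The layout of the sample line (an ISO timestamp as the first whitespace-separated token) and both function names are assumptions for illustration, not the actual ceph.log format or any Ceph tooling.

```python
import re
from datetime import datetime

# Matches the message quoted in the thread, e.g.
# "cluster [DBG] osd.317 reported immediately failed by osd.202"
FAILED_RE = re.compile(r"osd\.(\d+) reported immediately failed by osd\.(\d+)")


def first_failure_report(lines, osd_id):
    """Return (timestamp_text, reporter_id) for the first report about osd_id.

    Assumption: each log line starts with an ISO timestamp token.
    Returns None when no matching line is found.
    """
    for line in lines:
        match = FAILED_RE.search(line)
        if match and int(match.group(1)) == osd_id:
            return line.split()[0], int(match.group(2))
    return None


def seconds_between(stop_time, report_time):
    """Seconds elapsed from the OSD stop to the first failure report."""
    start = datetime.fromisoformat(stop_time)
    end = datetime.fromisoformat(report_time)
    return (end - start).total_seconds()
```

To use it, note the time at which the OSD is stopped, pass the lines of the log to `first_failure_report`, and feed both timestamps to `seconds_between`.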

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-13 Thread Manuel Lausch
Hi Dan, The only settings in my ceph.conf related to down/out and peering are these:
mon osd down out interval = 1800
mon osd down out subtree limit = host
mon osd min down reporters = 3
mon osd reporter subtree level = host
The cluster has 44 hosts with 24 OSDs each. Manuel On Thu, 13 Aug 2
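To put these numbers in context, here is a small Python sketch, not taken from the thread, that works out the cluster size and illustrates how the reporter settings are commonly understood: with the reporter subtree level set to host, failure reports are counted per distinct host, and at least 3 are needed. The function names and the host names in the usage note are hypothetical, and the check is a simplification of what the monitors do.

```python
# Values as given in the message above.
SETTINGS = {
    "mon osd down out interval": "1800",
    "mon osd down out subtree limit": "host",
    "mon osd min down reporters": "3",
    "mon osd reporter subtree level": "host",
}
HOSTS = 44
OSDS_PER_HOST = 24


def total_osds(hosts, osds_per_host):
    """Total number of OSDs in the cluster."""
    return hosts * osds_per_host


def enough_reporters(reporter_hosts, settings):
    """Simplified check: are there enough distinct reporting hosts?

    Assumption: reporters are counted once per host because the
    reporter subtree level is 'host'.
    """
    needed = int(settings["mon osd min down reporters"])
    return len(set(reporter_hosts)) >= needed


def down_out_minutes(settings):
    """The down/out interval is given in seconds; convert to minutes."""
    return int(settings["mon osd down out interval"]) / 60
```

For example, `enough_reporters(["host-a", "host-a", "host-b"], SETTINGS)` is false, because two of the three reports come from the same host.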

[ceph-users] Re: osd fast shutdown provokes slow requests

2020-08-13 Thread Dan van der Ster
Hi Manuel, Just to clarify -- do you override any of the settings related to peer down detection, such as heartbeat periods, timeouts, min down reporters, or anything like that? Cheers, Dan On Thu, Aug 13, 2020 at 3:46 PM Manuel Lausch wrote: > > Hi, > > I investigated another problem with my nau