Re: [ceph-users] osd down detection broken in jewel?

Gregory Farnum Mon, 12 Dec 2016 22:25:39 -0800

On Wed, Nov 30, 2016 at 8:31 AM, Manuel Lausch <manuel.lau...@1und1.de>
wrote:


> Yes. This parameter is used in the condition described there:
> http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/#osds-report-their-status
> and it works. I think the default timeout of 900s is quite large.
>
> The documentation also describes another function, which checks the health of
> OSDs and reports them down:
> http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/#osds-report-down-osds
>
> As far as I can see in the source code, this documentation is no longer valid!
> I found this commit:
> https://github.com/ceph/ceph/commit/bcb8f362ec6ac47c4908118e7860dec7971d001f#diff-0a5db46a44ae9900e226289a810f10e8
>
> "mon_osd_min_down_reporters" is now the threshold for how many
> "mon_osd_reporter_subtree_level" subtrees have to report an OSD as down. In
> Hammer this was how many other OSDs had to report. Hammer also had the
> parameter "mon_osd_min_down_reports", which set how often another OSD had
> to report an OSD as down. In Jewel this parameter no longer exists.
>
> With this knowledge I adjusted my configuration and will now test it.
>
>
> BTW:
> While reading the source code I may have found another bug. Can you confirm
> this?
> In the function "OSDMonitor::check_failure" in src/mon/OSDMonitor.cc,
> the code which counts the "reporters_by_subtree" is inside the if block "if
> (g_conf->mon_osd_adjust_heartbeat_grace) {". So if I disable
> adjust_heartbeat_grace, the reporters_by_subtree functionality will not
> work at all.
>
>
Yes, I think you're correct and that's a (fairly nasty, to somebody
someday) bug. Can you create a ticket at tracker.ceph.com? :)
-Greg
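
For illustration, the structure described in the quoted report can be modeled in a few lines of Python. This is a simplified sketch, not Ceph's actual C++ code: the function names, the `count_inside_adjust_block` switch, and the sample CRUSH locations are hypothetical, the fallback to a per-OSD key is an assumption, the default values are chosen for the example, and the heartbeat grace timing is left out entirely.

```python
# Simplified model of the reporter counting discussed above.
# Hypothetical names throughout; this is not the code from src/mon/OSDMonitor.cc.

# Hypothetical CRUSH locations: OSD id -> {bucket type: bucket name}
LOCATIONS = {
    1: {"host": "node-a"},
    2: {"host": "node-a"},
    3: {"host": "node-b"},
}


def count_reporter_subtrees(reporters, locations, subtree_level):
    """Return the distinct subtrees (for example hosts) that sent a failure report."""
    subtrees = set()
    for osd_id in reporters:
        location = locations.get(osd_id, {})
        # Assumption: a reporter without the requested level counts on its own.
        subtrees.add(location.get(subtree_level, f"osd.{osd_id}"))
    return subtrees


def enough_reporters(reporters, locations, subtree_level="host",
                     min_down_reporters=2, adjust_heartbeat_grace=True,
                     count_inside_adjust_block=True):
    """Check only the reporter threshold; grace-period timing is omitted."""
    subtrees = set()
    if adjust_heartbeat_grace:
        # Grace scaling would happen here (omitted in this sketch).
        if count_inside_adjust_block:
            # Reported layout: counting only happens when grace adjustment is on.
            subtrees = count_reporter_subtrees(reporters, locations, subtree_level)
    if not count_inside_adjust_block:
        # Alternative layout: counting is independent of grace adjustment.
        subtrees = count_reporter_subtrees(reporters, locations, subtree_level)
    return len(subtrees) >= min_down_reporters
```

With reports from OSDs 1, 2 and 3, the reported layout reaches the threshold only while `adjust_heartbeat_grace` is true. With it set to false the set of subtrees stays empty, so a threshold of 2 can never be met, which matches the behavior described in the report.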

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
