On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson <bry...@giraffe-data.com>
wrote:

> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because
> the monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from
> it for 15 minutes ("mon osd report interval"), then out 10 minutes after
> that ("mon osd down out interval").
>
> And that's worst case.  Though details of how OSDs watch each other are
> vague, I suspect an existing OSD was supposed to detect the dead OSDs and
> report that to the monitor, which would believe it within about a minute
> and mark the OSDs down.  ("osd heartbeat interval", "mon osd min down
> reports", "mon osd min down reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California
>

So, if an OSD (osd.1) fails to answer three consecutive heartbeats (sent 6
seconds apart) from another OSD (osd.2), then the OSD sending the
heartbeats (osd.2) tells the monitor that osd.1 is down. It takes reports
from two OSDs in different CRUSH subtrees (host by default) for the monitor
to mark the OSD down. Each OSD is also supposed to report to the monitor
whenever there is a change, or at least every 120 seconds; if 600 seconds
pass without the monitor hearing from an OSD, it will mark that OSD down.
It 'should' only take about 20 seconds to detect a downed OSD.
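
To make the two paths concrete, here is a rough Python sketch of the
decision described above. It is only an illustration, not Ceph code: the
names FailureReport, peer_grace_expired and should_mark_down are made up,
and the constants are simply the numbers quoted in this message, not
values read from a cluster.

```python
from dataclasses import dataclass

# Numbers quoted in this thread (assumptions, not read from a live cluster).
HEARTBEAT_INTERVAL_S = 6     # seconds between heartbeats
GRACE_S = 20                 # roughly three unanswered heartbeats
MIN_REPORTER_SUBTREES = 2    # distinct CRUSH subtrees (hosts) that must report
REPORT_TIMEOUT_S = 600       # monitor-side timeout quoted above


@dataclass(frozen=True)
class FailureReport:
    reporter_osd: int    # OSD whose heartbeats went unanswered
    reporter_host: str   # CRUSH subtree (host) the reporter lives in
    target_osd: int      # OSD being reported as down


def peer_grace_expired(seconds_silent):
    """True once a peer has been silent long enough to be reported."""
    return seconds_silent >= GRACE_S


def should_mark_down(target_osd, reports, seconds_since_last_report):
    """Hypothetical monitor-side check combining both paths."""
    # Fast path: enough reporters from *different* subtrees.
    hosts = {r.reporter_host for r in reports if r.target_osd == target_osd}
    if len(hosts) >= MIN_REPORTER_SUBTREES:
        return True
    # Slow path: the monitor itself has not heard from the OSD for too long.
    return seconds_since_last_report >= REPORT_TIMEOUT_S
```

Two reports from OSDs on the same host do not count as two reporters in
this sketch, which is the point of the subtree rule: one struggling host
should not be able to get healthy OSDs elsewhere marked down by itself.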

Usually, the problem is that an OSD gets too busy and misses heartbeats, so
other OSDs wrongly report it as down.

If 'nodown' is set, then the monitor will not mark OSDs down.
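
The flag is toggled with "ceph osd set nodown" and "ceph osd unset nodown",
and it shows up on the "flags" line of plain-text "ceph osd dump" output.
A small Python sketch for checking it follows; the helper name osd_flags
and the SAMPLE text are invented for illustration, and the sample is not
real cluster output.

```python
def osd_flags(dump_text):
    """Return the set of flags found on the 'flags' line of the dump text."""
    for line in dump_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "flags":
            # Flags are comma-separated; the line may also carry none.
            return set(parts[1].split(",")) if len(parts) > 1 else set()
    return set()


# Made-up sample, trimmed to the lines that matter here.
SAMPLE = """epoch 100
flags nodown,sortbitwise
"""

if "nodown" in osd_flags(SAMPLE):
    print("nodown is set: the monitor will not mark OSDs down")
```

Against a real cluster, the text would come from running "ceph osd dump"
and passing its output to the function.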
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
