On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson <bry...@giraffe-data.com> wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").
>
> And that's worst case.  Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California

So, if an OSD (osd.1) fails to answer three heartbeats (sent 6 seconds apart)
from another OSD (osd.2), then the OSD sending the heartbeats (osd.2) tells
the monitor that osd.1 is down. It takes reports from two OSDs in different
CRUSH subtrees (host by default) for the monitor to mark the OSD down.

The OSD is also supposed to report to the monitor each time there is a change,
or every 120 seconds. If 600 seconds pass with the monitor not hearing from
the OSD, it will mark it down.

It 'should' only take 20 seconds to detect a downed OSD. Usually, the problem
is that an OSD gets too busy and misses heartbeats, so other OSDs wrongly
report it down. If 'nodown' is set, then the monitor will not mark OSDs down.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
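
For anyone who wants the timing above spelled out, here is a rough Python
sketch of the decision as described in this thread. It is not Ceph's actual
code: the function names (peer_detection_delay, should_mark_down) and the
host names are made up for illustration, and the numbers are simply the ones
quoted in this message, so check them against your own cluster's settings.

```python
# Illustrative model only; NOT Ceph source code.
# Values are the ones quoted in this thread, not verified defaults.
HEARTBEAT_INTERVAL_S = 6    # seconds between OSD-to-OSD heartbeats
MISSED_HEARTBEATS = 3       # unanswered heartbeats before a peer reports
MIN_DOWN_REPORTERS = 2      # reporters needed, from different subtrees (hosts)
REPORT_TIMEOUT_S = 600      # fallback: monitor hears nothing from the OSD


def peer_detection_delay(interval=HEARTBEAT_INTERVAL_S,
                         missed=MISSED_HEARTBEATS):
    """Seconds until a peer OSD would report a silent OSD to the monitor."""
    return interval * missed


def should_mark_down(reports, nodown=False,
                     min_reporters=MIN_DOWN_REPORTERS):
    """Decide whether the monitor would mark an OSD down.

    reports: iterable of (reporter_osd, reporter_host) pairs.
    Reporters on the same host count as one, since the subtree
    level is assumed to be "host".
    """
    if nodown:
        # With the 'nodown' flag set, the monitor never marks OSDs down.
        return False
    distinct_hosts = {host for _osd, host in reports}
    return len(distinct_hosts) >= min_reporters


if __name__ == "__main__":
    # 3 heartbeats x 6 seconds = 18, in line with the ~20 seconds above.
    print(peer_detection_delay())
    # Two reporters on the same host are not enough.
    print(should_mark_down([("osd.2", "hostA"), ("osd.3", "hostA")]))
    # Two reporters on different hosts are.
    print(should_mark_down([("osd.2", "hostA"), ("osd.4", "hostB")]))
```

The point of counting distinct hosts rather than distinct OSDs is that a
single misbehaving host should not be able to get a healthy OSD marked down
on its own.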
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com