On Mon, Jul 1, 2019 at 8:56 PM Bryan Henderson <bry...@giraffe-data.com> wrote:
>
> > Normally in the case of a restart then somebody who used to have a
> > connection to the OSD would still be running and flag it as dead. But
> > if *all* the daemons in the cluster lose their soft state, that can't
> > happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).

I'm a little surprised; the peer OSDs used to detect this. But we've
reworked the heartbeat logic a few times, and the combination of losing
a whole data center's worth of daemons at once while also having no
monitoring to check that they come back up isn't actually that common.

Can you create a tracker ticket with the version you're seeing it on
and any non-default configuration options you've set?
-Greg

>
> I wonder if I could close this gap with additional monitoring of my own.  I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and marks those OSDs down.  It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson                                   San Jose, California
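
(A sketch of the bring-up check described above, for anyone who wants to
script it -- this is not anything built into Ceph. It assumes the admin
'ceph' CLI is available with a working keyring, and it uses
'ceph tell osd.N version' with an arbitrary timeout as the liveness probe;
any other check that distinguishes a live OSD from a dead one would do.)

#!/usr/bin/env python3
"""Sketch: after a whole-cluster power-on, find OSDs that the monitors
still show as "up" but that do not actually respond, and mark them down.
Assumes the admin 'ceph' CLI is installed with a working keyring."""
import json
import subprocess

PROBE_TIMEOUT = 10  # seconds to wait for an OSD to answer (arbitrary)

def ceph(*args):
    """Run a ceph CLI command and return its stdout as text."""
    return subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    ).stdout

def osd_responds(osd_id):
    """Probe one OSD; 'ceph tell osd.N version' only succeeds if the
    daemon is actually running and reachable."""
    try:
        subprocess.run(
            ["ceph", "tell", f"osd.{osd_id}", "version"],
            check=True, capture_output=True, timeout=PROBE_TIMEOUT,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

def main():
    osdmap = json.loads(ceph("osd", "dump", "--format", "json"))
    for osd in osdmap["osds"]:
        # up == 1 means the monitors still believe the OSD is alive.
        if osd["up"] == 1 and not osd_responds(osd["osd"]):
            print(f"osd.{osd['osd']} not responding; marking it down")
            ceph("osd", "down", str(osd["osd"]))

if __name__ == "__main__":
    main()
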
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
