> Normally, in the case of a restart, somebody who used to have a
> connection to the OSD would still be running and would flag it as dead.
> But if *all* the daemons in the cluster lose their soft state, that
> can't happen.

OK, thanks.  I guess that explains it.  But that's a pretty serious design
flaw, isn't it?  What I experienced is a pretty common failure mode: a power
outage took down the entire cluster at once, and when power came back, some
OSDs didn't come back up (the most common time for a server to fail is at
startup).

I wonder if I could close this gap with additional monitoring of my own.  I
could have a cluster bringup protocol that detects OSD processes that still
aren't running some time after startup and marks those OSDs down.  It would
be cleaner, though, if I could just find out from the monitor which OSDs are
in the map but not connected to the monitor cluster.  Is that possible?
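
To make the bringup idea concrete, here is a rough sketch of what I have in
mind.  It is untested against a real cluster and rests on assumptions: that
"ceph osd dump --format json" reports each OSD with "osd" and "up" fields,
that "ceph osd down <id>" is the right way to mark one down, and that I
collect the set of OSD IDs whose processes are actually running by some
per-host check of my own (running_ids below is that hypothetical input).
The function names are mine, not anything Ceph provides.

```python
import json
import subprocess


def stale_up_osds(osd_dump, running_ids):
    """Return IDs the OSD map shows as up but whose process is not running.

    osd_dump    -- parsed JSON from `ceph osd dump --format json`
                   (assumed to hold an "osds" list with "osd" and "up" keys)
    running_ids -- set of OSD IDs confirmed running by my own host checks
                   (hypothetical input; Ceph does not supply this)
    """
    return sorted(
        entry["osd"]
        for entry in osd_dump.get("osds", [])
        if entry.get("up") == 1 and entry["osd"] not in running_ids
    )


def mark_down_commands(osd_ids):
    """Build one `ceph osd down <id>` command line per stale OSD."""
    return [["ceph", "osd", "down", str(osd_id)] for osd_id in osd_ids]


def run_bringup_check(running_ids, dry_run=True):
    """Fetch the OSD map and mark down (or just print) the stale OSDs."""
    raw = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    stale = stale_up_osds(json.loads(raw), running_ids)
    for cmd in mark_down_commands(stale):
        if dry_run:
            print(" ".join(cmd))  # show what would be run, change nothing
        else:
            subprocess.check_call(cmd)
    return stale
```

I would run run_bringup_check() with dry_run=True first, some minutes after
power-on, and only let it issue the commands once the list looks right.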

A related question: If I mark an OSD down administratively, does it stay down
until I give a command to mark it back up, or will the monitor detect signs of
life and declare it up again on its own?

-- 
Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
