This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works.
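For anyone who hits this thread later, the recovery boiled down to something like the sketch below. This is not a paste of our exact session -- it assumes the default /var/lib/ceph/mon/ceph-<id> data dir layout, mon IDs that match our hostnames, sysvinit/upstart service commands, and rsync over ssh as the copy mechanism -- so adjust those details to your own setup:

    # On each monitor node, stop the mon and back up its data dir before
    # touching anything (sysvinit form shown; on upstart boxes the
    # equivalent is "stop ceph-mon id=node-15"):
    service ceph stop mon.node-15
    cp -a /var/lib/ceph/mon/ceph-node-15 /root/ceph-mon-node-15.bak

    # On each broken mon (node-15 shown here), replace its store with a
    # copy of the one good store, node-14's (rsync over ssh assumed):
    rm -rf /var/lib/ceph/mon/ceph-node-15
    rsync -a node-14:/var/lib/ceph/mon/ceph-node-14/ /var/lib/ceph/mon/ceph-node-15/

    # Start the mons back up and wait -- it takes a few minutes for them
    # to chew through the new maps before a quorum forms:
    service ceph start mon.node-15
    ceph -s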
Thanks again! --Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil <s...@newdream.net> wrote:
> On Wed, 18 Mar 2015, Greg Chavez wrote:
> > We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> > availability several times since this past Thursday and whose nodes were
> > all rebooted twice (hastily and inadvisably each time). The final reboot,
> > which was supposed to be "the last thing" before recovery according to
> > our data center team, resulted in a failure of the cluster's 4 monitors.
> > This happened yesterday afternoon.
> >
> > [ By the way, we use Ceph to back Cinder and Glance in our OpenStack
> > Cloud, block storage only; also, these network problems were the result
> > of our data center team executing maintenance on our switches that was
> > supposed to be quick and painless ]
> >
> > After working all day on various troubleshooting techniques found here
> > and there, we have this situation on our monitor nodes (debug 20):
> >
> > node-10: dead. ceph-mon will not start.
> >
> > node-14: Seemed to rebuild its monmap. The log has stopped reporting;
> > this is the final tail -100: http://pastebin.com/tLiq2ewV
> >
> > node-16: Same as 14, with a similar outcome in the
> > log: http://pastebin.com/W87eT7Mw
> >
> > node-15: ceph-mon starts, but even at debug 20 it will only output this
> > line, over and over again:
> >
> > 2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> > AdminSocket: request 'mon_status' not defined
> >
> > node-02: I added this one to replace node-10. I updated ceph.conf and
> > pushed it to all the monitor nodes (the OSD nodes without monitors did
> > not get the config push). Since it's a new monitor the log output is
> > obviously different, but again, here are the last 50 lines:
> > http://pastebin.com/pfixdD3d
> >
> > I run my Ceph client from my OpenStack controller. All ceph -s shows me
> > is faults, albeit only to node-15:
> >
> > 2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >>
> > 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0
> > l=1).fault
> >
> > Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
> >
> > So that's where we stand. Did we kill our Ceph Cluster (and thus our
> > OpenStack Cloud)?
>
> Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.
>
> > Or is there hope? Any suggestions would be greatly appreciated.
>
> Stop all mons.
>
> Make a backup copy of each mon data dir.
>
> Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02.
>
> Start all mons, see if they form a quorum.
>
> Once things are working again, at the *very* least upgrade to dumpling,
> and preferably then upgrade to firefly!! Cuttlefish was EOL more than a
> year ago, and dumpling is EOL in a couple months.
>
> sage
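P.S. One thing that saved us some guesswork while the mons were restarting: each mon's local admin socket will report on that mon's own state even while ceph -s is still faulting. Something along these lines (default socket path assumed; swap in your own mon ID) -- apparently the same mon_status request node-15's admin socket was rejecting before its store was replaced:

    # Ask one monitor for its own view; a state of "probing" or "electing"
    # means it is still looking for peers, "peon" or "leader" means it has
    # joined the quorum.
    ceph --admin-daemon /var/run/ceph/ceph-mon.node-15.asok mon_status

    # Once that looks sane, confirm the quorum as a whole from a client:
    ceph quorum_status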