This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works.
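For anyone who hits this thread later, the recovery boiled down to something like the sketch below. This is not a paste of our exact session -- it assumes the default /var/lib/ceph/mon/ceph-<id> data dir layout, mon IDs that match our hostnames, sysvinit/upstart service commands, and rsync over ssh as the copy mechanism -- so adjust those details to your own setup:

    # On each monitor node, stop the mon and back up its data dir before
    # touching anything (sysvinit form shown; on upstart boxes the
    # equivalent is "stop ceph-mon id=node-15"):
    service ceph stop mon.node-15
    cp -a /var/lib/ceph/mon/ceph-node-15 /root/ceph-mon-node-15.bak

    # On each broken mon (node-15 shown here), replace its store with a
    # copy of the one good store, node-14's (rsync over ssh assumed):
    rm -rf /var/lib/ceph/mon/ceph-node-15
    rsync -a node-14:/var/lib/ceph/mon/ceph-node-14/ /var/lib/ceph/mon/ceph-node-15/

    # Start the mons back up and wait -- it takes a few minutes for them
    # to chew through the new maps before a quorum forms:
    service ceph start mon.node-15
    ceph -s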
Thanks again! --Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil <s...@newdream.net> wrote:
> On Wed, 18 Mar 2015, Greg Chavez wrote:
> > We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> > availability several times since this past Thursday and whose nodes were
> > all rebooted twice (hastily and inadvisably each time). The final reboot,
> > which was supposed to be "the last thing" before recovery according to
> > our data center team, resulted in a failure of the cluster's 4 monitors.
> > This happened yesterday afternoon.
> >
> > [ By the way, we use Ceph to back Cinder and Glance in our OpenStack
> > Cloud, block storage only; also, these network problems were the result
> > of our data center team executing maintenance on our switches that was
> > supposed to be quick and painless ]
> >
> > After working all day on various troubleshooting techniques found here
> > and there, we have this situation on our monitor nodes (debug 20):
> >
> > node-10: dead. ceph-mon will not start.
> >
> > node-14: Seemed to rebuild its monmap. The log has stopped reporting;
> > this is the final tail -100: http://pastebin.com/tLiq2ewV
> >
> > node-16: Same as 14, with a similar outcome in the
> > log: http://pastebin.com/W87eT7Mw
> >
> > node-15: ceph-mon starts, but even at debug 20 it will only output this
> > line, over and over again:
> >
> > 2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> > AdminSocket: request 'mon_status' not defined
> >
> > node-02: I added this one to replace node-10. I updated ceph.conf and
> > pushed it to all the monitor nodes (the OSD nodes without monitors did
> > not get the config push). Since it's a new monitor the log output is
> > obviously different, but again, here are the last 50 lines:
> > http://pastebin.com/pfixdD3d
> >
> > I run my Ceph client from my OpenStack controller. All ceph -s shows me
> > is faults, albeit only to node-15:
> >
> > 2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >>
> > 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0
> > l=1).fault
> >
> > Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
> >
> > So that's where we stand. Did we kill our Ceph Cluster (and thus our
> > OpenStack Cloud)?
>
> Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.
>
> > Or is there hope? Any suggestions would be greatly appreciated.
>
> Stop all mons.
>
> Make a backup copy of each mon data dir.
>
> Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02.
>
> Start all mons, see if they form a quorum.
>
> Once things are working again, at the *very* least upgrade to dumpling,
> and preferably then upgrade to firefly!! Cuttlefish was EOL more than a
> year ago, and dumpling is EOL in a couple months.
>
> sage
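P.S. One thing that saved us some guesswork while the mons were restarting: each mon's local admin socket will report on that mon's own state even while ceph -s is still faulting. Something along these lines (default socket path assumed; swap in your own mon ID) -- apparently the same mon_status request node-15's admin socket was rejecting before its store was replaced:

    # Ask one monitor for its own view; a state of "probing" or "electing"
    # means it is still looking for peers, "peon" or "leader" means it has
    # joined the quorum.
    ceph --admin-daemon /var/run/ceph/ceph-mon.node-15.asok mon_status

    # Once that looks sane, confirm the quorum as a whole from a client:
    ceph quorum_status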