On Tue, Nov 17, 2015 at 6:32 AM, Wido den Hollander <w...@42on.com> wrote:
> On 11/17/2015 04:56 AM, Jose Tavares wrote:
> > The problem is that I think I don't have any good monitor anymore.
> > How do I know if the map I am trying is ok?
>
> How do you mean there is no good monitor? Did you encounter a disk
> failure or something?

No. Describing in detail what I did ...

After adding and removing monitors a few times to try to get them into the
quorum, I ended up with just one mon and a monmap that still reflected three
monitors. The only remaining monitor got stuck.

After that, I took the monmap, removed the monitors and injected the monmap
back, following this suggestion:
http://docs.ceph.com/docs/v0.78/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster

The mon was stopped by the time I committed the changes.

> > I also saw in the logs that the primary mon was trying to contact a
> > removed mon at IP .112 .. So, I added .112 again ... and it didn't help.
>
> "Added" again? You started that monitor?

Yes. After starting the monitor I saw the logs still pointing to .112, so I
started .112 again ..

> > Attached are the logs of what is going on and some monmaps that I
> > captured minutes before the cluster became inaccessible ..
>
> Isn't there a huge timedrift somewhere? Failing cephx authorization can
> also point at a huge timedrift on the clients and OSDs. Are you sure the
> time is correct?

The only timedrift could be from the injected monmap from some minutes
before ...

> > Should I try to inject these monmaps in my primary mon to see if it can
> > recover the cluster?
> > Is it possible to see if these monmaps match my content?
>
> The monmaps probably didn't change that much. But a good Monitor also
> has the PGMaps, OSDMaps, etc. You need a lot more than just a monmap.
>
> But check the time first on those machines.

Times are ok ...

About store.db, I have the following ..

osnode01:/var/lib/ceph/mon/ceph-osnode01 # ls -lR *
-rw-r--r-- 1 root root       0 Nov 16 14:58 done
-rw-r--r-- 1 root root      77 Nov  3 17:43 keyring
-rw-r--r-- 1 root root       0 Nov  3 17:43 systemd

store.db:
total 2560
-rw-r--r-- 1 root root 2105177 Nov 16 19:09 004629.ldb
-rw-r--r-- 1 root root  250057 Nov 16 19:09 004630.ldb
-rw-r--r-- 1 root root  215396 Nov 16 19:36 004632.ldb
-rw-r--r-- 1 root root     282 Nov 16 19:42 004637.ldb
-rw-r--r-- 1 root root   17428 Nov 16 19:54 004640.ldb
-rw-r--r-- 1 root root       0 Nov 17 10:21 004653.log
-rw-r--r-- 1 root root      16 Nov 17 10:21 CURRENT
-rw-r--r-- 1 root root       0 Nov  3 17:43 LOCK
-rw-r--r-- 1 root root     311 Nov 17 10:21 MANIFEST-004652
osnode01:/var/lib/ceph/mon/ceph-osnode01 #

My concern is about this log line ....

2015-11-17 10:11:16.143864 7f81e14aa700 0 mon.osnode01@0(probing).data_health(0) update_stats avail 19% total 220 GB, used 178 GB, avail 43194 MB

I used to have 7TB of available space, with 263G of content replicated to
~800G ..
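If I understand that update_stats line right, it reports the filesystem
holding the mon's own store rather than the cluster as a whole, so I will
also compare it against the mon data dir directly:

  df -h /var/lib/ceph/mon/ceph-osnode01
  du -sh /var/lib/ceph/mon/ceph-osnode01/store.db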
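Going back to the monmap surgery above: for reference, what I ran followed
roughly the steps from the doc I linked. The mon names other than osnode01
are placeholders here, so treat this as a sketch of the procedure rather
than my exact history:

  # stop the surviving mon first (service/systemctl, whichever applies)
  # then extract the current monmap from its store
  ceph-mon -i osnode01 --extract-monmap /tmp/monmap
  # drop the dead/removed mons from the map (placeholder names)
  monmaptool /tmp/monmap --rm osnode02
  monmaptool /tmp/monmap --rm osnode03
  # inject the edited map back, then start the mon again
  ceph-mon -i osnode01 --inject-monmap /tmp/monmap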
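On whether the monmaps I captured match what the mon has now: monmaptool
should be able to print both, so I was thinking of comparing them with
something like this (the paths are just examples):

  monmaptool --print /tmp/monmap.captured
  ceph-mon -i osnode01 --extract-monmap /tmp/monmap.current   # with the mon stopped
  monmaptool --print /tmp/monmap.current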
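And on the clocks, this is roughly how I checked them (host names are
placeholders; ntpq only tells you something where ntpd is running):

  for h in osnode01 osnode02 osnode03; do ssh $h date; done
  ntpq -p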
Thanks a lot ..
Jose Tavares

> Wido
>
> > Thanks a lot.
> > Jose Tavares
> >
> > On Mon, Nov 16, 2015 at 10:48 PM, Nathan Harper
> > <nathan.har...@cfms.org.uk> wrote:
> >
> >     I had to go through a similar process when we had a disaster which
> >     destroyed one of our monitors. I followed the process here:
> >     REMOVING MONITORS FROM AN UNHEALTHY CLUSTER
> >     <http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/>
> >     to remove all but one monitor, which let me bring the cluster back up.
> >
> >     As you are running an older version of Ceph than hammer, some of the
> >     commands might differ (perhaps this might help:
> >     http://docs.ceph.com/docs/v0.80/rados/operations/add-or-rm-mons/)
> >
> >     --
> >     Nathan Harper // IT Systems Architect
> >     e: nathan.har...@cfms.org.uk // t: 0117 906 1104 // m: 07875 510891
> >     w: www.cfms.org.uk // http://uk.linkedin.com/pub/nathan-harper/21/696/b81
> >     CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
> >     Emersons Green // Bristol // BS16 7FR
> >
> >     On 16 November 2015 at 16:50, Jose Tavares <j...@terra.com.br> wrote:
> >
> >         Hi guys ...
> >         I need some help as my cluster seems to be corrupted.
> >
> >         I saw here ..
> >         https://www.mail-archive.com/ceph-users@lists.ceph.com/msg01919.html
> >         .. a msg from 2013 where Peter had a problem with his monitors.
> >
> >         I had the same problem today when trying to add a new monitor,
> >         and then playing with the monmap as the monitors were not
> >         entering the quorum. I'm using version 0.80.8.
> >
> >         Right now my cluster won't start because of a corrupted monitor.
> >         Is it possible to remove all monitors and create just a new one
> >         without losing data? I have ~260GB of data with two weeks of work.
> >
> >         What should I do? Do you recommend any specific procedure?
> >
> >         Thanks a lot.
> >         Jose Tavares
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com