Re: [ceph-users] Disaster recovery of monitor

Jose Tavares Tue, 17 Nov 2015 09:15:59 -0800

Hi guys ...

Thanks a lot for your support.
I discovered what happened.


I had 2 monitors, osnode01 and osnode02.
I tried to add a 3rd by using ceph-deploy.

ceph-deploy was using a key different from the one used by my monitor
cluster.

So, I added osnode08 to the monitor cluster and it did not become part of
the quorum.
I removed it, and removed osnode02. The monitor count should be an odd
number.
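
For anyone reading along, the odd-number rule comes from the monitors needing a strict majority of the monitors listed in the monmap to form a quorum. A minimal sketch of that arithmetic follows; the function names are hypothetical helpers for illustration, not a Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors` (hypothetical helper, not a Ceph API)."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be down while a majority is still reachable."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Columns: monitors in the monmap, quorum size, failures tolerated.
    for n in (1, 2, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

With 2 monitors both must be reachable, and 4 monitors tolerate no more failures than 3 do, which is why odd counts are usually preferred.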

When I did that, my Ceph cluster stopped.
I re-added osnode02 to the monitor cluster.
The thing is that I added it using the wrong key. I don't know why ceph-deploy
started using a different key.
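
A key mismatch like this can be spotted by comparing the `[mon.]` entry in the two keyring files. The sketch below is only an illustration: the thread does not say which files were involved, so the idea of comparing a monitor's own keyring against the one ceph-deploy keeps is an assumption, and the key strings in the usage example are placeholders, not real keys.

```python
from typing import Optional


def read_key(keyring_text: str, entity: str = "mon.") -> Optional[str]:
    """Return the `key = ...` value under `[entity]`, or None if absent."""
    current = None
    for raw in keyring_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
        elif current == entity and "=" in line:
            # Split on the first "=" only; base64 keys usually end in "=".
            name, _, value = line.partition("=")
            if name.strip() == "key":
                return value.strip()
    return None


def same_mon_key(keyring_a: str, keyring_b: str) -> bool:
    """True only if both keyrings carry an identical [mon.] key."""
    key_a, key_b = read_key(keyring_a), read_key(keyring_b)
    return key_a is not None and key_a == key_b


if __name__ == "__main__":
    # Placeholder contents; in practice read the two keyring files from disk.
    running = '[mon.]\n\tkey = PLACEHOLDERKEY1==\n\tcaps mon = "allow *"\n'
    deployed = '[mon.]\n\tkey = PLACEHOLDERKEY2==\n\tcaps mon = "allow *"\n'
    print("keys match:", same_mon_key(running, deployed))
```

If the two keys differ, a new monitor created from the second keyring would not be able to authenticate against the existing ones.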

As suggested by Joao Eduardo, by disabling auth I could bring part of Ceph up.
After that I troubleshot the key problem, solved it, and now my whole
cluster is recovered and running fine ...
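
The change Joao suggested is a single line in ceph.conf, `auth supported = none`. As a rough sketch, this is one way to set it in the `[global]` section of the config text. The helper name is hypothetical and the sample config is made up. The sketch only edits text; it does not restart anything.

```python
def set_global_option(conf_text: str, name: str, value: str) -> str:
    """Set `name = value` in the [global] section of ceph.conf-style text.

    Hypothetical helper for illustration. Spaces and underscores in option
    names are treated as equivalent, as Ceph itself does.
    """
    out = []
    in_global = False
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            if in_global and not done:
                # Leaving [global] without finding the option: add it here.
                out.append(f"{name} = {value}")
                done = True
            in_global = stripped[1:-1].strip() == "global"
        elif in_global and not done and "=" in stripped:
            key = stripped.partition("=")[0].strip()
            if key.replace("_", " ") == name.replace("_", " "):
                line = f"{name} = {value}"
                done = True
        out.append(line)
    if not done:
        if not in_global:
            out.append("[global]")
        out.append(f"{name} = {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "[global]\nfsid = example-fsid\nauth supported = cephx\n"
    print(set_global_option(sample, "auth supported", "none"))
```

The monitors need a restart afterwards, and since this was meant as a temporary measure, the setting should be reverted once the keys are sorted out.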

Thanks a lot.
Jose Tavares


On Tue, Nov 17, 2015 at 11:13 AM, Jose Tavares <j...@terra.com.br> wrote:

> Now I tried to inject the latest map I had.
> Also, I created a second monitor on osnode02, like I had before, using the
> same map.
> I started both monitors ...
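
The message above does not list the exact commands used. As an illustration only, the extract / inspect / edit / inject sequence from the add-or-rm-mons documentation linked further down the thread can be sketched as a dry run that prints the commands without running them. The map path `/tmp/monmap` and the choice of which monitor to remove are assumptions, and the monitor has to be stopped before extracting or injecting.

```python
import shlex
from typing import List, Sequence


def monmap_commands(mon_id: str, map_path: str,
                    remove: Sequence[str] = ()) -> List[List[str]]:
    """Build (but do not run) the commands to extract, inspect, edit and
    re-inject a monmap on the monitor `mon_id`. Hypothetical helper."""
    cmds = [
        # Dump the monmap from the stopped monitor's store.
        ["ceph-mon", "-i", mon_id, "--extract-monmap", map_path],
        # Show which monitors and addresses the map contains.
        ["monmaptool", "--print", map_path],
    ]
    for name in remove:
        # Drop monitors that should no longer be part of the cluster.
        cmds.append(["monmaptool", map_path, "--rm", name])
    # Write the edited map back into the monitor's store.
    cmds.append(["ceph-mon", "-i", mon_id, "--inject-monmap", map_path])
    return cmds


if __name__ == "__main__":
    for argv in monmap_commands("osnode01", "/tmp/monmap", remove=["osnode08"]):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the map before injecting it is a cheap way to check that it lists only the monitors that are expected to exist.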
>
> Logs from osnode01 show my content ... and then it started to show lines
> like
>
> 2015-11-17 10:56:26.515069 7fc73af67700  0 mon.osnode01@0(probing).data_health(1) update_stats avail 19% total 220 GB, used 178 GB, avail 43178 MB
>
> What does that mean?
> Attached are the logs.
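
As far as I can tell, that line is the monitor's periodic report of free space on the filesystem holding its data store, and the more telling part is the state in parentheses: `probing` means the monitor is still looking for its peers and has not joined a quorum. A small sketch that pulls the fields apart is shown below. The regular expression is written against this one log line and is an assumption about the format, not something taken from the Ceph source.

```python
import re
from typing import Optional

LINE = ("2015-11-17 10:56:26.515069 7fc73af67700  0 "
        "mon.osnode01@0(probing).data_health(1) "
        "update_stats avail 19% total 220 GB, used 178 GB, avail 43178 MB")

# Pattern fitted to the sample line above (assumed format).
PATTERN = re.compile(
    r"mon\.(?P<name>[^@\s]+)@(?P<rank>-?\d+)\((?P<state>[^)]+)\)\.data_health"
    r".*?avail (?P<pct>\d+)% total (?P<total>\d+) GB, "
    r"used (?P<used>\d+) GB, avail (?P<avail>\d+) MB"
)


def parse_data_health(line: str) -> Optional[dict]:
    """Extract monitor name, rank, state and disk figures, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "rank": int(match["rank"]),
        "state": match["state"],
        "avail_pct": int(match["pct"]),
        "total_gb": int(match["total"]),
        "used_gb": int(match["used"]),
        "avail_mb": int(match["avail"]),
    }


if __name__ == "__main__":
    info = parse_data_health(LINE)
    print(info)
    # Cross-check the reported percentage against the raw figures.
    print(int(info["avail_mb"] / 1024 / info["total_gb"] * 100), "% free")
```

The figures are consistent with each other: 43178 MB free out of roughly 220 GB is about 19%.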
>
> Thanks a lot.
> Jose Tavares
>
> On Tue, Nov 17, 2015 at 10:33 AM, Jose Tavares <j...@terra.com.br> wrote:
>
>>
>>
>> On Tue, Nov 17, 2015 at 7:27 AM, Joao Eduardo Luis <j...@suse.de> wrote:
>>
>>> On 11/17/2015 03:56 AM, Jose Tavares wrote:
>>> > The problem is that I think I don't have any good monitor anymore.
>>> > How do I know if the map I am trying is ok?
>>> >
>>> > I also saw in the logs that the primary mon was trying to contact a
>>> > removed mon at IP .112 .. So, I added .112 again ... and it didn't help.
>>> >
>>> > Attached are the logs of what is going on and some monmaps that I
>>> > captured from minutes before the cluster became inaccessible ..
>>> >
>>> > Should I try injecting these monmaps into my primary mon to see if it can
>>> > recover the cluster?
>>> > Is it possible to see if these monmaps match my content?
>>>
>>> Without access to the actual store.db there's no way to ascertain if the
>>> store has any problems, and even then figuring out a potential
>>> corruption from just one monitor store.db would either be impossible or
>>> impractical.
>>>
>>
>> I posted my store.db in my previous answer ..
>>
>>
>>
>>>
>>> That said, from the log you attached it seems you only have issues with
>>> authentication: you have pgmaps from epoch 91923 through to 92589, you
>>> have an mds map (epoch 38), osdmaps at least through epoch 307, and 40
>>> versions for the auth keys.
>>>
>>> Somehow, though, your monitors are unable to authenticate each other. No
>>> way to tell if that was corruption or user error.
>>>
>>> You should be able to get your monitors back to speaking terms again
>>> simply by disabling cephx temporarily. Then you can figure out whatever
>>> you need to figure out in terms of monitor keys.
>>>
>>> Just update your ceph.conf with 'auth supported = none' and restart the
>>> monitors. See how it goes from there.
>>>
>>
>> I tried your suggestion and it didn't make any change to the results .. :(
>>
>> Thanks a lot.
>> Jose Tavares
>>
>>
>>
>>> HTH
>>>
>>>   -Joao
>>>
>>>
>>>
>>> >
>>> > Thanks a lot.
>>> > Jose Tavares
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Nov 16, 2015 at 10:48 PM, Nathan Harper
>>> > <nathan.har...@cfms.org.uk <mailto:nathan.har...@cfms.org.uk>> wrote:
>>> >
>>> >     I had to go through a similar process when we had a disaster which
>>> >     destroyed one of our monitors.   I followed the process here:
>>> >     REMOVING MONITORS FROM AN UNHEALTHY CLUSTER
>>> >     <http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/>
>>> >     to remove all but one monitor, which let me bring the cluster back up.
>>> >
>>> >     As you are running an older version of Ceph than hammer, some of the
>>> >     commands might differ (perhaps this might help:
>>> >     http://docs.ceph.com/docs/v0.80/rados/operations/add-or-rm-mons/)
>>> >
>>> >
>>> >     --
>>> >     *Nathan Harper*// IT Systems Architect
>>> >
>>> >     *e: * nathan.har...@cfms.org.uk <mailto:nathan.har...@cfms.org.uk>
>>> >     // *t: * 0117 906 1104 // *m: * 07875 510891 // *w: *
>>> >     www.cfms.org.uk <http://www.cfms.org.uk%22> // LinkedIn
>>> >     <http://uk.linkedin.com/pub/nathan-harper/21/696/b81>
>>> >     CFMS Services Ltd// Bristol & Bath Science Park // Dirac Crescent //
>>> >     Emersons Green // Bristol // BS16 7FR
>>> >
>>> >     CFMS Services Ltd is registered in England and Wales No 05742022 - a
>>> >     subsidiary of CFMS Ltd
>>> >     CFMS Services Ltd registered office // Victoria House // 51 Victoria
>>> >     Street // Bristol // BS1 6AD
>>> >
>>> >     On 16 November 2015 at 16:50, Jose Tavares <j...@terra.com.br
>>> >     <mailto:j...@terra.com.br>> wrote:
>>> >
>>> >         Hi guys ...
>>> >         I need some help as my cluster seems to be corrupted.
>>> >
>>> >         I saw here ..
>>> >         https://www.mail-archive.com/ceph-users@lists.ceph.com/msg01919.html
>>> >         .. a msg from 2013 where Peter had a problem with his monitors.
>>> >
>>> >         I had the same problem today when trying to add a new monitor,
>>> >         and then playing with monmap as the monitors were not entering
>>> >         the quorum. I'm using version 0.80.8.
>>> >
>>> >         Right now my cluster won't start because of a corrupted monitor.
>>> >         Is it possible to remove all monitors and create just a new one
>>> >         without losing data? I have ~260GB of data with work from 2 weeks.
>>> >
>>> >         What should I do? Do you recommend any specific procedure?
>>> >
>>> >         Thanks a lot.
>>> >         Jose Tavares
>>> >
>>> >         _______________________________________________
>>> >         ceph-users mailing list
>>> >         ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> >         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>>>
>>
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com