I see. Maybe you want to look at these instructions. I don’t know if you are 
running Rook, but the point about keeping the container alive by running 
`sleep` in place of the daemon is important: you can then get into the 
container with `exec` and do what you need to.

https://rook.io/docs/rook/v1.4/ceph-disaster-recovery.html#restoring-mon-quorum
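
A minimal sketch of that trick, assuming a plain Docker host rather than Rook; 
the container name, image tag and bind mounts are illustrative, so adjust them 
(and the mon ID) to your deployment:

    # Stop the managed mon container so nothing holds the mon store open,
    # then start the same image idle, with sleep in place of the daemon.
    # "ceph-mon-node2" and "ceph/daemon:latest-nautilus" are assumed names.
    docker stop ceph-mon-node2
    docker run -d --name mon-debug \
        --entrypoint /bin/sleep \
        -v /var/lib/ceph:/var/lib/ceph \
        -v /etc/ceph:/etc/ceph \
        ceph/daemon:latest-nautilus infinity

    # Get a shell inside the idle container; the mon store is now free
    # for offline tools like ceph-mon --extract-monmap.
    docker exec -it mon-debug /bin/bash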

> On Oct 12, 2020, at 4:16 PM, Gaël THEROND <gael.ther...@bitswalk.com> wrote:
> 
> Hi Brian!
> 
> Thanks a lot for your quick answer, it was fast!
> 
> Yes, I’ve read this doc, yet I can’t perform the appropriate commands as my 
> OSDs are up and running.
> 
> As my mon runs in a container, if I try to use `ceph-mon --extract-monmap` 
> it won’t work, as the mon process is running; and if I stop it, the container 
> will be restarted and I’ll be kicked out of it.
> 
> I can’t retrieve anything from `ceph mon getmap` as the quorum isn’t forming.
> 
> Yep, I know that I would need three nodes, and a third node recently became 
> available for this lab.
> 
> Unfortunately it’s a lab cluster, so one of my colleagues just took the 
> third node for testing purposes... I told you, a series of unfortunate events 
> :-)
> 
> I can’t just get rid of the cluster, as I can’t lose the OSDs’ data.
> 
> G.
> 
> On Tue, Oct 13, 2020 at 00:01, Brian Topping <brian.topp...@gmail.com> wrote:
> Hi there!
> 
> This isn’t a difficult problem to fix. To be clear, the monmap is just one 
> part of the monitor database. You generally have all the details correct, 
> though.
> 
> Have you looked at the process in 
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap ?
> 
> Please do make sure you are working on the copy of the monitor database with 
> the newest epoch. After removing the other monitors and getting your cluster 
> back online, you can re-add monitors at will.
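> 
> A sketch of that documented extract/edit/inject sequence, assuming the 
> surviving mon’s ID is "node2" and the dead one is "node1" (illustrative 
> names; check /var/lib/ceph/mon for yours), run while the mon is stopped:
> 
>     ceph-mon -i node2 --extract-monmap /tmp/monmap   # dump the newest map
>     monmaptool --print /tmp/monmap                   # verify epoch and members
>     monmaptool /tmp/monmap --rm node1                # drop the dead monitor
>     ceph-mon -i node2 --inject-monmap /tmp/monmap    # write the edited map back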
> 
> Also note that a quorum is defined as "one-half the total number of nodes 
> plus one". In your case, quorum is defined by both nodes! Taking either down 
> would cause this problem. So you need an odd number of nodes to be able to 
> take a node down, for instance in a rolling upgrade.
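> 
> In numbers (my arithmetic, not a quote from the docs):
> 
>     quorum(n) = floor(n/2) + 1
>     quorum(2) = 2   # both mons required; losing either one loses quorum
>     quorum(3) = 2   # any single mon can safely be taken down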
> 
> Hope that helps! 
> 
> Brian
> 
> 
>> On Oct 12, 2020, at 3:54 PM, Gaël THEROND <gael.ther...@bitswalk.com> wrote:
>> 
>> Hi everyone,
>> 
>> Because of unfortunate events, I’ve got a container-based Ceph cluster
>> (Nautilus) in bad shape.
>> 
>> It’s a lab cluster whose control plane is made of only 2 nodes (I know
>> it’s bad :-)); each of these nodes runs a mon, a mgr and a rados-gw as
>> containerized ceph_daemon processes.
>> 
>> They were installed using ceph-ansible, if that’s relevant for anyone.
>> 
>> However, while I was performing an upgrade on the first node, the second
>> went down too (electrical power outage).
>> 
>> As soon as I saw that, I stopped every running process on the node being
>> upgraded.
>> 
>> For now, if I try to restart my second node, the cluster isn’t available,
>> as the quorum is looking for two nodes.
>> 
>> The container starts and the node elects itself as leader, but all ceph
>> commands are stuck forever, which is perfectly normal, as the quorum is
>> still waiting for the other member to complete the election process, etc.
>> 
>> So, my question is: as I can’t (to my knowledge) extract the monmap in
>> this intermediary state, and as my first node will still be considered a
>> known mon and will try to join back if started properly, can I just copy
>> /etc/ceph/ceph.conf and /var/lib/ceph/mon/<host>/keyring from the last
>> living node (the second one) into their places on the first node? My mon
>> keys were initially the same for both mons, and if I’m not making any
>> mistake, my first node, being blank, will create a default store, join the
>> existing cluster and retrieve the appropriate monmap from the remaining
>> node, right?
>> 
>> If not, is there a process for saving/extracting the monmap when using a
>> container-based Ceph? I can perfectly well exec into the remaining node if
>> it makes any difference.
>> 
>> Thanks a lot!
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
