Yes, the mgrs are running as intended. It just seems that the mons and OSDs
don't recognize each other, because the monitor map is outdated.
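
One way to double-check that (reusing the commands quoted further down; the mon name and paths are taken from this thread): compare the monmap the revived mon is actually using with the mon addresses the OSDs were deployed with.

# inside the mon's cephadm shell/container, ideally with the mon stopped to avoid the rocksdb lock
ceph-monstore-tool /var/lib/ceph/mon/ceph-rgw2-06 get monmap -- --out /tmp/monmap
monmaptool --print /tmp/monmap        # note epoch and members
# on the OSD hosts: which mon addresses do the OSDs still point to?
grep mon_host /var/lib/ceph/<ceph_fsid>/osd.*/config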

On 2025-04-11 07:07, Eugen Block wrote:
Is at least one mgr running? PG states are reported by the mgr daemon.

Quoting Jonas Schwab <jonas.sch...@uni-wuerzburg.de>:

I solved the problem with executing ceph-mon: among other things, -i
mon.rgw2-06 was not the correct option; it should have been -i rgw2-06.
Unfortunately, that brought the next problem:
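
So the working invocation was presumably just the command quoted below with the "mon." prefix dropped from the -i argument:

ceph-mon -i rgw2-06 --public-addr 10.127.239.63 -f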

The cluster now shows "100.000% pgs unknown", which is probably because
the monitor data is not completely up to date, but rather in the state it
was in before I switched over to the other mons. A few minutes or so after
that, the cluster crashed and I lost the mons. I guess this outdated cluster
map is probably unusable? All services seem to be running fine and there
are no network obstructions.

Should I go with this instead?
https://docs.ceph.com/en/squid/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
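
If it helps, the core of that linked procedure is roughly the following sketch (the per-host loop, paths and keyring location are simplified placeholders; with cephadm the objectstore tool would presumably be run inside "cephadm shell --name osd.<id>" with the OSD stopped; see the doc for the full version):

ms=/root/mon-store
mkdir -p $ms
# scrape the cluster maps out of every (stopped) OSD into a temporary mon store
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path $ms
done
# rebuild a monitor store from the collected maps (needs admin + mon keys)
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring
# then replace the mon's store.db with $ms/store.db and fix the ownership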

I actually already tried the latter option, but ran into the error
`rocksdb: [db/db_impl/db_impl_open.cc:2086] DB::Open() failed: IO error:
while open a file for lock:
/var/lib/ceph/mon/ceph-ceph2-01/store.db/LOCK: Permission denied`,
even though I double-checked that the permissions and ownership on the
replacement store.db are properly set.
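
One thing worth checking in a cephadm setup (an assumption about this layout): the ceph user inside the mon container is typically UID/GID 167, so the copied store.db has to be owned by that numeric ID on the host, regardless of what the host's own ceph user maps to. Also make sure nothing still holds the rocksdb LOCK file. For example:

# host-side path of the mon's data dir in a cephadm layout
ls -ln /var/lib/ceph/<ceph_fsid>/mon.ceph2-01/store.db | head   # owner should show as 167
chown -R 167:167 /var/lib/ceph/<ceph_fsid>/mon.ceph2-01/store.db
fuser /var/lib/ceph/<ceph_fsid>/mon.ceph2-01/store.db/LOCK      # should report nothing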


On 2025-04-10 22:45, Jonas Schwab wrote:
I edited the monmap to include only rgw2-06 and then followed
https://docs.ceph.com/en/squid/rados/operations/add-or-rm-mons/#adding-a-monitor-manual
to create a new monitor.

Unfortunately, `ceph-mon -i mon.rgw2-06 --public-addr 10.127.239.63 -f`
crashed with the traceback seen in the attachment.
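
For reference, the manual-creation step from that doc boils down to roughly this (the keyring path is a placeholder; as it turned out further up-thread, the id has to be given without the "mon." prefix):

ceph-mon --mkfs -i rgw2-06 --monmap monmap --keyring /path/to/mon.keyring
ceph-mon -i rgw2-06 --public-addr 10.127.239.63 -f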

On 2025-04-10 20:34, Eugen Block wrote:
It depends a bit. Which mon do the OSDs still know about? You can
check /var/lib/ceph/<ceph_fsid>/osd.X/config to retrieve that piece of
information. I'd try to revive one of them.
Do you still have the mon store.db for all of the mons or at least one
of them? Just to be safe, back up all the store.db directories.
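
For example (cephadm host paths, with <ceph_fsid> and <mon> as placeholders):

grep mon_host /var/lib/ceph/<ceph_fsid>/osd.*/config
# back up every mon store before touching anything
cp -a /var/lib/ceph/<ceph_fsid>/mon.<mon>/store.db /root/store.db.<mon>.backup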

Then modify a monmap to contain only the one you want to revive by removing
the other ones. Back up your monmap files as well. Then inject the
modified monmap into the daemon and try starting it.
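
Sketched with the names from the epoch-29 monmap further down, assuming rgw2-06 is the one to keep (the mon must be stopped for the injection, and with cephadm this would be run inside the mon's shell/container):

monmaptool --rm ceph2-02 --rm rgw2-04 --rm rgw2-05 monmap
monmaptool --print monmap            # should now only list mon.rgw2-06
ceph-mon -i rgw2-06 --inject-monmap monmap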

Quoting Jonas Schwab <jonas.sch...@uni-wuerzburg.de>:

Again, thank you very much for your help!

The container is not there any more, but I discovered that the "old" mon
data still exists. I have the same situation for two mons I removed at
the same time:

$ monmaptool --print monmap1
monmaptool: monmap file monmap1
epoch 29
fsid 6d0d4ed4-0052-4eb9-9d9d-e6872ba7ee96
last_changed 2025-04-10T14:16:21.203171+0200
created 2021-02-26T14:02:29.522695+0100
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.127.239.2:3300/0,v1:10.127.239.2:6789/0] mon.ceph2-02
1: [v2:10.127.239.61:3300/0,v1:10.127.239.61:6789/0] mon.rgw2-04
2: [v2:10.127.239.63:3300/0,v1:10.127.239.63:6789/0] mon.rgw2-06
3: [v2:10.127.239.62:3300/0,v1:10.127.239.62:6789/0] mon.rgw2-05

$ monmaptool --print monmap2
monmaptool: monmap file monmap2
epoch 30
fsid 6d0d4ed4-0052-4eb9-9d9d-e6872ba7ee96
last_changed 2025-04-10T14:16:43.216713+0200
created 2021-02-26T14:02:29.522695+0100
min_mon_release 19 (unknown)
election_strategy: 1
0: [v2:10.127.239.61:3300/0,v1:10.127.239.61:6789/0] mon.rgw2-04
1: [v2:10.127.239.63:3300/0,v1:10.127.239.63:6789/0] mon.rgw2-06
2: [v2:10.127.239.62:3300/0,v1:10.127.239.62:6789/0] mon.rgw2-05

Would it be feasible to move the data from node1 (which still contains
node2 as mon) to node2, or would that just result in even more mess?


On 2025-04-10 19:57, Eugen Block wrote:
It can work, but it might be necessary to modify the monmap first,
since it's complaining that it has been removed from it. Are you
familiar with monmaptool
(https://docs.ceph.com/en/latest/man/8/monmaptool/)?

The procedure is similar to changing a monitor's IP address the "messy way"
(https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method).

I also wrote a blog post on how to do it with cephadm:
https://heiterbiswolkig.blogs.nde.ag/2020/12/18/cephadm-changing-a-monitors-ip-address/

But before changing anything, I'd inspect first what the current
status is. You can get the current monmap from within the mon
container (is it still there?):

cephadm shell --name mon.<mon>
ceph-monstore-tool /var/lib/ceph/mon/<your_mon> get monmap -- --out monmap
monmaptool --print monmap

You can paste the output here, if you want.

Quoting Jonas Schwab <jonas.sch...@uni-wuerzburg.de>:

I realized I have access to a data directory of a monitor I removed
just before the oopsie happened. Can I launch a ceph-mon from that? If I
try to just launch ceph-mon, it commits suicide:

2025-04-10T19:32:32.174+0200 7fec628c5e00 -1 mon.mon.ceph2-01@-1(???) e29 not in monmap and have been in a quorum before; must have been removed
2025-04-10T19:32:32.174+0200 7fec628c5e00 -1 mon.mon.ceph2-01@-1(???) e29 commit suicide!
2025-04-10T19:32:32.174+0200 7fec628c5e00 -1 failed to initialize

On 2025-04-10 16:01, Jonas Schwab wrote:
Hello everyone,

I believe I accidentally nuked all monitors of my cluster (please don't
ask how). Is there a way to recover from this disaster? I have a
cephadm setup.

I am very grateful for all help!

Best regards,
Jonas Schwab

--
Jonas Schwab

Research Data Management, Cluster of Excellence ct.qmat
https://data.ctqmat.de | datamanagement.ct.q...@listserv.dfn.de
Email: jonas.sch...@uni-wuerzburg.de
Tel: +49 931 31-84460
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
