Hi Eugen,
  Thanks for your response.

  Out of interest, what I've done overnight is stop all the daemons that were 
still running (OSDs and RGWs) - so I'm just dealing with the 3 
MONs now. Trying different start sequences, I can determine:

* mon3 was the last one working
* Starting mon1 will kill mon3 (and prevent it from starting) with the crash 
mentioned in the original email
* Similarly, starting mon2 will kill both mon1 and mon3 in the same way
* Only mon3 gets the fast spamming of "e6 handle_auth_request failed to assign 
global_id" log messages when it's running.
* Dumping the monmap results in the same file on all 3 mons.

As for your suggestion of reducing the monmap to 1 node and rebuilding, we were 
also thinking of heading down that path. I'm hoping that deploying a temporary 
4th mon on a new node might get two mons running (without killing the "old" 
one) - probably pairing it with mon3, since that's likely the most up-to-date. 
If that works, we could try clobbering and redeploying the other two "old" mon 
daemons and removing the temporary one to get back to the original 3 mons. As 
you say: keeping their original IP addresses (one of the clients is 
OpenStack/RBD, which can be sentimental about mon IPs).
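
For anyone following along, the monmap-reduction route would look roughly like 
this. This is only a sketch: the mon ID "mon3" as survivor, the other IDs, the 
store path and the systemd unit names are my assumptions - check them against 
`ceph mon dump` and your deployment before running anything, and make sure 
every mon is stopped first.

```shell
# Sketch only - assumes mon3 is the survivor; mon IDs, paths and
# unit names are assumptions to verify against your own cluster.
systemctl stop ceph-mon@mon3          # all mons must be down first

# Back up the mon store before touching anything
cp -a /var/lib/ceph/mon/ceph-mon3 /root/mon3-store-backup

# Extract the current monmap from mon3's store
ceph-mon -i mon3 --extract-monmap /tmp/monmap

# Inspect it, then drop the other two mons
monmaptool --print /tmp/monmap
monmaptool /tmp/monmap --rm mon1 --rm mon2

# Inject the single-mon map back and restart the survivor
ceph-mon -i mon3 --inject-monmap /tmp/monmap
systemctl start ceph-mon@mon3
```

The backup copy is the important part - it lets us start over from the 
original three-mon stores if the reduced map doesn't form quorum.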

I'm just in a bit of decision paralysis about which mon to take as the 
survivor. All of them can run _individually_, but only mon2 will survive a 
group start. mon3 was the last one working, but it has the mysterious "failed 
to assign global_id" errors. I'm leaning toward using mon3... or mon2.

Thanks for listening,

M0les.


On Wed, 18 Jun 2025, at 17:04, Eugen Block wrote:
> Hi,
> 
> correct, SUSE's Ceph product was Salt-based, in this case 14.2.22 was  
> shipped with SES 6. ;-)
> 
> Do you also have some mon logs from right before the crash, maybe with  
> a higher debug level? It could make sense to stop client traffic and  
> OSDs as well to be able to recover. But unfortunately, I can't really  
> comment on the stack trace.
> 
> Maybe someone has a different idea, but if you get one MON up, I would  
> probably reduce the monmap to 1 MON to bring the cluster back up. Back  
> up all the MON stores, just in case you have to start over. Then  
> extract the monmap, remove all but one, and inject the modified monmap  
> into the MON you want to revive. The procedure is described here [0].  
> Just don't change the address but only reduce the monmap. ;-)
> 
> Regards,
> Eugen
> 
> [0]  
> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method
> 
> Zitat von Miles Goodhew <c...@m0les.com>:
> 
> > Hi,
> >   I've been called in by a client with an ancient SUSE-based Ceph  
> > Nautilus (14.2.22) cluster whose MONs keep dying oddly.
> >   Apparently the issue started with the MDS daemons not working, and  
> > eventually a MON restart killed the cluster.
> >
> > OS: SLES 15-SP1 (out of support)
> > Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
> > 3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
> > Client services: "One of everything at least", but RBD/Openstack,  
> > S3/RGW and CephFS are big ones.
> >
> >   After sorting out some of the logs here are some things I know:  
> > Disk space, RAM availability, inodes and network connectivity seem  
> > OK to me. After shutting-down all the MONs, MGRs and MDSes, one MON  
> > can usually be started, but it sits there spamming-out log messages  
> > like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign  
> > global_id" (maybe 50 - 100 times per second). All the while the  
> > syslog shows "e6 get_health_metrics reporting [INCREASING_NUMBER]  
> > slow ops" fairly often. This is probably due to OSDs and clients  
> > being active.
> >
> >   If I restart one of the other MONs, the running one will die with  
> > a stack trace at (Limiting to C++/library internal calls):
> >
> > ```
> > 8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
> > 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
> > 10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
> > 11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
> > 12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
> > 13: (Context::complete(int)+0x9) [0x55b441c888a9]
> > 14: (void finish_contexts<std::__cxx11::list<Context*,  
> > std::allocator<Context*> > >(CephContext*,  
> > std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8)  
> > [0x55b441cb2408]
> > 15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
> > 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f)  
> > [0x55b441d693df]
> > 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233)  
> > [0x55b441d69e23]
> > 18:  
> > (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668)  
> > [0x55b441c820b8]
> > 19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
> > 20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
> > 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message>  
> > const&)+0x26) [0x55b441cb00b6]
> > 22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
> > 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
> > 24: (()+0x8539) [0x7f2a59db7539]
> > 25: (clone()+0x3f) [0x7f2a58f87ecf]
> > ```
> >
> > Anyone got any clues about how to diagnose or better-yet repair this?
> >
> > Sorry, I know this is a bit half-baked, but I'm trying to dump this  
> > help request at COB to see if I can hook anyone's interest overnight.
> >
> > Thanks for at least reading this far,
> >
> > M0les.
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 