Hi,

You are not the first with this issue. If you are 146% sure that it is not a network (arp, ip, mtu, firewall) issue, I suggest removing this mon and deploying it again, or deploying it on another (unused) IP address.

Also, you can add --debug_ms=20; you should then see some "lossy channel" messages before the quorum join fails.
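If it comes to redeploying, a rough sketch for a cephadm/ceph orch managed cluster (as described in the original mail) could look like the commands below. mon.controller2 is the failing mon from the thread; 192.168.9.211 is only a made-up example of an unused address, so substitute your own, and check `ceph orch ls` first in case the mon placement spec would immediately redeploy the daemon on its own:

    # raise messenger debugging for the mons (this is where the "lossy channel" messages show up)
    ceph config set mon debug_ms 20

    # remove the stuck daemon; the probing log quoted below already shows it is not in monmap e16
    ceph orch daemon rm mon.controller2 --force

    # confirm only the three healthy mons remain in the monmap
    ceph mon dump

    # redeploy it, optionally pinned to a different, unused IP
    ceph orch daemon add mon controller2:192.168.9.211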
k

> On 29 Mar 2022, at 15:20, Thomas Bruckmann <thomas.bruckm...@softgarden.de> wrote:
>
> Hello again,
> I have now increased the debug level to the maximum for the mons and I still have no idea what the problem could be.
>
> So I will just print the debug log of the mon failing to join here, in the hope that someone can help me. In addition, the mon that is not joining seems to stay quite long in the probing phase; sometimes it switches to synchronizing, which seems to work, and after that it is back to probing.
>
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 sync_reset_requester
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 unregister_cluster_logger - not registered
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 monmap e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 _reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing).auth v46972 _set_mon_num_rank num 0 rank 0
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 timecheck_finish
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_tick_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_interval_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_event_cancel
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 reset_probe_timeout 0x55c46fbb8d80 after 2 seconds
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 probing other monitors
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4900 for mon.2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 entity_name global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.210:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.2 v2:192.168.9.210:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 peer name is controller5
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4b40 for mon.1
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 entity_name global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.209:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.1 v2:192.168.9.209:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 peer name is controller4
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:54.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700 4 mon.controller2@-1(probing) e16 probe_timeout 0x55c46fbb8d80
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
>
> Kind Regards,
> Thomas Bruckmann
> System Administrator Cloud Services
> E thomas.bruckm...@softgarden.de
> softgarden e-recruiting GmbH
> Tauentzienstraße 14 | 10789 Berlin
> https://softgarden.de/
> Limited liability company (GmbH), Amtsgericht Berlin-Charlottenburg HRB 114159 B | VAT ID: DE260440441 | Managing Directors: Mathias Heese, Stefan Schüffler, Claus Müller
>
>
> From: Thomas Bruckmann <thomas.bruckm...@softgarden.de>
> Date: Thursday, 24 March 2022 at 17:06
> To: ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: [ceph-users] Ceph Mon not able to authenticate
>
> Hello,
> We are running Ceph 16.2.6 and are having trouble with our mons; everything is managed via ceph orch and running in containers. Since we switched our firewall in the DC (which also does DNS), our ceph mon daemons are not able to authenticate when they are restarted.
>
> The error message in the monitor log is:
>
> debug 2022-03-24T14:25:12.716+0000 7fa0dc2df700 1 mon.2@-1(probing) e13 handle_auth_request failed to assign global_id
>
> What we have already tried to solve the problem:
>
> * Removed the mon fully from the node (including all artifacts in the FS)
> * Double-checked whether the mon is still in the monmap after removing it (it is not)
> * Added other mons (on hosts that were previously not mons) to ensure a unique and synced monmap, then tried adding the failing mon -> no success
> * Shut down a running mon (not one of the brand-new ones) and tried bringing it up again -> same error
>
> It does not seem to be an error with the monmap; however, manipulating the monmap manually is currently not possible, since the system is in production and we cannot shut down the whole FS.
>
> Another blog post (I cannot find the link anymore) said the problem could be related to DNS resolution somehow, i.e. that the DNS name behind the IP has changed. For each of our initial mons, a reverse lookup returns 3 different DNS names, and since we switched the firewall, the order in which those names are returned may have changed. I don't know if this could be the problem.
>
> Does anyone have an idea how to solve the problem?
>
> Kind Regards,
> Thomas Bruckmann

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
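A minimal sketch of how the DNS/firewall theory from the original mail could be checked from the failing mon's host; the addresses are the mon IPs quoted in the thread, while the ping payload size assumes a 9000-byte MTU and should be adapted to the actual mon network:

    # reverse lookups should return stable, consistent names on every mon host
    for ip in 192.168.9.206 192.168.9.209 192.168.9.210; do getent hosts "$ip"; done

    # the msgr2 (3300) and msgr1 (6789) ports must be reachable through the new firewall
    nc -zv 192.168.9.206 3300
    nc -zv 192.168.9.206 6789

    # a non-fragmenting, full-size ping catches MTU mismatches (8972 = 9000 - 28; use 1472 for MTU 1500)
    ping -M do -s 8972 -c 3 192.168.9.206

If those all look clean, `ceph mon dump` on a quorum member should confirm the removed mon is really absent from the current monmap epoch before it is added back.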