Hi,

You are not the first with this issue.
If you are 146% sure that this is not a network (ARP, IP, MTU, firewall) issue,
I suggest removing this mon and deploying it again, or deploying it on another
(unused) IP address.
Also, you can add --debug_ms=20; you should then see some "lossy channel"
messages before the quorum join fails.
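As a rough sketch (daemon names taken from the thread; the spare IP is a placeholder you would replace):

```shell
# Raise messenger debugging on the failing mon at runtime:
ceph config set mon.controller2 debug_ms 20

# If the mon is cephadm-managed, remove it and redeploy it
# (optionally on a different, unused address):
ceph orch daemon rm mon.controller2 --force
ceph orch daemon add mon controller2:192.168.9.211   # placeholder IP
```

These are operational commands against a live cluster, so adjust names and addresses to your environment before running them.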


k

> On 29 Mar 2022, at 15:20, Thomas Bruckmann <thomas.bruckm...@softgarden.de> 
> wrote:
> 
> Hello again,
> I have now increased the debug level to the maximum for the mons, and I still 
> have no idea what the problem could be.
> 
> So I am posting the debug log of the mon that fails to join here, in the hope 
> that someone can help me. In addition, the mon that is not joining stays in 
> the probing phase for quite a long time; sometimes it switches to 
> synchronizing, which seems to work, and after that it is back to probing.
> 
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 bootstrap
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 sync_reset_requester
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 unregister_cluster_logger - not registered
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 monmap e16: 3 mons at 
> {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 _reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing).auth v46972 _set_mon_num_rank num 0 rank 0
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 timecheck_finish
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 
> mon.controller2@-1(probing) e16 health_tick_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 
> mon.controller2@-1(probing) e16 health_interval_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 scrub_event_cancel
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 scrub_reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 reset_probe_timeout 0x55c46fbb8d80 after 2 
> seconds
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 probing other monitors
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 
> mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4900 
> for mon.2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 
> mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon 
> command= read addr v2:192.168.9.210:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing 
> grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 
> 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 
> paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe_reply mon.2 
> v2:192.168.9.210:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 
> name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) 
> mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  monmap is e16: 3 mons at 
> {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer name is controller5
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 
> 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my 
> addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 
> mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4b40 
> for mon.1
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 
> mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon 
> command= read addr v2:192.168.9.209:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing 
> grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 
> 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 
> paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe_reply mon.1 
> v2:192.168.9.209:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 
> name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) 
> mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  monmap is e16: 3 mons at 
> {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer name is controller4
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 
> 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 
> mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my 
> addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:54.453+0000 7f81c2014700 10 
> mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.453+0000 7f81c2014700 10 
> mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700  4 
> mon.controller2@-1(probing) e16 probe_timeout 0x55c46fbb8d80
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700 10 
> mon.controller2@-1(probing) e16 bootstrap
> 
> Kind Regards,
> Thomas Bruckmann
> Systemadministrator Cloud Dienste
> E: thomas.bruckm...@softgarden.de
> softgarden e-recruiting GmbH
> Tauentzienstraße 14 | 10789 Berlin
> https://softgarden.de/
> Gesellschaft mit beschränkter Haftung, Amtsgericht Berlin-Charlottenburg
> HRB 114159 B | USt-ID: DE260440441 | Geschäftsführer: Mathias Heese, Stefan 
> Schüffler, Claus Müller
> 
> 
> From: Thomas Bruckmann <thomas.bruckm...@softgarden.de>
> Date: Thursday, 24 March 2022 at 17:06
> To: ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: [ceph-users] Ceph Mon not able to authenticate
> Hello,
> We are running Ceph 16.2.6; everything is managed via ceph orch and runs in 
> containers. Since we switched our firewall in the DC (which also provides 
> DNS), our ceph mon daemons are not able to authenticate when they are 
> restarted.
> 
> The error message in the monitor log is:
> 
> debug 2022-03-24T14:25:12.716+0000 7fa0dc2df700 1 mon.2@-1(probing) e13 
> handle_auth_request failed to assign global_id
> 
> What we already tried to solve the problem:
> 
>  *   Removed the mon fully from the node (including all artifacts in the FS)
>  *   Double-checked whether the mon is still in the monmap after removing it 
> (it is not)
>  *   Added other mons (on nodes that were previously not mons) to ensure a 
> unique and synced monmap, then tried adding the failing mon -> no success
>  *   Shut down a running mon (not one of the brand-new ones) and tried 
> bringing it up again -> same error
> 
> It does not seem to be an error with the monmap; however, manipulating the 
> monmap manually is currently not possible, since the system is in production 
> and we cannot shut down the whole FS.
> 
> Another blog post (I cannot find the link anymore) said the problem could be 
> related to DNS resolution somehow, i.e. that the DNS name behind the IP may 
> have changed. For each of our initial mons, three different DNS names are 
> returned on a reverse lookup; since we switched the firewall, the order in 
> which those names are returned may have changed. I do not know whether this 
> could be the problem.
> 
> Does anyone have an idea how to solve the problem?
> 
> Kind Regards,
> Thomas Bruckmann
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
