Hi everyone,

I figure it's time to pull in more brain power on this one.  We had an NVMe 
mostly die in one of our monitors, and it caused that machine's write latency 
to spike.  Ceph did the RightThing(tm): once the monitor on that machine fell 
out of quorum, it was ignored.  I pulled the bad drive out of the array and 
tried to bring the mon and mgr back in (our monitors double-duty as managers).

The manager came up with zero problems, but the monitor got stuck probing.  

I removed the bad host from the monmap and stood up a new one on an OSD node to 
get back to 3 active.  That new node joined perfectly using the same methods 
I've tried on the old one.
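For anyone retracing this, the removal side is roughly the standard procedure (a sketch only; `ceph-mon-02` is the affected host name from the logs below):

```shell
# Drop the dead monitor from the monmap (run from a host with admin keys).
ceph mon remove ceph-mon-02

# Confirm the remaining monitors and the new monmap epoch.
ceph mon dump
```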

Network appears to be clean between all hosts.  Packet captures show them 
chatting just fine.  Since we're getting ready to upgrade from RHEL7 to RHEL8, 
I took this as an opportunity to reinstall the monitor as an 8 box to get that 
process rolling.  The box is now on RHEL8 with no change in how ceph-mon is acting.

I install machines with a kickstart and use our own ansible roles to get them 95% 
into service.  I then follow the manual install instructions 
(https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-monitors).
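Condensed, the steps from that page look like this (the /tmp staging paths are illustrative, not necessarily what I used):

```shell
# Fetch the current monmap and the mon. keyring from the cluster.
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring

# Initialize the new monitor's data directory from them.
ceph-mon --cluster ceph -i ceph-mon-02 --mkfs \
    --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ceph-mon-02
```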

Time is in sync, /var/lib/ceph/mon/* is owned by the right UID, keys are in 
sync, configs are in sync.  I pulled the old mon out of "mon initial members" 
and "mon host".  `nc` can talk to all the ports in question, and we've tried it 
with firewalld off as well (ditto with selinux).  I cleaned up some stale DNS and 
even tried a different IP (same DNS name).  I started all of this on 14.2.12, 
but 14.2.13 was released while I was debugging, so I've got that on the broken 
monitor at the moment.

I manually start the daemon in debug mode (/usr/bin/ceph-mon -d --cluster ceph 
--id ceph-mon-02 --setuser ceph --setgroup ceph) until it's joined, then use 
the systemd unit to start it once it's clean.  The current state is:

(Lightly sanitized output)
:snip:
2020-11-04 11:38:57.049 7f4232fb3540  0 mon.ceph-mon-02 does not exist in 
monmap, will attempt to join an existing cluster
2020-11-04 11:38:57.049 7f4232fb3540  0 using public_addr v2:Num.64:0/0 -> 
[v2:Num.64:3300/0,v1:Num.64:6789/0]
2020-11-04 11:38:57.050 7f4232fb3540  0 starting mon.ceph-mon-02 rank -1 at 
public addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] at bind addrs 
[v2:Num.64:3300/0,v1:Num.64:6789/0] mon_data /var/lib/ceph/mon/ceph-ceph-mon-02 
fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540  1 mon.ceph-mon-02@-1(???) e25 preinit 
fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540  1 mon.ceph-mon-02@-1(???) e25  
initial_members ceph-mon-01,ceph-mon-03, filtering seed monmap
2020-11-04 11:38:57.051 7f4232fb3540  0 mon.ceph-mon-02@-1(???).mds e430081 new 
map
2020-11-04 11:38:57.051 7f4232fb3540  0 mon.ceph-mon-02@-1(???).mds e430081 
print_map
:snip:
2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd e1198618 
crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd e1198618 
crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd e1198618 
crush map has features 3314933069571702784, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd e1198618 
crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.054 7f4232fb3540  1 
mon.ceph-mon-02@-1(???).paxosservice(auth 54141..54219) refresh upgraded, 
format 0 -> 3
2020-11-04 11:38:57.069 7f421d891700  1 mon.ceph-mon-02@-1(probing) e25 
handle_auth_request failed to assign global_id
 ^^^ last line repeated every few seconds until process killed

I've exhausted everything I can think of so I've just been doing the scientific 
shotgun (one slug at a time) approach to see what changes.  Does anyone else 
have any ideas?

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io