Hi all,

One of our MONs was down for maintenance for ca. 45 minutes. I then started it 
up again and it rejoined the cluster.

Unfortunately, things did not go as expected. The MON sub-cluster became 
unresponsive for a bit more than 10 minutes. Admin commands would hang, even if 
issued directly to a specific monitor via "ceph tell mon.xxx". In addition, our 
MDS lost connection to the MONs and reported a laggy connection. Consequently, 
all ceph fs access was frozen for a bit more than 10 minutes as well.

From the little I could get out with "ceph daemon mon.xxx mon_status" I could 
see that the restarted MON was in state "synchronizing" (or similar, it's from 
memory) while the other MONs were in quorum.
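
In case it helps anyone reproduce this, here is a rough sketch of how the 
state could be polled through the admin socket during a restart. It assumes 
the JSON printed by mon_status contains the fields "name", "rank", "state" and 
"quorum"; the function names are made up for illustration, and the script has 
to run on the host of the monitor being queried.

```python
import json
import subprocess
import sys


def summarize_mon_status(raw_json):
    """Return (name, state, in_quorum) parsed from mon_status JSON text."""
    status = json.loads(raw_json)
    name = status.get("name", "?")
    state = status.get("state", "unknown")
    # Assumption: a monitor counts as "in quorum" when its own rank
    # is listed in the "quorum" array of its status output.
    in_quorum = status.get("rank", -1) in status.get("quorum", [])
    return name, state, in_quorum


def query_mon(mon_id):
    """Query the local admin socket of mon.<mon_id> (hypothetical helper)."""
    result = subprocess.run(
        ["ceph", "daemon", "mon." + mon_id, "mon_status"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return summarize_mon_status(result.stdout)


def main():
    if len(sys.argv) != 2:
        print("usage: mon_state.py <mon-id>")
        return
    print(query_mon(sys.argv[1]))


if __name__ == "__main__":
    main()
```

The timeout is there so that the script itself does not hang if the admin 
socket stops answering.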

Our cluster is mimic-13.2.8. Somehow, this observation does not fit together 
with the intended HA of the MON cluster; there should not be any stall at all.

My questions: Why do the MONs become unresponsive for such a long time? What 
are the MONs doing during this time frame? Are there any config options I 
should look at? Are there any log messages I should hunt for?

Any hint is appreciated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
