Over the weekend, all five MGRs failed, which means we have no more Prometheus monitoring data. We are obviously monitoring the MGR status as well, so we can detect the failure, but it's still a pretty serious issue. Any ideas as to why this might happen?

On 13/03/2020 16:56, Janek Bevendorff wrote:
Indeed. I just had another MGR go bye-bye. I don't think host clock skew is the problem.


On 13/03/2020 15:29, Anthony D'Atri wrote:
Chrony does converge faster, but I doubt it will solve your problem if you don’t have quality peers, or if it’s not really a time problem.

On Mar 13, 2020, at 6:44 AM, Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:

I replaced ntpd with chronyd and will let you know if it changes anything. Thanks.


On 13/03/2020 06:25, Konstantin Shalygin wrote:
On 3/13/20 12:57 AM, Janek Bevendorff wrote:
NTPd is running, all the nodes have the same time to the second. I don't think that is the problem.
As always in such cases, try switching from ntpd to chronyd, the default EL7 time daemon.



k
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Bauhaus-Universität Weimar
Bauhausstr. 9a, Room 308
99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3577