[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi Oliver, I experienced a situation where the MGRs "went crazy", meaning the MGR was active but not working. In the logs of the standby MGR nodes I found an error (after restarting the service) that pointed to the Ceph Dashboard. Since disabling the dashboard, my MGRs have been stable again. Regards Thomas Am 02.1
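[Editor's note: a minimal sketch of how the dashboard module Thomas mentions can be toggled with the standard mgr module commands; these are generic ceph CLI calls, not taken from his message.]

    ceph mgr module disable dashboard    # turn the dashboard off on the active mgr
    ceph mgr module ls                   # verify it no longer appears under "enabled_modules"
    ceph mgr module enable dashboard     # re-enable it later to retest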

[ceph-users] Re: iSCSI write performance

2019-11-02 Thread Maged Mokhtar
On 31/10/2019 18:45, Paul Emmerich wrote: On Fri, Oct 25, 2019 at 11:14 PM Maged Mokhtar wrote: 3. vMotion between a Ceph datastore and an external datastore; this will be bad. This seems to be the case you are testing. It is bad because between 2 different storage systems (iqns are served on differ

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Hi Thomas, indeed, I also had the dashboard open at these times - but right now, after disabling device health metrics, I cannot retrigger it even when playing wildly on the dashboard. So I'll now re-enable health metrics and try to retrigger the issue with cranked-up debug levels as Sage sugg
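[Editor's note: a sketch of the steps Oliver describes, assuming Nautilus command names; `ceph device monitoring on/off` toggles health-metric collection and `debug_mgr` raises mgr log verbosity.]

    ceph device monitoring off        # disable device health metric collection
    ceph config set mgr debug_mgr 20  # crank up mgr debug logging before retriggering
    ceph device monitoring on         # re-enable metrics and try to reproduce the hang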

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Reed Dier
Do you also have the balancer module on? I experienced extremely bad stability issues where the MGRs would silently die with the balancer module on. And by on, I mean `active: true` by way of `ceph balancer on`. Once I disabled the automatic balancer, it seemed to become much more stable. I can
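[Editor's note: for completeness, the balancer state Reed refers to can be inspected and switched off with the standard balancer module commands.]

    ceph balancer status   # shows "active": true/false and the current mode
    ceph balancer off      # disable the automatic balancer
    ceph balancer on       # turn it back on once the mgr is stable again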

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Reed, yes, the balancer is also on for me - but the instabilities vanished as soon as I turned off device health metrics. Cheers, Oliver On 02.11.19 at 17:31, Reed Dier wrote: > Do you also have the balancer module on? > > I experienced extremely bad stability issues where the MGRs woul

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage, at least for the simple case: ceph device get-health-metrics osd.11 => mgr crashes (but in that case, it crashes fully, i.e. the process is gone) I have now uploaded a verbose log as: ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e One potential cause of this (and maybe the other
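[Editor's note: a sketch of how such a log can be captured and shared with ceph-post-file, assuming the mgr log lives under /var/log/ceph; the mgr id in the path is a placeholder.]

    ceph config set mgr debug_mgr 20                   # verbose mgr logging
    ceph device get-health-metrics osd.11              # reproduces the crash described above
    ceph-post-file /var/log/ceph/ceph-mgr.<id>.log     # uploads the log and prints a UUID like the one quoted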

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage, good news - it happened again, with debug logs! There's nothing obvious to my eye; it's uploaded as: 0b2d0c09-46f3-4126-aa27-e2d2e8572741 It seems the failure roughly coincided with my accessing the dashboard. It must have happened within the last ~5-10 minutes of the log.

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Janek Bevendorff
These issues sound a bit like a bug I reported a few days ago: https://tracker.ceph.com/issues/39264 On 02/11/2019 17:34, Oliver Freyermuth wr

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Janek, in my case, the mgr daemon itself remains "running"; it just stops reporting to the mon. It even still serves the dashboard, but with outdated information. I grepped through the logs and could not find any clock skew messages. So it seems to be a different issue (albeit both issues
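[Editor's note: one way to do the check Oliver mentions, assuming default log locations; the paths are placeholders and depend on cluster name and host.]

    grep -i 'clock skew' /var/log/ceph/ceph-mgr.*.log /var/log/ceph/ceph-mon.*.log
    ceph health detail     # would also report MON_CLOCK_SKEW if the monitors see skew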

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi, I experience major issues with the MGR, and by chance my drives are on non-JBOD controllers, too (like Oliver's drives). Regards Thomas On 02.11.2019 at 17:38, Oliver Freyermuth wrote: Dear Sage, at least for the simple case: ceph device get-health-metrics osd.11 => mgr crashes (but in th

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi, in the error log of my active MGR I find these errors after some time: 2019-11-02 19:07:30.629 7f448f1cb700  0 auth: could not find secret_id=3769 2019-11-02 19:07:30.629 7f448f1cb700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3769 2019-11-02 19:07:30

[ceph-users] Device Health Metrics on EL 7

2019-11-02 Thread Oliver Freyermuth
Dear Cephers, I went through some of the OSD logs of our 14.2.4 nodes and found this: -- Nov 01 01:22:25 sudo[1087697]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/sbin/smartctl -a --json /dev/sds Nov 01 01:22:51 sudo[1087729]: pam_unix(sudo:auth): conv
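[Editor's note: the `pam_unix(sudo:auth)` line suggests sudo is prompting for a password it cannot receive without a TTY. A minimal sketch of a sudoers drop-in that would let the ceph user run smartctl non-interactively follows; the file name, path, and exact rule are assumptions and should be checked against your distribution's packaging.]

    # /etc/sudoers.d/ceph-smartctl  (hypothetical file name)
    ceph ALL=(root) NOPASSWD: /sbin/smartctl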