[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Gregory Farnum
On Wed, Nov 6, 2019 at 1:29 PM Sage Weil wrote: > > My current working theory is that the mgr is getting hung up when it tries > to scrape the device metrics from the mon. The 'tell' mechanism used to > send mon-targeted commands is pretty kludgey/broken in nautilus and > earlier. It's been rewritten for octopus …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Oliver Freyermuth
Dear Sage, On 07.11.19 at 14:33, Sage Weil wrote: On Thu, 7 Nov 2019, Thomas Schneider wrote: Hi, I have installed package ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb manually: root@ld5505:/home# dpkg --force-depends -i ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb (Reading database ... 107461 files and directories currently installed.) …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Thomas Schneider
Hi, looks like I sent my previous email too soon. The error 2019-11-07 15:53:06.077 7f7ea8afe700  0 auth: could not find secret_id=3887 2019-11-07 15:53:06.077 7f7ea8afe700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3887 is back in the MGR log. ;-( On 07.11. …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Thomas Schneider
Hi, I have installed all ceph packages from Sage's repo, i.e. ceph, ceph-common, ceph-mds, ceph-mgr-dashboard, ceph-mon, ceph-osd, libcephfs2, librados2, libradosstriper1, librbd1, librgw2, python-ceph-argparse, python-cephfs, python-rados, python-rbd and python-rgw, after adding his repo and executing apt upgrade …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Sage Weil
On Thu, 7 Nov 2019, Thomas Schneider wrote: > Hi, > > I have installed package > ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb > manually: > root@ld5505:/home# dpkg --force-depends -i > ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb > (Reading database ... 107461 files and directories currently installed.) …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Thomas Schneider
Hi, I have installed package ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb manually: root@ld5505:/home# dpkg --force-depends -i ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb (Reading database ... 107461 files and directories currently installed.) Preparing to unpack ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-07 Thread Oliver Freyermuth
Dear Thomas, the most correct thing to do is probably to add the full repo (the original link was still empty for me, but https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ seems to work). The commit itself suggests the ceph-mgr package should be sufficient. I'm still pondering …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread Thomas Schneider
Hi, can you please advise which package(s) should be installed? Thanks. On 06.11.2019 at 22:28, Sage Weil wrote: > My current working theory is that the mgr is getting hung up when it tries > to scrape the device metrics from the mon. The 'tell' mechanism used to > send mon-targeted commands …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread Sage Weil
My current working theory is that the mgr is getting hung up when it tries to scrape the device metrics from the mon. The 'tell' mechanism used to send mon-targeted commands is pretty kludgey/broken in nautilus and earlier. It's been rewritten for octopus, but isn't worth backporting …
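The workaround that emerges later in this thread follows directly from this theory: if the mgr hangs while scraping device metrics from the mon, stop the scraping. A minimal sketch, assuming a systemd-managed Nautilus deployment (the unit name is an assumption and depends on your hostname and packaging):

```shell
# Disable device health metric scraping so the mgr no longer hangs on the
# mon 'tell' path (command quoted verbatim later in this thread):
ceph device monitoring off

# Then restart the stuck mgr daemon; the unit name below is an assumption:
systemctl restart ceph-mgr@$(hostname -s).service
```

As Oliver reports further down, a standby mgr may also simply take over once the active one goes silent, but the hung daemon still needs a restart to become usable again.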

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread Thomas Schneider
Well, even after restarting the MGR service the relevant log is flooded with these error messages: 2019-11-06 17:46:22.363 7f81ffdcc700  0 auth: could not find secret_id=3865 2019-11-06 17:46:22.363 7f81ffdcc700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3865 …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread Thomas Schneider
Hi, does anybody else get these error messages in the MGR log? 2019-11-06 15:41:44.765 7f10db740700  0 auth: could not find secret_id=3863 2019-11-06 15:41:44.765 7f10db740700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3863 THX. On 06.11.2019 at 10:43, Oliver Freyermuth wrote: …
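cephx "could not find secret_id" / "verify_authorizer could not get service secret" errors like the ones above are commonly caused by clock skew between daemons, since the rotating cephx service keys are time-limited. A quick triage sketch (this is general cephx troubleshooting, not something stated in the truncated message; `chronyc` assumes chrony is your time daemon):

```shell
# Check the local clock offset (assumes chrony; ntpd/timesyncd setups differ):
chronyc tracking

# Ask the monitors for their view of time sync across the quorum:
ceph time-sync-status

# Clock skew also surfaces as a HEALTH_WARN in cluster status:
ceph status
```

If the clocks are in sync, a restart of the affected mgr (so it fetches fresh rotating keys) is the usual next step, which matches what Thomas tries in the surrounding messages.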

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread thoralf schulze
hi oliver, On 11/6/19 10:43 AM, Oliver Freyermuth wrote: […] > Did somebody see something similar after running for a week or more with > Nautilus on old and slow hardware? yes, same here: significantly more mgr failovers / compaction jobs with nautilus than with mimic … most likely due to pgs …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-06 Thread Oliver Freyermuth
Hi all, interestingly, now that the third mon has been missing for almost a week (those planned interventions always take longer than expected...), we get mgr failovers (but without crashes). In the mgr log, I find: 2019-11-06 07:50:05.409 7fce8a0dc700 0 client.0 ms_handle_reset on v2:10.160. …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-04 Thread Janek Bevendorff
On 02.11.19 18:35, Oliver Freyermuth wrote: Dear Janek, in my case, the mgr daemon itself remains "running", it just stops reporting to the mon. It even still serves the dashboard, but with outdated information. This is not so different. The MGRs in my case are running, but stop responding.

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi, in the error log of my active MGR I find these errors after some time: 2019-11-02 19:07:30.629 7f448f1cb700  0 auth: could not find secret_id=3769 2019-11-02 19:07:30.629 7f448f1cb700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3769 …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi, I experience major issues with MGR, and by chance my drives are on non-JBOD controllers, too (like Oliver's drives). Regards, Thomas. On 02.11.2019 at 17:38, Oliver Freyermuth wrote: Dear Sage, at least for the simple case: ceph device get-health-metrics osd.11 => mgr crashes (but in that case, it crashes fully, i.e. the process is gone) …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Janek, in my case, the mgr daemon itself remains "running", it just stops reporting to the mon. It even still serves the dashboard, but with outdated information. I grepped through the logs and could not find any clock skew messages. So it seems to be a different issue (albeit both issues …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Janek Bevendorff
These issues sound a bit like a bug I reported a few days ago: https://tracker.ceph.com/issues/39264 Related: https://tracker.ceph.com/issues/39264 On 02/11/2019 17:34, Oliver Freyermuth wrote: …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage, good news - it happened again, with debug logs! There's nothing obvious to my eye, it's uploaded as: 0b2d0c09-46f3-4126-aa27-e2d2e8572741 It seems the failure was roughly in parallel to me wanting to access the dashboard. It must have happened within the last ~5-10 minutes of the log.

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage, at least for the simple case: ceph device get-health-metrics osd.11 => mgr crashes (but in that case, it crashes fully, i.e. the process is gone) I have now uploaded a verbose log as: ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e One potential cause of this (and maybe the other …
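The two commands referenced in this message, sketched as one reproduce-and-report sequence. Both commands appear verbatim in the thread; the log path is an assumption and depends on your hostname and log configuration:

```shell
# Trigger the reported crash case (osd.11 is the OSD named in the thread):
ceph device get-health-metrics osd.11

# Upload the verbose mgr log for the developers; ceph-post-file prints an
# upload id like the ones quoted in this thread:
ceph-post-file /var/log/ceph/ceph-mgr.$(hostname -s).log
```

The upload id that `ceph-post-file` prints (e.g. the e3bd60ad-… id above) is what gets shared on the list so developers can locate the log.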

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Reed, yes, the balancer is also on for me, but the instabilities vanished as soon as I turned off device health metrics. Cheers, Oliver. On 02.11.19 at 17:31, Reed Dier wrote: > Do you also have the balancer module on? > > I experienced extremely bad stability issues where the MGRs would silently die with the balancer module on …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Reed Dier
Do you also have the balancer module on? I experienced extremely bad stability issues where the MGRs would silently die with the balancer module on. And by on, I mean `active: true` by way of `ceph balancer on`. Once I disabled the automatic balancer, it seemed to become much more stable. I can …
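The check-and-disable sequence Reed describes can be sketched as (both commands are standard balancer-module CLI; whether disabling helps in your cluster is, per this thread, anecdotal):

```shell
# Show whether the automatic balancer is active ("active": true/false):
ceph balancer status

# Turn automatic balancing off; manually generated plans can still be run:
ceph balancer off
```

Note that in Oliver's case (previous message) the instability tracked the device health metrics, not the balancer, so it is worth toggling one variable at a time.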

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Hi Thomas, indeed, I also had the dashboard open at these times, but right now, after disabling device health metrics, I cannot retrigger it even when playing wildly on the dashboard. So I'll now reenable health metrics and try to retrigger the issue with cranked-up debug levels, as Sage suggested …
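A sketch of the "reenable with cranked-up debug levels" step. The specific debug subsystems and levels below are assumptions (the exact values Sage suggested are cut off in the archived message); `debug_mgr 20` and `debug_ms 1` are a common verbose combination for mgr issues:

```shell
# Re-enable device health metric scraping to retrigger the hang:
ceph device monitoring on

# Raise mgr debug verbosity via the central config (Nautilus and later):
ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1

# ...reproduce the hang, collect the log, then restore the defaults:
ceph config set mgr debug_mgr 1
ceph config set mgr debug_ms 0
```

The resulting verbose log is what gets uploaded with `ceph-post-file`, as in the earlier messages in this thread.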

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas
Hi Oliver, I experienced a situation where the MGRs "went crazy", meaning the MGR was active but not working. In the logs of the standby MGR nodes I found an error (after restarting the service) that pointed to the Ceph Dashboard. Since disabling the dashboard, my MGRs have been stable again. Regards, Thomas. On 02.1 …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-01 Thread Sage Weil
On Sat, 2 Nov 2019, Oliver Freyermuth wrote: > Dear Cephers, > > interestingly, after: > ceph device monitoring off > the mgrs seem to be stable now - the active one still went silent a few > minutes later, > but the standby took over and was stable, and restarting the broken one, it's > now stable …

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-01 Thread Oliver Freyermuth
Dear Cephers, interestingly, after: ceph device monitoring off the mgrs seem to be stable now - the active one still went silent a few minutes later, but the standby took over and was stable, and after restarting the broken one, it has now been stable for an hour, too, so probably a restart of the mgr is …