[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas

Hi Oliver,

I experienced a situation where the MGRs "went crazy", i.e. the MGR was active
but not working.
In the logs of the standby MGR nodes I found an error (after restarting the
service) that pointed to the Ceph Dashboard.


Since disabling the dashboard my MGRs are stable again.
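
(For reference, the dashboard module can be toggled with the standard mgr
module commands:)

ceph mgr module disable dashboard
# and, once the underlying issue is understood, to turn it back on:
ceph mgr module enable dashboard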

Regards
Thomas

On 02.11.2019 02:48, Oliver Freyermuth wrote:

Dear Cephers,

interestingly, after:
  ceph device monitoring off
the mgrs seem to be stable now. The active one still went silent a few minutes
later, but the standby took over and stayed stable, and the broken one, after a
restart, has now been stable for an hour as well. So a restart of the mgr is
probably needed after disabling device monitoring to get things stable again.

So it seems to be caused by a problem with the device health metrics. In case
this is a red herring and the mgrs become unstable again in the next days,
I'll let you know.
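
For anyone following along, a minimal sketch of that sequence (assuming a
systemd-based deployment; the mgr names/units will differ per setup):

ceph device monitoring off
systemctl restart ceph-mgr@$(hostname -s)   # on each mgr host, or:
ceph mgr fail <active-mgr-name>             # force a failover to the standby
ceph -s                                     # verify an active mgr is reported again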

Cheers,
Oliver

On 01.11.19 23:09, Oliver Freyermuth wrote:

Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - for about a day now, all mgr daemons have been going 
"silent" on me after a few hours, i.e. "ceph -s" shows:

   cluster:
     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
     health: HEALTH_WARN
             no active mgr
             1/3 mons down, quorum mon001,mon002

   services:
     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
     mgr: no daemons active (since 56m)
     ...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time when it 
goes "silent", first for the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information, 
and for example zero I/O going on.
I believe this started to happen mainly after the third mon went into the known 
downtime, but I am not fully sure if this was the trigger, since the cluster is 
still growing.
It may also have been the addition of 24 more OSDs.


I also find other messages in the mgr logs which seem problematic, but I am not 
sure they are related:
--
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
[errno 22] Failed to operate read op for oid
Traceback (most recent call last):
   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
put_device_metrics
 ioctx.operate_read_op(op, devid)
   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid
--
or:
--
2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.53 ()
--
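
(Side note: the devicehealth module stores its samples as omap data in a pool,
by default named device_health_metrics, with one object per device id - so the
objects it fails to read can be inspected directly; a sketch, assuming that
default pool name:)

ceph device ls                                       # device ids known to the mgr
rados -p device_health_metrics ls                    # one object per device id
rados -p device_health_metrics listomapkeys <devid>  # one omap key per scraped sample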

The reason why I am cautious about the health metrics is that I observed a 
crash when trying to query them:
--
2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": 
"device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentatio

[ceph-users] Re: iSCSI write performance

2019-11-02 Thread Maged Mokhtar

On 31/10/2019 18:45, Paul Emmerich wrote:


On Fri, Oct 25, 2019 at 11:14 PM Maged Mokhtar  wrote:

3. vmotion between a Ceph datastore and an external datastore.. this will be bad,
and this seems to be the case you are testing. It is bad because between 2 different
storage systems (the iqns are served by different targets) VAAI xcopy cannot be
used, so vmware does its own copying. It moves data using a 64k block size, which
gives low performance. To add some flavor, it does indeed use 32 threads, but
unfortunately they use co-located addresses, which does not work well in Ceph:
they hit the same rbd object, which gets serialized due to pg locks, so you will
not get any parallelization. Your speed will mostly be determined by a serial 64k
stream, so with 1 ms write latency on an ssd cluster you will get around 64
MB/s.. it will be slightly higher as the extra threads have some small effect.

Yes, vmotion is the worst IO pattern ever for a sequential copy.

However, the situation you are describing can be fixed with RBD
striping v2, just make Ceph switch to another object every 64kb, see
https://docs.ceph.com/docs/master/dev/file-striping/

I'm not sure about the state of striping v2 support in the kernel
module, last time I checked it wasn't supported. But
ceph-iscsi/tcmu-runner got quite good over the past year, I don't see
any point in still using the kernel data path for iscsi nowadays.
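
(For reference, striping parameters are set at image creation time; a minimal
sketch - the pool/image name is a placeholder, the 64 KiB stripe unit is just an
example value, and whether your I/O path supports fancy striping needs to be
checked as discussed above:)

rbd create rbd/vmware-test --size 100G --object-size 4M \
    --stripe-unit 65536 --stripe-count 16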


Paul



If case 3 (migration between Ceph and external stores) is important, then 
maybe it is worth trying Veeam Quick Migration, which supports live 
migration. We have not tried it, but maybe this tool is smarter than 
vmware and uses larger block sizes. They seem to have a free edition; if 
you try it, I would recommend bumping the iSCSI MaxIoSizeKB setting on the 
ESXi side, as it helps with their backup tool.
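
(If I remember the option name correctly - please verify on your ESXi build -
this is the ISCSI.MaxIoSizeKB advanced setting, e.g.:)

esxcli system settings advanced list -o /ISCSI/MaxIoSizeKB
esxcli system settings advanced set  -o /ISCSI/MaxIoSizeKB -i 512   # may need a host reboot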


I would not recommend using striping in this case; only very specific 
cases benefit from it. For general-purpose vm workloads, striping will 
probably just multiply your queue depth and latencies, with negative 
effects, especially with such small sizes.


Ilya added striping support to kernel rbd around mainline 4.17; we had sent 
our own patches earlier, but Ilya's changes were more comprehensive. Kernel 
rbd and librbd/tcmu both have pros and cons; the kernel path is 
used by SUSE. The main issue for us is support for persistent 
reservations, which is required for setting up clustering on Windows.


/Maged


Note that your esxtop output does show 32 active ios under ACTV. The QUED value of
zero is not the queue depth, but rather the "queued" io that ESX would suspend in
case your active count reaches the per-adapter maximum (128).

Just to clarify: if case 3 is not your primary concern, then I would forget 
about it and benchmark cases 1 and 2 if they are relevant. Else, if 3 is 
important, I am not sure you can do much, as it is happening within 
vmware.. maybe there is a way to map the external iqn so it is served by the 
same target serving the Ceph iqn; then there could be a chance that xcopy gets 
activated.. Mike would probably know if this has any chance of working :)

/Maged


On 25/10/2019 22:01, Ryan wrote:

esxtop is showing a queue length of 0

Storage motion to ceph
DEVICE: naa.6001405ec60d8b82342404d929fbbd03
  DQLEN 128, WQLEN -, ACTV 32, QUED 0, %USD 25, LOAD 0.25
  CMDS/s 1442.32, READS/s 0.18, WRITES/s 1440.50, MBREAD/s 0.00, MBWRTN/s 89.78
  DAVG/cmd 21.32, KAVG/cmd 0.01, GAVG/cmd 21.34, QAVG/cmd 0.01

Storage motion from ceph
DEVICE: naa.6001405ec60d8b82342404d929fbbd03
  DQLEN 128, WQLEN -, ACTV 32, QUED 0, %USD 25, LOAD 0.25
  CMDS/s 4065.38, READS/s 4064.83, WRITES/s 0.36, MBREAD/s 253.52, MBWRTN/s 0.00
  DAVG/cmd 7.57, KAVG/cmd 0.01, GAVG/cmd 7.58, QAVG/cmd 0.00

I tried using fio as you mentioned, but it hung at
[r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS] and the ETA kept climbing. I ended up using 
rbd bench on the ceph iscsi gateway instead. With a 64K write workload I'm seeing 
400 MB/s transfers.
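
(In case it helps to retry fio without the kernel mapping: a sketch using fio's
rbd engine, assuming fio was built with rbd support and reusing the pool/image
names from the commands below:)

fio --name=rbd64k --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
    --rw=write --bs=64k --iodepth=16 --numjobs=1 --size=10G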

rbd create test --size 100G --image-feature layering
rbd map test
mkfs.ext4 /dev/rbd/rbd/test
mount /dev/rbd/rbd/test test

rbd create testec --size 100G --image-feature layering --data-pool rbd_ec
rbd map testec
mkfs.ext4 /dev/rbd/rbd/testec
mount /dev/rbd/rbd/testec testec

[root@ceph-iscsi1 mnt]# rbd bench --image test --io-size 64K --io-type write 
--io-total 10G
bench  type write io_size 65536 io_threads 16 bytes 10737418240 pattern 
sequential
   SEC   OPS   OPS/SEC   BYTES/SEC
 1  6368   6377.59  417961796.64
 2 12928   6462.27  423511630.71
 3 19296   6420.18  420752986.78
 4 26320   6585.61  431594792.67
 5 33296   6662.37  436624891.04
 6 40128   6754.67  442673957.25
 7 46784   6765.75  443400452.26
 8 53280   6809.02  446236110.93

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Hi Thomas,

indeed, I also had the dashboard open at these times - but right now, after 
disabling the device health metrics,
I cannot retrigger it even when playing wildly on the dashboard. 

So I'll now re-enable the health metrics and try to retrigger the issue with 
cranked-up debug levels as Sage suggested. 
Maybe in your case, if you can tolerate mgr failures, it would also be 
interesting to do this to get the dashboard issue debugged? 

Cheers,
Oliver

On 02.11.19 08:23, Thomas wrote:
> Hi Oliver,
> 
> I experienced a situation where MGRs "goes crazy", means MGR was active but 
> not working.
> In the logs of the standby MGR nodes I found an error (after restarting 
> service) that pointed to Ceph Dashboard.
> 
> Since disabling the dashboard my MGRs are stable again.
> 
> Regards
> Thomas
> 
> Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
>> Dear Cephers,
>>
>> interestingly, after:
>>   ceph device monitoring off
>> the mgrs seem to be stable now - the active one still went silent a few 
>> minutes later,
>> but the standby took over and was stable, and restarting the broken one, 
>> it's now stable since an hour, too,
>> so probably, a restart of the mgr is needed after disabling device 
>> monitoring to get things stable again.
>>
>> So it seems to be caused by a problem with the device health metrics. In 
>> case this is a red herring and mgrs become instable again in the next days,
>> I'll let you know.
>>
>> Cheers,
>> Oliver
>>
>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>>> Dear Cephers,
>>>
>>> this is a 14.2.4 cluster with device health metrics enabled - since about a 
>>> day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
>>> shows:
>>>
>>>    cluster:
>>>  id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>  health: HEALTH_WARN
>>>  no active mgr
>>>  1/3 mons down, quorum mon001,mon002
>>>      services:
>>>  mon:    3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
>>> mon003
>>>  mgr:    no daemons active (since 56m)
>>>  ...
>>> (the third mon has a planned outage and will come back in a few days)
>>>
>>> Checking the logs of the mgr daemons, I find some "reset" messages at the 
>>> time when it goes "silent", first for the first mgr:
>>>
>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
>>> TiB / 138 TiB avail
>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
>>> v2:10.160.16.1:6800/401248
>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
>>> TiB / 138 TiB avail
>>>
>>> and a bit later, on the standby mgr:
>>>
>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
>>> v2:10.160.16.2:6800/352196
>>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>>
>>> Interestingly, the dashboard still works, but presents outdated 
>>> information, and for example zero I/O going on.
>>> I believe this started to happen mainly after the third mon went into the 
>>> known downtime, but I am not fully sure if this was the trigger, since the 
>>> cluster is still growing.
>>> It may also have been the addition of 24 more OSDs.
>>>
>>>
>>> I also find other messages in the mgr logs which seem problematic, but I am 
>>> not sure they are related:
>>> --
>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
>>> OMAP: [errno 22] Failed to operate read op for oid
>>> Traceback (most recent call last):
>>>    File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
>>> put_device_metrics
>>>  ioctx.operate_read_op(op, devid)
>>>    File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
>>> D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>    File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>> --
>>> or:
>>> --
>>> 2019-11-01 21:33:53.977 7f

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Reed Dier
Do you also have the balancer module on?

I experienced extremely bad stability issues where the MGRs would silently die 
with the balancer module on.
And by on, I mean `active: true` by way of `ceph balancer on`.

Once I disabled the automatic balancer, things seemed to become much more stable.

I can still run the balancer manually without issues (except for one pool), but 
the balancer appeared to be the big driver of instability for me.
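
For reference, a sketch of the commands involved ("myplan" is just a placeholder
plan name):

ceph balancer off              # stop the automatic balancer
ceph balancer status
ceph balancer eval             # score the current distribution
ceph balancer optimize myplan  # manual run: build a plan ...
ceph balancer show myplan      # ... inspect it ...
ceph balancer execute myplan   # ... and apply it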

Reed

> On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth 
>  wrote:
> 
> Hi Thomas,
> 
> indeed, I also had the dashboard open at these times - but right now, after 
> disabling device health metrics,
> I can not retrigger it even when playing wildly on the dashboard. 
> 
> So I'll now reenable health metrics and try to retrigger the issue with 
> cranked up debug levels as Sage suggested. 
> Maybe in your case, if you can stand mgr failures, this would also be 
> interesting to get the dashboard issue debugged? 
> 
> Cheers,
> Oliver
> 
> Am 02.11.19 um 08:23 schrieb Thomas:
>> Hi Oliver,
>> 
>> I experienced a situation where MGRs "goes crazy", means MGR was active but 
>> not working.
>> In the logs of the standby MGR nodes I found an error (after restarting 
>> service) that pointed to Ceph Dashboard.
>> 
>> Since disabling the dashboard my MGRs are stable again.
>> 
>> Regards
>> Thomas
>> 
>> Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
>>> Dear Cephers,
>>> 
>>> interestingly, after:
>>>   ceph device monitoring off
>>> the mgrs seem to be stable now - the active one still went silent a few 
>>> minutes later,
>>> but the standby took over and was stable, and restarting the broken one, 
>>> it's now stable since an hour, too,
>>> so probably, a restart of the mgr is needed after disabling device 
>>> monitoring to get things stable again.
>>> 
>>> So it seems to be caused by a problem with the device health metrics. In 
>>> case this is a red herring and mgrs become instable again in the next days,
>>> I'll let you know.
>>> 
>>> Cheers,
>>> Oliver
>>> 
>>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
 Dear Cephers,
 
 this is a 14.2.4 cluster with device health metrics enabled - since about 
 a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
 shows:
 
cluster:
  id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
  health: HEALTH_WARN
  no active mgr
  1/3 mons down, quorum mon001,mon002
  services:
  mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
 mon003
  mgr:no daemons active (since 56m)
  ...
 (the third mon has a planned outage and will come back in a few days)
 
 Checking the logs of the mgr daemons, I find some "reset" messages at the 
 time when it goes "silent", first for the first mgr:
 
 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
 pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
 TiB / 138 TiB avail
 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
 v2:10.160.16.1:6800/401248
 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
 pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
 TiB / 138 TiB avail
 
 and a bit later, on the standby mgr:
 
 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
 pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
 136 TiB / 138 TiB avail
 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
 v2:10.160.16.2:6800/352196
 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
 pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
 136 TiB / 138 TiB avail
 
 Interestingly, the dashboard still works, but presents outdated 
 information, and for example zero I/O going on.
 I believe this started to happen mainly after the third mon went into the 
 known downtime, but I am not fully sure if this was the trigger, since the 
 cluster is still growing.
 It may also have been the addition of 24 more OSDs.
 
 
 I also find other messages in the mgr logs which seem problematic, but I 
 am not sure they are related:
 --
 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
 OMAP: [errno 22] Failed to operate read op for oid
 Traceback (most recent call last):
File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
 put_device_metrics
  ioctx.operate_read_op(op, devid)
File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
 (/home/jenkins-build/build/workspace/ceph-build/AR

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Reed,

yes, the balancer is also on for me - but the instabilities vanished as soon as 
I turned off the device health metrics. 

Cheers,
Oliver

On 02.11.19 17:31, Reed Dier wrote:
> Do you also have the balancer module on?
> 
> I experienced extremely bad stability issues where the MGRs would silently 
> die with the balancer module on.
> And by on, I mean 'active:true` by way of `ceph balancer on`.
> 
> Once I disabled the automatic balancer, it seemed to become much more stable.
> 
> I can still manually run the balancer without issues (except for one pool), 
> but the balancer is what appeared to be my big driver of instability.
> 
> Reed
> 
>> On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth 
>>  wrote:
>>
>> Hi Thomas,
>>
>> indeed, I also had the dashboard open at these times - but right now, after 
>> disabling device health metrics,
>> I can not retrigger it even when playing wildly on the dashboard. 
>>
>> So I'll now reenable health metrics and try to retrigger the issue with 
>> cranked up debug levels as Sage suggested. 
>> Maybe in your case, if you can stand mgr failures, this would also be 
>> interesting to get the dashboard issue debugged? 
>>
>> Cheers,
>> Oliver
>>
>> Am 02.11.19 um 08:23 schrieb Thomas:
>>> Hi Oliver,
>>>
>>> I experienced a situation where MGRs "goes crazy", means MGR was active but 
>>> not working.
>>> In the logs of the standby MGR nodes I found an error (after restarting 
>>> service) that pointed to Ceph Dashboard.
>>>
>>> Since disabling the dashboard my MGRs are stable again.
>>>
>>> Regards
>>> Thomas
>>>
>>> Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
 Dear Cephers,

 interestingly, after:
   ceph device monitoring off
 the mgrs seem to be stable now - the active one still went silent a few 
 minutes later,
 but the standby took over and was stable, and restarting the broken one, 
 it's now stable since an hour, too,
 so probably, a restart of the mgr is needed after disabling device 
 monitoring to get things stable again.

 So it seems to be caused by a problem with the device health metrics. In 
 case this is a red herring and mgrs become instable again in the next days,
 I'll let you know.

 Cheers,
 Oliver

 Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
> Dear Cephers,
>
> this is a 14.2.4 cluster with device health metrics enabled - since about 
> a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph 
> -s" shows:
>
>cluster:
>  id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>  health: HEALTH_WARN
>  no active mgr
>  1/3 mons down, quorum mon001,mon002
>  services:
>  mon:3 daemons, quorum mon001,mon002 (age 57m), out of 
> quorum: mon003
>  mgr:no daemons active (since 56m)
>  ...
> (the third mon has a planned outage and will come back in a few days)
>
> Checking the logs of the mgr daemons, I find some "reset" messages at the 
> time when it goes "silent", first for the first mgr:
>
> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> TiB / 138 TiB avail
> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
> v2:10.160.16.1:6800/401248
> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> TiB / 138 TiB avail
>
> and a bit later, on the standby mgr:
>
> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> 136 TiB / 138 TiB avail
> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
> v2:10.160.16.2:6800/352196
> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> 136 TiB / 138 TiB avail
>
> Interestingly, the dashboard still works, but presents outdated 
> information, and for example zero I/O going on.
> I believe this started to happen mainly after the third mon went into the 
> known downtime, but I am not fully sure if this was the trigger, since 
> the cluster is still growing.
> It may also have been the addition of 24 more OSDs.
>
>
> I also find other messages in the mgr logs which seem problematic, but I 
> am not sure they are related:
> --
> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
> OMAP: [errno 22] Failed to operate read op for oid
> Traceback

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage,

at least for the simple case:
 ceph device get-health-metrics osd.11
=> mgr crashes (but in that case, it crashes fully, i.e. the process is gone)
I have now uploaded a verbose log as:
ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e

One potential cause of this (and maybe of the other issues) might be that some 
of our OSDs are on non-JBOD controllers and hence are built by forming a RAID 0 
per disk,
so a plain smartctl on the device will not work (-d megaraid,... would 
be needed). 
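
(For the record, a manual query behind such a controller looks roughly like
this - N is the per-drive device id on the controller, which is an assumption
here and has to be looked up, e.g. via the vendor tool:)

smartctl -a -d megaraid,N /dev/sda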

Now I have both mgrs active again, debug logging on and device health metrics on 
again,
and am waiting for them to go silent again. Let's hope the issue reappears 
before the disks fill up with logs ;-). 
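
(For completeness, one way to set the levels Sage asked for, assuming the
centralized config of Nautilus is used rather than ceph.conf:)

ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1
# and back to the defaults afterwards:
ceph config rm mgr debug_mgr
ceph config rm mgr debug_ms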

Cheers,
Oliver

On 02.11.19 02:56, Sage Weil wrote:
> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>> Dear Cephers,
>>
>> interestingly, after:
>>  ceph device monitoring off
>> the mgrs seem to be stable now - the active one still went silent a few 
>> minutes later,
>> but the standby took over and was stable, and restarting the broken one, 
>> it's now stable since an hour, too,
>> so probably, a restart of the mgr is needed after disabling device 
>> monitoring to get things stable again. 
>>
>> So it seems to be caused by a problem with the device health metrics. In 
>> case this is a red herring and mgrs become instable again in the next days,
>> I'll let you know. 
> 
> If this seems to stabilize things, and you can tolerate inducing the 
> failure again, reproducing the problem with mgr logs cranked up (debug_mgr 
> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is 
> hanging.  Let us know!
> 
> Thanks,
> sage
> 
>  > 
>> Cheers,
>>  Oliver
>>
>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>>> Dear Cephers,
>>>
>>> this is a 14.2.4 cluster with device health metrics enabled - since about a 
>>> day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
>>> shows:
>>>
>>>   cluster:
>>> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>> health: HEALTH_WARN
>>> no active mgr
>>> 1/3 mons down, quorum mon001,mon002
>>>  
>>>   services:
>>> mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
>>> mon003
>>> mgr:no daemons active (since 56m)
>>> ...
>>> (the third mon has a planned outage and will come back in a few days)
>>>
>>> Checking the logs of the mgr daemons, I find some "reset" messages at the 
>>> time when it goes "silent", first for the first mgr:
>>>
>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
>>> TiB / 138 TiB avail
>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
>>> v2:10.160.16.1:6800/401248
>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
>>> TiB / 138 TiB avail
>>>
>>> and a bit later, on the standby mgr:
>>>
>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
>>> v2:10.160.16.2:6800/352196
>>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>>> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>>
>>> Interestingly, the dashboard still works, but presents outdated 
>>> information, and for example zero I/O going on. 
>>> I believe this started to happen mainly after the third mon went into the 
>>> known downtime, but I am not fully sure if this was the trigger, since the 
>>> cluster is still growing. 
>>> It may also have been the addition of 24 more OSDs. 
>>>
>>>
>>> I also find other messages in the mgr logs which seem problematic, but I am 
>>> not sure they are related:
>>> --
>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
>>> OMAP: [errno 22] Failed to operate read op for oid 
>>> Traceback (most recent call last):
>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
>>> put_device_metrics
>>> ioctx.operate_read_op(op, devid)
>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
>>> D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Sage,

good news - it happened again, with debug logs! 
There's nothing obvious to my eye; it's uploaded as:
0b2d0c09-46f3-4126-aa27-e2d2e8572741
It seems the failure roughly coincided with my trying to access the 
dashboard. It must have happened within the last ~5-10 minutes of the log. 

I'll now go back to "stable operation"; in case you need anything else, just 
let me know. 

Cheers and all the best,
Oliver

On 02.11.19 17:38, Oliver Freyermuth wrote:
> Dear Sage,
> 
> at least for the simple case:
>  ceph device get-health-metrics osd.11
> => mgr crashes (but in that case, it crashes fully, i.e. the process is gone)
> I have now uploaded a verbose log as:
> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> 
> One potential cause of this (and maybe the other issues) might be because 
> some of our OSDs are on non-JBOD controllers and hence are made by forming a 
> Raid 0 per disk,
> so a simple smartctl on the device will not work (but -dmegaraid, 
> would be needed). 
> 
> Now I have both mgrs active again, debug logging on, device health metrics on 
> again,
> and am waiting for them to become silent again. Let's hope the issue 
> reappears before the disks run full of logs ;-). 
> 
> Cheers,
>   Oliver
> 
> Am 02.11.19 um 02:56 schrieb Sage Weil:
>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>> Dear Cephers,
>>>
>>> interestingly, after:
>>>  ceph device monitoring off
>>> the mgrs seem to be stable now - the active one still went silent a few 
>>> minutes later,
>>> but the standby took over and was stable, and restarting the broken one, 
>>> it's now stable since an hour, too,
>>> so probably, a restart of the mgr is needed after disabling device 
>>> monitoring to get things stable again. 
>>>
>>> So it seems to be caused by a problem with the device health metrics. In 
>>> case this is a red herring and mgrs become instable again in the next days,
>>> I'll let you know. 
>>
>> If this seems to stabilize things, and you can tolerate inducing the 
>> failure again, reproducing the problem with mgr logs cranked up (debug_mgr 
>> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is 
>> hanging.  Let us know!
>>
>> Thanks,
>> sage
>>
>>  > 
>>> Cheers,
>>> Oliver
>>>
>>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
 Dear Cephers,

 this is a 14.2.4 cluster with device health metrics enabled - since about 
 a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
 shows:

   cluster:
 id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
 health: HEALTH_WARN
 no active mgr
 1/3 mons down, quorum mon001,mon002
  
   services:
 mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
 mon003
 mgr:no daemons active (since 56m)
 ...
 (the third mon has a planned outage and will come back in a few days)

 Checking the logs of the mgr daemons, I find some "reset" messages at the 
 time when it goes "silent", first for the first mgr:

 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
 pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
 TiB / 138 TiB avail
 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
 v2:10.160.16.1:6800/401248
 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
 pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
 TiB / 138 TiB avail

 and a bit later, on the standby mgr:

 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
 pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
 136 TiB / 138 TiB avail
 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
 v2:10.160.16.2:6800/352196
 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
 pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
 136 TiB / 138 TiB avail

 Interestingly, the dashboard still works, but presents outdated 
 information, and for example zero I/O going on. 
 I believe this started to happen mainly after the third mon went into the 
 known downtime, but I am not fully sure if this was the trigger, since the 
 cluster is still growing. 
 It may also have been the addition of 24 more OSDs. 


 I also find other messages in the mgr logs which seem problematic, but I 
 am not sure they are related:
 --
 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
 OMAP: [errno 22] Failed to operate read op for oid 
 Traceback (most recent call last):
   File "/usr/share/ceph/mgr/devicehealt

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Janek Bevendorff
These issues sound a bit like a bug I reported a few days ago:
https://tracker.ceph.com/issues/39264




On 02/11/2019 17:34, Oliver Freyermuth wrote:
> Dear Reed,
>
> yes, also the balancer is on for me - but the instabilities vanished as soon 
> as I turned off device health metrics. 
>
> Cheers,
> Oliver
>
> Am 02.11.19 um 17:31 schrieb Reed Dier:
>> Do you also have the balancer module on?
>>
>> I experienced extremely bad stability issues where the MGRs would silently 
>> die with the balancer module on.
>> And by on, I mean 'active:true` by way of `ceph balancer on`.
>>
>> Once I disabled the automatic balancer, it seemed to become much more stable.
>>
>> I can still manually run the balancer without issues (except for one pool), 
>> but the balancer is what appeared to be my big driver of instability.
>>
>> Reed
>>
>>> On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth 
>>>  wrote:
>>>
>>> Hi Thomas,
>>>
>>> indeed, I also had the dashboard open at these times - but right now, after 
>>> disabling device health metrics,
>>> I can not retrigger it even when playing wildly on the dashboard. 
>>>
>>> So I'll now reenable health metrics and try to retrigger the issue with 
>>> cranked up debug levels as Sage suggested. 
>>> Maybe in your case, if you can stand mgr failures, this would also be 
>>> interesting to get the dashboard issue debugged? 
>>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 02.11.19 um 08:23 schrieb Thomas:
 Hi Oliver,

 I experienced a situation where MGRs "goes crazy", means MGR was active 
 but not working.
 In the logs of the standby MGR nodes I found an error (after restarting 
 service) that pointed to Ceph Dashboard.

 Since disabling the dashboard my MGRs are stable again.

 Regards
 Thomas

 Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
> Dear Cephers,
>
> interestingly, after:
>   ceph device monitoring off
> the mgrs seem to be stable now - the active one still went silent a few 
> minutes later,
> but the standby took over and was stable, and restarting the broken one, 
> it's now stable since an hour, too,
> so probably, a restart of the mgr is needed after disabling device 
> monitoring to get things stable again.
>
> So it seems to be caused by a problem with the device health metrics. In 
> case this is a red herring and mgrs become instable again in the next 
> days,
> I'll let you know.
>
> Cheers,
> Oliver
>
> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>> Dear Cephers,
>>
>> this is a 14.2.4 cluster with device health metrics enabled - since 
>> about a day, all mgr daemons go "silent" on me after a few hours, i.e. 
>> "ceph -s" shows:
>>
>>cluster:
>>  id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>  health: HEALTH_WARN
>>  no active mgr
>>  1/3 mons down, quorum mon001,mon002
>>  services:
>>  mon:3 daemons, quorum mon001,mon002 (age 57m), out of 
>> quorum: mon003
>>  mgr:no daemons active (since 56m)
>>  ...
>> (the third mon has a planned outage and will come back in a few days)
>>
>> Checking the logs of the mgr daemons, I find some "reset" messages at 
>> the time when it goes "silent", first for the first mgr:
>>
>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 
>> 136 TiB / 138 TiB avail
>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
>> v2:10.160.16.1:6800/401248
>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
>> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 
>> 136 TiB / 138 TiB avail
>>
>> and a bit later, on the standby mgr:
>>
>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB 
>> used, 136 TiB / 138 TiB avail
>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
>> v2:10.160.16.2:6800/352196
>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
>> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB 
>> used, 136 TiB / 138 TiB avail
>>
>> Interestingly, the dashboard still works, but presents outdated 
>> information, and for example zero I/O going on.
>> I believe this started to happen mainly after the third mon went into 
>> the known downtime, but I am not fully sure if this was

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Oliver Freyermuth
Dear Janek,

in my case, the mgr daemon itself remains "running", it just stops reporting to 
the mon. It even still serves the dashboard, but with outdated information. 
I grepped through the logs and could not find any clock skew messages. So it 
seems to be a different issue
(albeit both issues seem to be triggered by the devicehealth module). 

Cheers,
Oliver

On 2019-11-02 18:28, Janek Bevendorff wrote:
> These issues sound a bit like a bug I reported a few days ago: 
> https://tracker.ceph.com/issues/39264 
> 
> 
> Related: https://tracker.ceph.com/issues/39264 
> 
> 
> On 02/11/2019 17:34, Oliver Freyermuth wrote:
>> Dear Reed,
>>
>> yes, also the balancer is on for me - but the instabilities vanished as soon 
>> as I turned off device health metrics. 
>>
>> Cheers,
>> Oliver
>>
>> Am 02.11.19 um 17:31 schrieb Reed Dier:
>>> Do you also have the balancer module on?
>>>
>>> I experienced extremely bad stability issues where the MGRs would silently 
>>> die with the balancer module on.
>>> And by on, I mean 'active:true` by way of `ceph balancer on`.
>>>
>>> Once I disabled the automatic balancer, it seemed to become much more 
>>> stable.
>>>
>>> I can still manually run the balancer without issues (except for one pool), 
>>> but the balancer is what appeared to be my big driver of instability.
>>>
>>> Reed
>>>
 On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth 
  wrote:

 Hi Thomas,

 indeed, I also had the dashboard open at these times - but right now, 
 after disabling device health metrics,
 I can not retrigger it even when playing wildly on the dashboard. 

 So I'll now reenable health metrics and try to retrigger the issue with 
 cranked up debug levels as Sage suggested. 
 Maybe in your case, if you can stand mgr failures, this would also be 
 interesting to get the dashboard issue debugged? 

 Cheers,
 Oliver

 Am 02.11.19 um 08:23 schrieb Thomas:
> Hi Oliver,
>
> I experienced a situation where MGRs "goes crazy", means MGR was active 
> but not working.
> In the logs of the standby MGR nodes I found an error (after restarting 
> service) that pointed to Ceph Dashboard.
>
> Since disabling the dashboard my MGRs are stable again.
>
> Regards
> Thomas
>
> Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
>> Dear Cephers,
>>
>> interestingly, after:
>>   ceph device monitoring off
>> the mgrs seem to be stable now - the active one still went silent a few 
>> minutes later,
>> but the standby took over and was stable, and restarting the broken one, 
>> it's now stable since an hour, too,
>> so probably, a restart of the mgr is needed after disabling device 
>> monitoring to get things stable again.
>>
>> So it seems to be caused by a problem with the device health metrics. In 
>> case this is a red herring and mgrs become instable again in the next 
>> days,
>> I'll let you know.
>>
>> Cheers,
>> Oliver
>>
>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>>> Dear Cephers,
>>>
>>> this is a 14.2.4 cluster with device health metrics enabled - since 
>>> about a day, all mgr daemons go "silent" on me after a few hours, i.e. 
>>> "ceph -s" shows:
>>>
>>>cluster:
>>>  id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>  health: HEALTH_WARN
>>>  no active mgr
>>>  1/3 mons down, quorum mon001,mon002
>>>  services:
>>>  mon:3 daemons, quorum mon001,mon002 (age 57m), out of 
>>> quorum: mon003
>>>  mgr:no daemons active (since 56m)
>>>  ...
>>> (the third mon has a planned outage and will come back in a few days)
>>>
>>> Checking the logs of the mgr daemons, I find some "reset" messages at 
>>> the time when it goes "silent", first for the first mgr:
>>>
>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] 
>>> : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
>>> v2:10.160.16.1:6800/401248
>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] 
>>> : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 
>>> 136 TiB / 138 TiB avail
>>>
>>> and a bit later, on the standby mgr:
>>>
>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] 
>>> : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB 
>>> used, 136 TiB / 138 TiB avail
>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
>>> v2

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas

Hi,

I experience major issues with the MGR, and as it happens my drives are on 
non-JBOD controllers, too (like Oliver's drives).


Regards
Thomas

On 02.11.2019 17:38, Oliver Freyermuth wrote:

Dear Sage,

at least for the simple case:
  ceph device get-health-metrics osd.11
=> mgr crashes (but in that case, it crashes fully, i.e. the process is gone)
I have now uploaded a verbose log as:
ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e

One potential cause of this (and maybe the other issues) might be because some 
of our OSDs are on non-JBOD controllers and hence are made by forming a Raid 0 
per disk,
so a simple smartctl on the device will not work (but -dmegaraid, would 
be needed).

Now I have both mgrs active again, debug logging on, device health metrics on 
again,
and am waiting for them to become silent again. Let's hope the issue reappears 
before the disks run full of logs ;-).

Cheers,
Oliver

On 02.11.19 02:56, Sage Weil wrote:

On Sat, 2 Nov 2019, Oliver Freyermuth wrote:

Dear Cephers,

interestingly, after:
  ceph device monitoring off
the mgrs seem to be stable now - the active one still went silent a few minutes 
later,
but the standby took over and was stable, and restarting the broken one, it's 
now stable since an hour, too,
so probably, a restart of the mgr is needed after disabling device monitoring 
to get things stable again.

So it seems to be caused by a problem with the device health metrics. In case 
this is a red herring and mgrs become instable again in the next days,
I'll let you know.

If this seems to stabilize things, and you can tolerate inducing the
failure again, reproducing the problem with mgr logs cranked up (debug_mgr
= 20, debug_ms = 1) would probably give us a good idea of why the mgr is
hanging.  Let us know!

Thanks,
sage

  >

Cheers,
Oliver

On 01.11.19 23:09, Oliver Freyermuth wrote:

Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go 
"silent" on me after a few hours, i.e. "ceph -s" shows:

   cluster:
 id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
 health: HEALTH_WARN
 no active mgr
 1/3 mons down, quorum mon001,mon002
  
   services:

 mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
mon003
 mgr:no daemons active (since 56m)
 ...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time when it 
goes "silent", first for the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information, 
and for example zero I/O going on.
I believe this started to happen mainly after the third mon went into the known 
downtime, but I am not fully sure if this was the trigger, since the cluster is 
still growing.
It may also have been the addition of 24 more OSDs.


I also find other messages in the mgr logs which seem problematic, but I am not 
sure they are related:
--
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
[errno 22] Failed to operate read op for oid
Traceback (most recent call last):
   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
put_device_metrics
 ioctx.operate_read_op(op, devid)
   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-02 Thread Thomas

Hi,

in the error log of my active MGR I find these errors after some time:
2019-11-02 19:07:30.629 7f448f1cb700  0 auth: could not find secret_id=3769
2019-11-02 19:07:30.629 7f448f1cb700  0 cephx: verify_authorizer could 
not get service secret for service mgr secret_id=3769

2019-11-02 19:07:30.821 7f448f1cb700  0 auth: could not find secret_id=3768
2019-11-02 19:07:30.821 7f448f1cb700  0 cephx: verify_authorizer could 
not get service secret for service mgr secret_id=3768

2019-11-02 19:07:31.597 7f448f1cb700  0 auth: could not find secret_id=3768
2019-11-02 19:07:31.597 7f448f1cb700  0 cephx: verify_authorizer could 
not get service secret for service mgr secret_id=3768


Once these errors occur, the MGR is no longer working, i.e. there is no PG 
activity shown in ceph -s, although the cluster health status is HEALTH_ERR.
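
(Since the cephx service secrets rotate over time, one quick sanity check when
verify_authorizer starts failing like this is whether the mons and the mgr host
agree on the time, e.g.:)

ceph time-sync-status     # mon view of clock skew
timedatectl               # on the mgr host itself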


Regards
Thomas

On 02.11.2019 17:38, Oliver Freyermuth wrote:

Dear Sage,

at least for the simple case:
  ceph device get-health-metrics osd.11
=> mgr crashes (but in that case, it crashes fully, i.e. the process is gone)
I have now uploaded a verbose log as:
ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e

One potential cause of this (and maybe the other issues) might be because some 
of our OSDs are on non-JBOD controllers and hence are made by forming a Raid 0 
per disk,
so a simple smartctl on the device will not work (but -dmegaraid, would 
be needed).

Now I have both mgrs active again, debug logging on, device health metrics on 
again,
and am waiting for them to become silent again. Let's hope the issue reappears 
before the disks run full of logs ;-).

Cheers,
Oliver

On 02.11.19 02:56, Sage Weil wrote:

On Sat, 2 Nov 2019, Oliver Freyermuth wrote:

Dear Cephers,

interestingly, after:
  ceph device monitoring off
the mgrs seem to be stable now - the active one still went silent a few minutes 
later,
but the standby took over and was stable, and restarting the broken one, it's 
now stable since an hour, too,
so probably, a restart of the mgr is needed after disabling device monitoring 
to get things stable again.

So it seems to be caused by a problem with the device health metrics. In case 
this is a red herring and mgrs become instable again in the next days,
I'll let you know.

If this seems to stabilize things, and you can tolerate inducing the
failure again, reproducing the problem with mgr logs cranked up (debug_mgr
= 20, debug_ms = 1) would probably give us a good idea of why the mgr is
hanging.  Let us know!

Thanks,
sage

  >

Cheers,
Oliver

On 01.11.19 23:09, Oliver Freyermuth wrote:

Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go 
"silent" on me after a few hours, i.e. "ceph -s" shows:

   cluster:
 id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
 health: HEALTH_WARN
 no active mgr
 1/3 mons down, quorum mon001,mon002
  
   services:

 mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
mon003
 mgr:no daemons active (since 56m)
 ...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time when it 
goes "silent", first for the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information, 
and for example zero I/O going on.
I believe this started to happen mainly after the third mon went into the known 
downtime, but I am not fully sure if this was the trigger, since the cluster is 
still growing.
It may also have been the addition of 24 more OSDs.


I also find other messages in the mgr logs which seem problematic, but I am not 
sure they are related:
--
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
[errno 22] Failed to operate read op for oid
Traceback (most recent call last):
   File "/usr/share/ceph/mgr/devicehealth/module.p

[ceph-users] Device Health Metrics on EL 7

2019-11-02 Thread Oliver Freyermuth
Dear Cephers,

I went through some of the OSD logs of our 14.2.4 nodes and found this:
--
Nov 01 01:22:25  sudo[1087697]: ceph : TTY=unknown ; PWD=/ ; USER=root ; 
COMMAND=/sbin/smartctl -a --json /dev/sds
Nov 01 01:22:51  sudo[1087729]: pam_unix(sudo:auth): conversation failed
Nov 01 01:22:51  sudo[1087729]: pam_unix(sudo:auth): auth could not identify 
password for [ceph]
Nov 01 01:22:51  sudo[1087729]: pam_succeed_if(sudo:auth): requirement "uid >= 
1000" not met by user "ceph"
Nov 01 01:22:53  sudo[1087729]: ceph : command not allowed ; TTY=unknown ; 
PWD=/ ; USER=root ; COMMAND=nvme lvm smart-log-add --json /dev/sds
--
It seems that with device health metrics enabled, the OSDs try to run smartctl via 
"sudo", which expectedly fails, since the ceph user (being a system user) has a uid 
smaller than 1000. 
Also, it is of course not listed in /etc/sudoers. 
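
(A minimal sketch of a sudoers drop-in matching the two commands in the log
above - the binary paths and the nvme plugin wildcard are assumptions and may
differ per distro:)

# /etc/sudoers.d/ceph-smartctl
Defaults:ceph !requiretty
ceph ALL=(root) NOPASSWD: /sbin/smartctl -a --json /dev/*
ceph ALL=(root) NOPASSWD: /sbin/nvme * smart-log-add --json /dev/*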

Does somebody have a working setup with device health metrics that could be 
shared (and documented, or made part of future packaging ;-))?

Cheers,
Oliver


