On 30.05.2018 at 10:37, Yan, Zheng wrote:
> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
> <> wrote:
>> Hi,
>> in our case, there's only a single active MDS
>> (+1 standby-replay + 1 standby).
>> We also get the health warning in case it happens.
> Were there any "isn't responding to mclientcaps(revoke)"
> warnings in the cluster log? Please send them to me if there were.

Yes, indeed, I almost missed them!

Here you go:

2018-05-29 12:16:02.491186 mon.mon003 mon.0 11177 : cluster 
[WRN] MDS health message (mds.0): Client XXXXXXX:XXXXXXX failing to respond to 
capability release
2018-05-29 12:16:03.401014 mon.mon003 mon.0 11178 : cluster 
[WRN] Health check failed: 1 clients failing to respond to capability release 
2018-05-29 12:16:00.567520 mds.mon001 mds.0 15745 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago
<repetition of the message with increasing delays in between>
2018-05-29 16:31:00.899416 mds.mon001 mds.0 17169 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago
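
For anyone who wants to pull the relevant fields out of such warnings, here is a minimal sketch in Python. It assumes the message layout is exactly as pasted above and that a cap string is an optional "p" followed by groups of one uppercase letter plus lowercase flags (so "Frw" is read as "Fr" and "Fw"); the function names are made up for illustration and are not part of any Ceph tooling.

```python
import re

# Layout of the MDS warning as it appears in the cluster log above.
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+), sent (?P<age>\d+(?:\.\d+)?) seconds ago"
)


def split_caps(caps):
    """Split a cap string such as 'pAsLsXsFrw' into {'p','As','Ls','Xs','Fr','Fw'}.

    Assumption: uppercase letters start a group, lowercase letters are flags
    of the current group, and flags before any group (e.g. 'p') stand alone.
    """
    tokens = set()
    group = None
    for ch in caps:
        if ch.isupper():
            group = ch
        elif group is None:
            tokens.add(ch)
        else:
            tokens.add(group + ch)
    return tokens


def parse_revoke_warning(text):
    """Return the fields of one revoke warning, or None if text does not match."""
    flat = " ".join(text.split())  # undo the line wrapping of the pasted log
    m = REVOKE_RE.search(flat)
    if m is None:
        return None
    pending = split_caps(m["pending"])
    issued = split_caps(m["issued"])
    return {
        "client": int(m["client"]),
        "ino": int(m["ino"], 16),
        "age_seconds": float(m["age"]),
        # Caps the client still holds although the MDS asked for them back.
        "not_released": issued - pending,
    }


if __name__ == "__main__":
    sample = """2018-05-29 12:16:00.567520 mds.mon001 mds.0 15745
    : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino
    0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago"""
    print(parse_revoke_warning(sample))
```

Under that reading of the cap string, the only difference between "issued" and "pending" in the messages above is the "w" flag in the "F" group, for the same client and inode throughout.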

After evicting the client, I also get:
2018-05-29 17:00:00.000134 mon.mon003 mon.0 11293 : cluster 
[WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 
MDSs report slow requests
2018-05-29 17:09:50.964730 mon.mon003 mon.0 11297 : cluster 
[INF] MDS health message cleared (mds.0): Client XXXXXXX:XXXXXXX failing to 
respond to capability release
2018-05-29 17:09:50.964767 mon.mon003 mon.0 11298 : cluster 
[INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec
2018-05-29 17:09:51.015071 mon.mon003 mon.0 11299 : cluster 
[INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to 
respond to capability release)
2018-05-29 17:09:51.015154 mon.mon003 mon.0 11300 : cluster 
[INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2018-05-29 17:09:51.015191 mon.mon003 mon.0 11301 : cluster 
[INF] Cluster is now healthy
2018-05-29 17:14:26.178321 mds.mon002 mds.34884 8 
: cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
0x100003909d0 but session next is 0x10000388af6
2018-05-29 17:14:26.178393 mds.mon002 mds.34884 9 
: cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
0x100003909d1 but session next is 0x10000388af6
2018-05-29 18:00:00.000132 mon.mon003 mon.0 11304 : cluster 
[INF] overall HEALTH_OK

Thanks for looking into it!


>> Cheers,
>> Oliver
>> On 30.05.2018 at 03:25, Yan, Zheng wrote:
>>> I could be
>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu <> wrote:
>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>> top level), and there would be "Client xxx failing to respond to capability
>>>> release" health warning every single time that happens.
>>>> ________________________________
>>>> From: ceph-users <> on behalf of Yan, Zheng <>
>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>> To: Oliver Freyermuth
>>>> Cc: Ceph Users; Peter Wienemann
>>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>>>> authpin local pins"
>>>> Single or multiple active MDS? Were there any "Client xxx failing to
>>>> respond to capability release" health warnings?
>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>> <> wrote:
>>>>> Dear Cephalopodians,
>>>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>>>> behind, for over 2 days.
>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>>> "currently failed to authpin local pins". Metadata pool usage grew by 10 GB
>>>>> in those 2 days.
>>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>>> metadata usage is down again, and all stuck requests were processed quickly.
>>>>> Is there any idea on what could cause something like that? On the client,
>>>>> there was no CPU load, but many processes were waiting for cephfs to respond.
>>>>> Syslog did not yield anything. It only affected one user and his user
>>>>> directory.
>>>>> If there are no ideas: How can I collect good debug information in case
>>>>> this happens again?
>>>>> Cheers,
>>>>>         Oliver
>>>>> _______________________________________________
>>>>> ceph-users mailing list
