We are (still) running filestore on these OSDs.

Regards
Lukasz

> Hi Igor,

> Thank you for your input, I will try your suggestion with
> ceph-objectstore-tool.

> But for now it looks like the main problem is this:

> 2019-07-09 09:29:25.410839 7f5e4b64f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e20e87700' had timed out after 15
> 2019-07-09 09:29:25.410842 7f5e4b64f700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f5e41651700' had timed out after 60

> after these (numerous) log entries the OSD becomes unresponsive to the
> monitors and gets marked down for a few seconds/minutes; sometimes it hits
> the suicide timeout:

> 2019-07-09 09:29:32.271361 7f5e3d649700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.118 down, but it is still running
> 2019-07-09 09:29:32.271381 7f5e3d649700  0 log_channel(cluster) log [DBG] : map e71903 wrongly marked me down at e71902
> 2019-07-09 09:29:32.271393 7f5e3d649700  1 osd.118 71903 start_waiting_for_healthy
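> (For what it is worth, the "15" and "60" in the heartbeat messages above
> seem to be the defaults of osd_op_thread_timeout and
> filestore_op_thread_timeout. As a temporary mitigation only - it hides the
> symptom, it does not fix whatever makes the threads stall - they can be
> checked and raised on the running OSD, for example:
>
> ceph daemon osd.118 config get osd_op_thread_timeout
> ceph daemon osd.118 config get filestore_op_thread_timeout
> ceph tell osd.118 injectargs '--osd_op_thread_timeout 60 --filestore_op_thread_timeout 180'
>
> The values above are just examples, not a recommendation.)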


> Maybe you (or any other Cepher) know how to deal with this problem?

> Regards
> Lukasz
>> Hi Lukasz,

>> I've seen something like that - slow requests and relevant OSD reboots
>> on suicide timeout at least twice with two different clusters. The root
>> cause was slow omap listing for some objects which had started to happen
>> after massive removals from RocksDB.

>> To verify if this is the case you can create a script that uses 
>> ceph-objectstore-tool to list objects for the specific pg and then 
>> list-omap for every returned object.

>> If omap listing for some object(s) takes too long (minutes in my case) -
>> you're facing the same issue.
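
>> A rough, untested sketch of such a script (it assumes the OSD is stopped
>> first, a filestore data path, and osd.118 / pg 20.2 from your logs):
>>
>> #!/bin/bash
>> # Time list-omap for every object of one PG and print the slowest ones.
>> DATA=/var/lib/ceph/osd/ceph-118
>> PGID=20.2
>> ceph-objectstore-tool --data-path "$DATA" --pgid "$PGID" --op list |
>> while read -r obj; do
>>     t0=$(date +%s)
>>     ceph-objectstore-tool --data-path "$DATA" "$obj" list-omap >/dev/null
>>     echo "$(( $(date +%s) - t0 ))s  $obj"
>> done | sort -rn | head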

>> A PR that implements automatic lookup for such "slow" objects in 
>> ceph-objectstore-tool is under review: 
>> https://github.com/ceph/ceph/pull/27985


>> The only known workaround for existing OSDs so far is manual DB 
>> compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes
>> the issue for newly deployed OSDs.
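
>> (On a filestore OSD the omap DB normally lives under
>> /var/lib/ceph/osd/ceph-<id>/current/omap, so an offline compaction could
>> look roughly like this - untested, stop the OSD first, and use "rocksdb"
>> instead of "leveldb" if the omap backend was switched; ceph-kvstore-tool
>> may be shipped in the ceph-test package:
>>
>> systemctl stop ceph-osd@118
>> ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-118/current/omap compact
>> systemctl start ceph-osd@118
>> )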



>> Relevant upstream tickets are:

>> http://tracker.ceph.com/issues/36482

>> http://tracker.ceph.com/issues/40557


>> Hope this helps,

>> Igor

>> On 7/3/2019 9:54 AM, Luk wrote:
>>> Hello,
>>>
>>> I have strange problem with scrubbing.
>>>
>>> When scrubbing starts on a PG which belongs to the default.rgw.buckets.index
>>> pool, I can see that this OSD is very busy (see attachment) and it starts
>>> showing many slow requests; after the scrubbing of this PG stops, the slow
>>> requests stop immediately.
>>>
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub /var/log/ceph/ceph-osd.118.log.1.gz | grep -w 20.2
>>> 2019-07-03 00:14:57.496308 7fd4c7a09700  0 log_channel(cluster) log [DBG] : 20.2 deep-scrub starts
>>> 2019-07-03 05:36:13.274637 7fd4ca20e700  0 log_channel(cluster) log [DBG] : 20.2 deep-scrub ok
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>>
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
>>> 636K    20.2_head
>>> 0       20.2_TEMP
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
>>> 4125
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
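>>>
>>> (The PG directory looks tiny because on filestore the bucket index
>>> objects keep their entries in omap, not in the files themselves; the
>>> omap DB sits next to the PG directories and its size can be checked with
>>> something like:
>>>
>>> du -sh /var/lib/ceph/osd/ceph-118/current/omap
>>> )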
>>>
>>> and on mon:
>>>
>>> 2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>> 2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>> 2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>>
>>> [...]
>>>
>>> 2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are blocked > 32 sec. Implicated osds 118)
>>> 2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : cluster [INF] Cluster is now healthy
>>>
>>> ceph version 12.2.9
>>>
>>> Could this be related to the following, taken from
>>> https://ceph.com/releases/v12-2-11-luminous-released/ ?
>>>
>>> "
>>> There have been fixes to RGW dynamic and manual resharding, which no longer
>>> leaves behind stale bucket instances to be removed manually. For finding and
>>> cleaning up older instances from a reshard a radosgw-admin command reshard
>>> stale-instances list and reshard stale-instances rm should do the necessary
>>> cleanup.
>>> "
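>>>
>>> If so, I guess the cleanup from that note would be run (only after
>>> upgrading, since those subcommands first appeared in v12.2.11) roughly
>>> like this:
>>>
>>> radosgw-admin reshard stale-instances list
>>> radosgw-admin reshard stale-instances rm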
>>>
>>>

-- 
Regards,
 Luk

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
