Hi Igor,

Thank you for your input; I will try your suggestion with ceph-objectstore-tool.
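
Here is roughly what I plan to run (just a rough sketch: osd.118 and pg 20.2
are the ones from this thread, the paths are for my FileStore setup, and the
OSD has to be stopped first):

#!/bin/bash
# rough sketch only: times the omap listing of every object in one PG;
# run with the OSD stopped
OSD=/var/lib/ceph/osd/ceph-118
JOURNAL=$OSD/journal    # journal location is a guess for our setup, adjust if needed
PG=20.2

ceph-objectstore-tool --data-path "$OSD" --journal-path "$JOURNAL" \
    --pgid "$PG" --op list |
while read -r obj; do
    echo "=== $obj"
    # objects where this takes minutes rather than seconds are the suspects
    time ceph-objectstore-tool --data-path "$OSD" --journal-path "$JOURNAL" \
        "$obj" list-omap >/dev/null
done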

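If some objects do show slow omap listings, I will also try the manual DB
compaction you mention below. My understanding is that for a FileStore OSD
this means an offline compaction of the omap store, something like the
following (rocksdb vs leveldb depends on filestore_omap_backend, so this is
a guess for my setup):

systemctl stop ceph-osd@118
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-118/current/omap compact
systemctl start ceph-osd@118
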
But for now it looks like the main problem is this:

2019-07-09 09:29:25.410839 7f5e4b64f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e20e87700' had timed out after 15
2019-07-09 09:29:25.410842 7f5e4b64f700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f5e41651700' had timed out after 60

After (a lot of) these logs the OSD becomes unresponsive to the monitors and
gets marked down for a few seconds/minutes; sometimes it hits the suicide
timeout:

2019-07-09 09:29:32.271361 7f5e3d649700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.118 down, but it is still running
2019-07-09 09:29:32.271381 7f5e3d649700  0 log_channel(cluster) log [DBG] : map e71903 wrongly marked me down at e71902
2019-07-09 09:29:32.271393 7f5e3d649700  1 osd.118 71903 start_waiting_for_healthy

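In the meantime, to reduce the flapping while this gets debugged, I am
thinking of temporarily raising the suicide timeouts on this OSD. This is
only a stopgap, not a fix; the values below are arbitrary, and injectargs
changes do not survive an OSD restart:

ceph tell osd.118 injectargs '--osd_op_thread_suicide_timeout=600 --filestore_op_thread_suicide_timeout=600'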

Maybe you (or any other Ceph user) know how to deal with this problem?

Regards
Lukasz
> Hi Lukasz,

> I've seen something like that - slow requests and relevant OSD reboots
> on suicide timeout at least twice with two different clusters. The root
> cause was slow omap listing for some objects which had started to happen
> after massive removals from RocksDB.

> To verify if this is the case you can create a script that uses 
> ceph-objectstore-tool to list objects for the specific pg and then 
> list-omap for every returned object.

> If omap listing for some object(s) takes too long (minutes in my case) -
> you're facing the same issue.

> PR that implements automatic lookup for such "slow" objects in 
> ceph-objectstore-tool is under review: 
> https://github.com/ceph/ceph/pull/27985


> The only known workaround for existing OSDs so far is manual DB 
> compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes
> the issue for newly deployed OSDs.



> Relevant upstream tickets are:

> http://tracker.ceph.com/issues/36482

> http://tracker.ceph.com/issues/40557


> Hope this helps,

> Igor

> On 7/3/2019 9:54 AM, Luk wrote:
>> Hello,
>>
>> I have a strange problem with scrubbing.
>>
>> When scrubbing starts on a PG that belongs to the default.rgw.buckets.index
>> pool, I can see that this OSD is very busy (see attachment) and starts
>> showing many slow requests; as soon as scrubbing of this PG stops, the slow
>> requests stop immediately.
>>
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub /var/log/ceph/ceph-osd.118.log.1.gz | grep -w 20.2
>> 2019-07-03 00:14:57.496308 7fd4c7a09700  0 log_channel(cluster) log [DBG] : 20.2 deep-scrub starts
>> 2019-07-03 05:36:13.274637 7fd4ca20e700  0 log_channel(cluster) log [DBG] : 20.2 deep-scrub ok
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
>> 636K    20.2_head
>> 0       20.2_TEMP
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
>> 4125
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>
>> and on mon:
>>
>> 2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>> 2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>> 2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>
>> [...]
>>
>> 2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are blocked > 32 sec. Implicated osds 118)
>> 2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : cluster [INF] Cluster is now healthy
>>
>> ceph version 12.2.9
>>
>> Could it be related to this (taken from
>> https://ceph.com/releases/v12-2-11-luminous-released/)?
>>
>> "
>> There have been fixes to RGW dynamic and manual resharding, which no longer
>> leaves behind stale bucket instances to be removed manually. For finding and
>> cleaning up older instances from a reshard a radosgw-admin command reshard
>> stale-instances list and reshard stale-instances rm should do the necessary
>> cleanup.
>> "
>>
>>



-- 
Regards,
 Luk

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
