Hi Igor,

Thank you for your input, I will try your suggestion with ceph-objectstore-tool; a rough sketch of what I plan to run is below.
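Something along these lines is what I have in mind - just a rough, untested sketch, assuming osd.118 is stopped first, the default FileStore data/journal paths, and PG 20.2 as the suspect PG (the timing loop is my own addition, not part of ceph-objectstore-tool):

#!/bin/bash
# Rough sketch (untested): time "list-omap" for every object in one PG
# to spot objects whose omap listing is unusually slow.
# Assumes the OSD is stopped and uses the default FileStore paths.
OSD_PATH=/var/lib/ceph/osd/ceph-118
JOURNAL=$OSD_PATH/journal
PGID=20.2

ceph-objectstore-tool --data-path "$OSD_PATH" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op list |
while read -r obj; do
    start=$(date +%s)
    ceph-objectstore-tool --data-path "$OSD_PATH" --journal-path "$JOURNAL" \
        "$obj" list-omap > /dev/null
    echo "$(( $(date +%s) - start ))s $obj"
done | sort -rn | head -20

If any object takes minutes instead of seconds, I guess that confirms the slow-omap case you describe.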
But for now it looks like the main problem is this:

2019-07-09 09:29:25.410839 7f5e4b64f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e20e87700' had timed out after 15
2019-07-09 09:29:25.410842 7f5e4b64f700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f5e41651700' had timed out after 60

After this (there are a lot of these log entries) the OSD becomes unresponsive to the monitors and is marked down for a few seconds/minutes; sometimes it hits the suicide timeout:

2019-07-09 09:29:32.271361 7f5e3d649700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.118 down, but it is still running
2019-07-09 09:29:32.271381 7f5e3d649700 0 log_channel(cluster) log [DBG] : map e71903 wrongly marked me down at e71902
2019-07-09 09:29:32.271393 7f5e3d649700 1 osd.118 71903 start_waiting_for_healthy

Maybe you (or any other cepher) know how to deal with this problem?

Regards,
Lukasz

> Hi Lukasz,
>
> I've seen something like that - slow requests and relevant OSD reboots
> on suicide timeout at least twice with two different clusters. The root
> cause was slow omap listing for some objects which had started to happen
> after massive removals from RocksDB.
>
> To verify if this is the case you can create a script that uses
> ceph-objectstore-tool to list objects for the specific pg and then
> list-omap for every returned object.
> If omap listing for some object(s) takes too long (minutes in my case) -
> you're facing the same issue.
>
> PR that implements automatic lookup for such "slow" objects in
> ceph-objectstore-tool is under review:
> https://github.com/ceph/ceph/pull/27985
>
> The only known workaround for existing OSDs so far is manual DB
> compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes
> the issue for newly deployed OSDs.
>
> Relevant upstream tickets are:
> http://tracker.ceph.com/issues/36482
> http://tracker.ceph.com/issues/40557
>
> Hope this helps,
> Igor
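Regarding the manual DB compaction workaround mentioned above: what I plan to try is an offline compaction of the OSD's omap store with ceph-kvstore-tool. This is my own reading of the workaround - the rocksdb backend and the omap path below are assumptions based on our FileStore layout - so please correct me if you meant something else:

# Assumption: FileStore OSD whose omap backend is RocksDB, with the omap
# store at the default location; run this only while the OSD is stopped.
systemctl stop ceph-osd@118
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-118/current/omap compact
systemctl start ceph-osd@118

If the omap backend turns out to be leveldb rather than rocksdb, the first argument to ceph-kvstore-tool would have to change accordingly.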
> On 7/3/2019 9:54 AM, Luk wrote:
>> Hello,
>>
>> I have strange problem with scrubbing.
>>
>> When scrubbing starts on PG which belong to default.rgw.buckets.index
>> pool, I can see that this OSD is very busy (see attachment), and starts
>> showing many slow request, after the scrubbing of this PG stops, slow
>> requests stops immediately.
>>
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub /var/log/ceph/ceph-osd.118.log.1.gz | grep -w 20.2
>> 2019-07-03 00:14:57.496308 7fd4c7a09700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub starts
>> 2019-07-03 05:36:13.274637 7fd4ca20e700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub ok
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
>> 636K    20.2_head
>> 0       20.2_TEMP
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
>> 4125
>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>
>> and on mon:
>>
>> 2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>> 2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>> 2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>
>> [...]
>>
>> 2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are blocked > 32 sec. Implicated osds 118)
>> 2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : cluster [INF] Cluster is now healthy
>>
>> ceph version 12.2.9
>>
>> it might be related to this (taken from:
>> https://ceph.com/releases/v12-2-11-luminous-released/) ? :
>>
>> "
>> There have been fixes to RGW dynamic and manual resharding, which no longer
>> leaves behind stale bucket instances to be removed manually. For finding and
>> cleaning up older instances from a reshard a radosgw-admin command reshard
>> stale-instances list and reshard stale-instances rm should do the necessary
>> cleanup.
>> "

--
Regards,
Luk

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com