We still have filestore on these OSDs.

Regards
Lukasz
> Hi Igor,
>
> Thank you for your input, I will try your suggestion with
> ceph-objectstore-tool.
>
> But for now it looks like the main problem is this:
>
> 2019-07-09 09:29:25.410839 7f5e4b64f700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f5e20e87700' had timed out after 15
> 2019-07-09 09:29:25.410842 7f5e4b64f700 1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f5e41651700' had timed out after 60
>
> After this (and a lot of log output) the OSD becomes unresponsive to the
> monitors and is marked down for a few seconds/minutes; sometimes it hits
> the suicide timeout:
>
> 2019-07-09 09:29:32.271361 7f5e3d649700 0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.118 down, but it is still running
> 2019-07-09 09:29:32.271381 7f5e3d649700 0 log_channel(cluster) log
> [DBG] : map e71903 wrongly marked me down at e71902
> 2019-07-09 09:29:32.271393 7f5e3d649700 1 osd.118 71903
> start_waiting_for_healthy
>
> Maybe you (or any other cepher) know how to deal with this problem?
>
> Regards
> Lukasz
>
>> Hi Lukasz,
>>
>> I've seen something like that - slow requests and the relevant OSD
>> restarting on suicide timeout - at least twice, with two different
>> clusters. The root cause was slow omap listing for some objects, which
>> had started to happen after massive removals from RocksDB.
>>
>> To verify whether this is the case, you can create a script that uses
>> ceph-objectstore-tool to list the objects for the specific PG and then
>> runs list-omap for every returned object.
>>
>> If omap listing for some object(s) takes too long (minutes in my case),
>> you're facing the same issue.
>>
>> A PR that implements automatic lookup for such "slow" objects in
>> ceph-objectstore-tool is under review:
>> https://github.com/ceph/ceph/pull/27985
>>
>> The only known workaround for existing OSDs so far is manual DB
>> compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes
>> the issue for newly deployed OSDs.
>>
>> Relevant upstream tickets are:
>> http://tracker.ceph.com/issues/36482
>> http://tracker.ceph.com/issues/40557
>>
>> Hope this helps,
>> Igor
>>
>> On 7/3/2019 9:54 AM, Luk wrote:
>>> Hello,
>>>
>>> I have a strange problem with scrubbing.
>>>
>>> When scrubbing starts on a PG which belongs to the
>>> default.rgw.buckets.index pool, I can see that this OSD is very busy
>>> (see attachment) and starts showing many slow requests; after the
>>> scrubbing of this PG stops, the slow requests stop immediately.
>>>
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub /var/log/ceph/ceph-osd.118.log.1.gz | grep -w 20.2
>>> 2019-07-03 00:14:57.496308 7fd4c7a09700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub starts
>>> 2019-07-03 05:36:13.274637 7fd4ca20e700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub ok
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>>
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
>>> 636K    20.2_head
>>> 0       20.2_TEMP
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
>>> 4125
>>> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>>>
>>> and on the mon:
>>>
>>> 2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>> 2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>> 2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW)
>>>
>>> [...]
>>>
>>> 2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are blocked > 32 sec. Implicated osds 118)
>>> 2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : cluster [INF] Cluster is now healthy
>>>
>>> ceph version 12.2.9
>>>
>>> Might it be related to this (taken from
>>> https://ceph.com/releases/v12-2-11-luminous-released/)?
>>>
>>> "There have been fixes to RGW dynamic and manual resharding, which no
>>> longer leaves behind stale bucket instances to be removed manually. For
>>> finding and cleaning up older instances from a reshard a radosgw-admin
>>> command reshard stale-instances list and reshard stale-instances rm
>>> should do the necessary cleanup."

--
Regards,
Luk

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
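
A rough sketch of the per-object check Igor describes above - list every
object in the affected PG with ceph-objectstore-tool, run list-omap on each,
and report how long it took - might look like the script below. It assumes
osd.118 is stopped, the filestore data path is /var/lib/ceph/osd/ceph-118 and
the PG in question is 20.2; filestore OSDs may also need --journal-path, so
adjust paths and IDs for your environment.

#!/bin/bash
# List all objects in one PG and time 'list-omap' on each of them.
# Assumptions: the OSD is stopped; data path and PG id match your setup.
OSD_PATH=/var/lib/ceph/osd/ceph-118
PGID=20.2

ceph-objectstore-tool --data-path "$OSD_PATH" --pgid "$PGID" --op list |
while read -r obj; do
    start=$(date +%s)
    ceph-objectstore-tool --data-path "$OSD_PATH" "$obj" list-omap >/dev/null
    echo "$(( $(date +%s) - start ))s $obj"
done | sort -rn | head -20    # slowest objects first

If some objects take minutes here, the workaround Igor mentions (manual DB
compaction) can be applied offline; for a filestore OSD the omap database
normally lives under <data-path>/current/omap and can be compacted with
something like "ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-118/current/omap compact"
(use rocksdb instead of leveldb if that is the omap backend on your
deployment).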