And to provide some further updates, I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4. Unclear why this would improve things, but it at least got me running again.
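In case it is useful to anyone else, the sort of in-place bump that gets a single host's OSDs onto 14.2.4 is nothing more than a package upgrade and an OSD restart. A rough sketch only, assuming apt-based hosts pulling packages from the usual Ceph repo and systemd-managed OSDs (adjust for your distro):

    # Sketch only -- assumes apt packages and systemd units; adapt paths/commands as needed.
    sudo apt update
    sudo apt install --only-upgrade ceph-osd ceph-common   # pull the 14.2.4 packages on the one host
    sudo systemctl restart ceph-osd.target                 # restart that host's OSDs on the new binaries
    ceph versions                                          # confirm the mixed 14.2.2/14.2.4 picture below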
> $ ceph versions
> {
>     "mon": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 199,
>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>     },
>     "mds": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
>     },
>     "overall": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 206,
>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>     }
> }

Reed

> On Sep 18, 2019, at 10:12 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> To answer the question of whether it is safe to disable the module and delete the pool: the answer is no.
>
> After disabling the diskprediction_local module, I then proceeded to remove the pool created by the module, device_health_metrics.
>
> This is where things went south quickly.
>
> Ceph health showed:
>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for oid SAMSUNG_$MODEL_$SERIAL
>
> That module apparently can't be disabled:
>> $ ceph mgr module disable devicehealth
>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
>
> Then 5 OSDs went down, crashing with:
>> -12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
>> -11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset con 0x564078474d00 session 0x56407878ea00
>> -10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac1b00
>> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3180
>> -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3600
>> -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
>> -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction transaction dump:
>> {
>>     "ops": [
>>         {
>>             "op_num": 0,
>>             "op_name": "remove",
>>             "collection": "30.0_head",
>>             "oid": "#30:00000000::::head#"
>>         },
>>         {
>>             "op_num": 1,
>>             "op_name": "rmcoll",
>>             "collection": "30.0_head"
>>         }
>>     ]
>> }
>> -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>> -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/GetMissing
>> -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0 0.000000
>> -2> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/WaitUpThru
>> -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 10:53:00.312755
>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: ceph_abort_msg("unexpected error")
>
> Of the 5 OSDs now down, 3 of them are the acting OSDs for pg 30.0 (the PG that has now been erased):
>
>> OSD_DOWN 5 osds down
>>     osd.5 is down
>>     osd.12 is down
>>     osd.128 is down
>>     osd.183 is down
>>     osd.190 is down
>
> But osd.190 and osd.5 were never acting members for that PG, so I have no clue why they are implicated.
>
> I re-enabled the module, and that cleared the health error about devicehealth, which doesn't matter to me, but it also didn't solve the issue of the down OSDs, so I am hoping there is a way to mark this PG as lost, or something like that, so as to not have to rebuild the entire OSD.
>
> Any help is appreciated.
>
> Reed
>
>> On Sep 12, 2019, at 5:22 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>>
>> Trying to narrow down a strange issue where the single PG for the device_health_metrics pool, which was created when I enabled the 'diskprediction_local' module in the ceph-mgr, keeps coming up inconsistent, but I never see any inconsistent objects in the PG.
>>
>>> $ ceph health detail
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>     pg 30.0 is active+clean+inconsistent, acting [128,12,183]
>>
>>> $ rados list-inconsistent-pg device_health_metrics
>>> ["30.0"]
>>
>>> $ rados list-inconsistent-obj 30.0 | jq
>>> {
>>>   "epoch": 172979,
>>>   "inconsistents": []
>>> }
>>
>> This is the most recent log message from osd.128, from the last deep scrub:
>>> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>>
>> Here is a pg query on the one PG: https://pastebin.com/bnzVKd6t
>>
>> The data I have collected hasn't been useful at all, and I don't particularly care if I lose it, so would it be feasible (i.e. no bad effects) to just disable the disk prediction module, delete the pool, and then start over, letting it create a new pool for itself?
>>
>> Thanks,
>>
>> Reed
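For anyone who lands on this thread with the same crash loop: one avenue that looks applicable (untested here, since the 14.2.4 restart got my OSDs booting before I tried anything invasive) would be to drop the leftover pg 30.0 shard that the OSD trips over while replaying the pool deletion, using ceph-objectstore-tool and keeping an export as a safety net. Strictly a sketch, assuming the OSD is stopped first and default data paths; osd.12 is just an example ID:

    # Untested sketch -- stop the OSD before touching its store; adjust the OSD id and paths.
    sudo systemctl stop ceph-osd@12

    # Keep a copy of the stray PG shard before removing anything.
    sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
         --op export --pgid 30.0 --file /root/pg30.0.osd12.export

    # Remove the leftover 30.0 collection that the rmcoll transaction chokes on.
    sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
         --op remove --pgid 30.0 --force

    sudo systemctl start ceph-osd@12

Whether that actually clears the abort in _txc_add_transaction I can't say; it is just the direction I would probe before rebuilding an entire OSD.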