And to come full circle: after this whole saga, I now have a scrub error on the new device_health_metrics pool/PG in what looks to be exactly the same way. So I am at a loss as to what it is that I am doing incorrectly, and needless to say a standing scrub error does not make the monitoring suite very happy.
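For what it's worth, this is what I am planning to try next against the new PG. I am assuming (have not verified) that a repair of a pure stat mismatch just makes the primary recount and rewrite the PG stats, so treat this as a sketch rather than a known fix:

$ ceph pg deep-scrub 33.0    # force a fresh deep scrub of the suspect PG
$ ceph pg repair 33.0        # ask the primary to repair it; for a stat mismatch this should only rewrite the stats
$ rados list-inconsistent-obj 33.0 | jq    # check whether any inconsistent objects are reported (so far the list is empty)

Current state, for reference: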
> $ ceph health detail
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 33.0 is active+clean+inconsistent, acting [12,138,15]
>
> $ rados list-inconsistent-pg device_health_metrics
> ["33.0"]
>
> $ rados list-inconsistent-obj 33.0 | jq
> {
>   "epoch": 176348,
>   "inconsistents": []
> }

I assume that this is the root cause:

> ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0 33:b08b92bd::::head [omap-set-vals] snapc 0=[] ondisk+write+known_if_redirected e176327) v8
> ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG] 33.0 scrub starts
> ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR] 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
> ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR] 33.0 scrub 1 errors

Nothing fancy set for the plugin:

> $ ceph config dump | grep device
> global   basic     device_failure_prediction_mode      local
> mgr      advanced  mgr/devicehealth/enable_monitoring  true
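If it would help narrow this down, my rough plan for cross-checking the object counts the scrub complains about (237 vs 238) is below. This is only a sketch: I am assuming the pool still has just the single PG, so pool-level counts should line up with pg 33.0, and the jq path is from memory.

$ ceph pg ls-by-pool device_health_metrics                    # object count the cluster reports for the PG
$ rados -p device_health_metrics ls | wc -l                   # count the objects by listing them directly
$ ceph pg 33.0 query | jq '.info.stats.stat_sum.num_objects'  # object count recorded in the primary's PG stats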
Reed

> On Sep 18, 2019, at 11:33 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> And to provide some further updates,
>
> I was able to get the OSDs to boot by updating from 14.2.2 to 14.2.4. Unclear why this would improve things, but it at least got me running again.
>
>> $ ceph versions
>> {
>>     "mon": {
>>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 199,
>>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>>     },
>>     "mds": {
>>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
>>     },
>>     "overall": {
>>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 206,
>>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>>     }
>> }
>
> Reed
>
>> On Sep 18, 2019, at 10:12 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>>
>> To answer the question of whether it is safe to disable the module and delete the pool: the answer is no.
>>
>> After disabling the diskprediction_local module, I then proceeded to remove the pool created by the module, device_health_metrics.
>>
>> This is where things went south quickly.
>>
>> Ceph health showed:
>>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for oid SAMSUNG_$MODEL_$SERIAL
>>
>> That module apparently can't be disabled:
>>> $ ceph mgr module disable devicehealth
>>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
>>
>> Then 5 OSDs went down, crashing with:
>>> -12> 2019-09-18 10:53:00.299 7f95940ac700 5 osd.5 pg_epoch: 176304 pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>> -11> 2019-09-18 10:53:00.303 7f959fd6f700 2 osd.5 176304 ms_handle_reset con 0x564078474d00 session 0x56407878ea00
>>> -10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac1b00
>>> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3180
>>> -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3600
>>> -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
>>> -6> 2019-09-18 10:53:00.307 7f95950ae700 0 _dump_transaction transaction dump:
>>> {
>>>     "ops": [
>>>         {
>>>             "op_num": 0,
>>>             "op_name": "remove",
>>>             "collection": "30.0_head",
>>>             "oid": "#30:00000000::::head#"
>>>         },
>>>         {
>>>             "op_num": 1,
>>>             "op_name": "rmcoll",
>>>             "collection": "30.0_head"
>>>         }
>>>     ]
>>> }
>>> -5> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>>> -4> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/GetMissing
>>> -3> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0 0.000000
>>> -2> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>> -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 10:53:00.312755
>>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: ceph_abort_msg("unexpected error")
>>
>> Of the 5 OSDs now down, 3 of them are the serving OSDs for pg 30.0 (which has now been erased):
>>
>>> OSD_DOWN 5 osds down
>>>     osd.5 is down
>>>     osd.12 is down
>>>     osd.128 is down
>>>     osd.183 is down
>>>     osd.190 is down
>>
>> But 190 and 5 were never acting members for that PG, so I have no clue why they are implicated.
>>
>> I re-enabled the module, and that cleared the health error about devicehealth, which doesn't matter much to me, but it didn't solve the issue of the down OSDs. So I am hoping there is a way to mark this PG as lost, or something like that, so as to not have to rebuild the entire OSD.
>>
>> Any help is appreciated.
>>
>> Reed
>>
>>> On Sep 12, 2019, at 5:22 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>>>
>>> Trying to narrow down a strange issue with the single PG for the device_health_metrics pool that was created when I enabled the 'diskprediction_local' module in the ceph-mgr: the PG is flagged inconsistent, but I never see any inconsistent objects in the PG.
>>>
>>>> $ ceph health detail
>>>> OSD_SCRUB_ERRORS 1 scrub errors
>>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>>     pg 30.0 is active+clean+inconsistent, acting [128,12,183]
>>>
>>>> $ rados list-inconsistent-pg device_health_metrics
>>>> ["30.0"]
>>>
>>>> $ rados list-inconsistent-obj 30.0 | jq
>>>> {
>>>>   "epoch": 172979,
>>>>   "inconsistents": []
>>>> }
>>>
>>> This is the log message from osd.128 during the most recent deep scrub:
>>>> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>>>
>>> Here is a pg query on the one PG: https://pastebin.com/bnzVKd6t
>>>
>>> The data I have collected hasn't been useful at all, and I don't particularly care if I lose it, so would it be feasible (i.e. no bad effects) to just disable the disk prediction module, delete the pool, and then start over and let it create a new pool for itself?
>>>
>>> Thanks,
>>>
>>> Reed