And to provide a further update:

I was able to get the crashed OSDs to boot by updating from 14.2.2 to 14.2.4.
It is unclear why this would improve things, but it at least got me running again.
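
For reference, on each host with a crashed OSD the upgrade was roughly the following (from memory, and assuming Debian/Ubuntu packages and systemd units, so treat the exact commands as approximate):

> $ apt-get update && apt-get install --only-upgrade ceph-osd
> $ systemctl restart ceph-osd@5    # and likewise for osd.12, 128, 183, 190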

> $ ceph versions
> {
>     "mon": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 199,
>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>     },
>     "mds": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
>     },
>     "overall": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 206,
>         "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
>     }
> }


Reed

> On Sep 18, 2019, at 10:12 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> To answer the question of whether it is safe to disable the module and delete 
> the pool: the answer is no.
> 
> After disabling the diskprediction_local module, I proceeded to remove the 
> pool the module had created, device_health_metrics.
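> 
> For reference, the commands were roughly the following (the pool name has to 
> be given twice for the delete, and mon_allow_pool_delete has to be set):
>> $ ceph mgr module disable diskprediction_local
>> $ ceph osd pool delete device_health_metrics device_health_metrics \
>>     --yes-i-really-really-mean-it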
> 
> This is where things went south quickly.
> 
> Ceph health showed: 
>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
>> oid SAMSUNG_$MODEL_$SERIAL
> 
> That module apparently can't be disabled:
>> $ ceph mgr module disable devicehealth
>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
> 
> Then 5 OSDs went down, crashing with:
>>    -12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
>> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
>> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
>> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 
>> pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 
>> mbc={}] enter Started/Primary/Peering/WaitUpThru
>>    -11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset 
>> con 0x564078474d00 session 0x56407878ea00
>>    -10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac1b00
>>     -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac3180
>>     -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac3600
>>     -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 
>> bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) 
>> Directory not empty not handled on operation 21 (op 1, counting from 0)
>>     -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction 
>> transaction dump:
>> {
>>     "ops": [
>>         {
>>             "op_num": 0,
>>             "op_name": "remove",
>>             "collection": "30.0_head",
>>             "oid": "#30:00000000::::head#"
>>         },
>>         {
>>             "op_num": 1,
>>             "op_name": "rmcoll",
>>             "collection": "30.0_head"
>>         }
>>     ]
>> }
>>     -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>>     -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] enter Started/Primary/Peering/GetMissing
>>     -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0 0.000000
>>     -2> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>     -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 
>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In 
>> function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>> ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 
>> 10:53:00.312755
>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 
>> 11208: ceph_abort_msg("unexpected error")
> 
> 
> Of the 5 OSDs now down, 3 of them are the serving OSDs for pg 30.0 (which has 
> now been erased):
> 
>> OSD_DOWN 5 osds down
>>     osd.5 is down
>>     osd.12 is down
>>     osd.128 is down
>>     osd.183 is down
>>     osd.190 is down
> 
> 
> But 190 and 5 were never acting members for that PG, so I have no clue why 
> they are implicated.
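> 
> For reference, the [128,12,183] acting set is from the ceph health detail 
> output in my original message below; the current mapping can also be 
> double-checked with something like:
>> $ ceph pg map 30.0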
> 
> 
> I re-enabled the module, which cleared the health error about devicehealth 
> (not that I care about that), but it did not bring the down OSDs back. So I am 
> hoping there is a way to mark this PG as lost, or something along those lines, 
> so that I do not have to rebuild the affected OSDs from scratch.
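> 
> Purely to illustrate the kind of thing I am hoping for (not something I have 
> tried, so please correct me if this is the wrong direction): recreating the 
> now-empty PG,
>> $ ceph osd force-create-pg 30.0 --yes-i-really-mean-it
> or removing the leftover PG shard from a stopped OSD offline with 
> ceph-objectstore-tool,
>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 --pgid 30.0 \
>>     --op remove --force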
> 
> Any help is appreciated.
> 
> Reed
> 
>> On Sep 12, 2019, at 5:22 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> Trying to narrow down a strange issue with the single PG of the 
>> device_health_metrics pool, which was created when I enabled the 
>> 'diskprediction_local' module in the ceph-mgr: the PG is flagged 
>> inconsistent, but I never see any inconsistent objects in it.
>> 
>>> $ ceph health detail
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>     pg 30.0 is active+clean+inconsistent, acting [128,12,183]
>> 
>>> $ rados list-inconsistent-pg device_health_metrics
>>> ["30.0"]
>> 
>>> $ rados list-inconsistent-obj 30.0 | jq
>>> {
>>>   "epoch": 172979,
>>>   "inconsistents": []
>>> }
>> 
>> This is the most recent log message from osd.128, from the last deep scrub:
>>> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 
>>> 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 
>>> dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 
>>> bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
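>> 
>> (For a plain stat mismatch the usual next step would presumably be a repair, 
>> e.g.:
>>> $ ceph pg repair 30.0
>> but with list-inconsistent-obj showing nothing, it is not obvious that would 
>> help.)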
>> 
>> Here is a pg query on the one PG:
>> https://pastebin.com/bnzVKd6t
>> 
>> The data I have collected hasn't been useful at all, and I don't 
>> particularly care if I lose it. So would it be feasible (i.e., no bad 
>> effects) to just disable the disk prediction module, delete the pool, and 
>> then start over, letting the module create a new pool for itself?
>> 
>> Thanks,
>> 
>> Reed
> 
> 
