Update: I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146. Make sure they are not relevant in your case before proceeding to operations that modify on-disk data.
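For anyone hitting this: "journal inspect" is read-only, and the journal contents can be exported before any destructive step. A minimal sketch, assuming a single active MDS rank and /root as scratch space (the backup file names are just examples):

# cephfs-journal-tool --journal=purge_queue journal inspect
# cephfs-journal-tool --journal=purge_queue journal export /root/purge_queue.bin
# cephfs-journal-tool --journal=mdlog journal export /root/mdlog.bin

Keeping a rados-level copy of the metadata pool (or at least of the 500.* purge queue objects seen in the errors below) is an additional safety net before running "journal reset" or cephfs-data-scan.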
> On 6.10.2018, at 03:17, Sergey Malinin <h...@newmail.com> wrote:
>
> I ended up rescanning the entire fs using the alternate metadata pool approach as
> in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> The process has not completed yet because during the recovery our cluster
> encountered another problem with OSDs that I got fixed yesterday (thanks to
> Igor Fedotov @ SUSE).
> The first stage (scan_extents) completed in 84 hours (120M objects in the data
> pool on 8 HDD OSDs on 4 hosts). The second (scan_inodes) was interrupted by
> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times
> faster than the extents scan.
> As to the root cause -- in my case I recall that during the upgrade I had forgotten
> to restart 3 OSDs, one of which was holding metadata pool contents, before
> restarting the MDS daemons, and that seems to have had an impact on the MDS journal
> corruption, because when I restarted those OSDs, the MDS was able to start up but
> soon failed, throwing lots of 'loaded dup inode' errors.
>
>
>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky <alfrenov...@gmail.com> wrote:
>>
>> Same problem...
>>
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
>> Overall journal integrity: DAMAGED
>> Objects missing:
>>   0x16c
>> Corrupt regions:
>>   0x5b000000-ffffffffffffffff
>>
>> Just after the upgrade to 13.2.2.
>>
>> Did you fix it?
>>
>>
>> On 26/09/18 13:05, Sergey Malinin wrote:
>>> Hello,
>>> I followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>> After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are
>>> damaged. Resetting the purge_queue does not seem to work well, as the journal still
>>> appears to be damaged.
>>> Can anybody help?
>>>
>>> mds log:
>>>
>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
>>> <skip>
>>>    -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
>>>    -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
>>>    -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
>>>    -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
>>>    -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
>>> <skip>
>>>     -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
>>>     -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
>>>     -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
>>>      0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x322ec65d9-ffffffffffffffff
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal reset
>>> old journal was 13470819801~8463
>>> new journal start will be 13472104448 (1276184 bytes past old end)
>>> writing journal head
>>> done
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0xc8c
>>> Corrupt regions:
>>>   0x323000000-ffffffffffffffff
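For reference, the alternate-metadata-pool recovery described above boils down to roughly the following sequence. This is a condensed sketch of the mimic disaster-recovery document linked in the quote; "recovery" and "recovery-fs" are placeholder pool/filesystem names, <pg-num>, <original fs name> and <original data pool> must be filled in for your cluster, and the exact options differ between releases, so treat the docs for your version as authoritative:

# ceph fs flag set enable_multiple true --yes-i-really-mean-it
# ceph osd pool create recovery <pg-num> replicated
# ceph fs new recovery-fs recovery <original data pool> --allow-dangerous-metadata-overlay
# cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
# ceph fs reset recovery-fs --yes-i-really-mean-it
# cephfs-table-tool recovery-fs:all reset session
# cephfs-table-tool recovery-fs:all reset snap
# cephfs-table-tool recovery-fs:all reset inode
# cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original fs name> <original data pool>
# cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original fs name> --force-corrupt --force-init <original data pool>
# cephfs-data-scan scan_links --filesystem recovery-fs

The scan_extents and scan_inodes phases can be split across several parallel workers (the --worker_n/--worker_m options described in the same document), which helps with multi-day runtimes like the 84 hours reported above.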
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com