On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin <h...@newmail.com> wrote:
>
>
> > On 8.10.2018, at 12:37, Yan, Zheng <uker...@gmail.com> wrote:
> >
> > On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin <h...@newmail.com> wrote:
> >>
> >> What additional steps need to be taken in order to (try to) regain
> >> access to the fs, provided that I backed up the metadata pool, created
> >> an alternate metadata pool, and ran scan_extents, scan_links,
> >> scan_inodes, and a somewhat recursive scrub?
> >> After that I only mounted the fs read-only to back up the data.
> >> Would anything even work if I had the mds journal and purge queue
> >> truncated?
> >>
> >
> > Did you back up the whole metadata pool? Did you make any modifications
> > to the original metadata pool? If you did, what modifications?
>
> I backed up both the journal and purge queue and used cephfs-journal-tool
> to recover dentries, then reset the journal and purge queue on the
> original metadata pool.

You can try restoring the original journal and purge queue, then
downgrading the mds to 13.2.1. Journal object names are 20x.xxxxxxxx;
purge queue object names are 50x.xxxxxxxxx.
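
A rough sketch of that restore (untested; it assumes the metadata pool is
named 'metadata', the fs 'cephfs', and that your backup kept one file per
object -- adjust the names to your setup):

# rados -p metadata put 200.00000000 /backup/200.00000000
# rados -p metadata put 500.00000000 /backup/500.00000000
(...and so on for every 20x.* / 50x.* object that was saved)

Then, after downgrading the mds packages to 13.2.1, mark rank 0 repaired:

# ceph mds repaired cephfs:0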
> Before proceeding to alternate metadata pool recovery, I was able to
> start the MDS, but it soon failed throwing lots of 'loaded dup inode'
> errors; I'm not sure whether that involved changing anything in the pool.
> I have left the original metadata pool untouched since then.
>
> >
> > Yan, Zheng
> >
> >>
> >>> On 8.10.2018, at 05:15, Yan, Zheng <uker...@gmail.com> wrote:
> >>>
> >>> Sorry, this is caused by a wrong backport. Downgrading the mds to
> >>> 13.2.1 and marking the mds repaired can resolve this.
> >>>
> >>> Yan, Zheng
> >>> On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin <h...@newmail.com> wrote:
> >>>>
> >>>> Update:
> >>>> I discovered http://tracker.ceph.com/issues/24236 and
> >>>> https://github.com/ceph/ceph/pull/22146
> >>>> Make sure that it is not relevant in your case before proceeding to
> >>>> operations that modify on-disk data.
> >>>>
> >>>>
> >>>> On 6.10.2018, at 03:17, Sergey Malinin <h...@newmail.com> wrote:
> >>>>
> >>>> I ended up rescanning the entire fs using the alternate metadata pool
> >>>> approach as in
> >>>> http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> >>>> The process has not completed yet, because during the recovery our
> >>>> cluster encountered another problem with OSDs that I got fixed
> >>>> yesterday (thanks to Igor Fedotov @ SUSE).
> >>>> The first stage (scan_extents) completed in 84 hours (120M objects in
> >>>> the data pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was
> >>>> interrupted by the OSD failure, so I have no timing stats, but it
> >>>> seems to be running 2-3 times faster than the extents scan.
> >>>> As to the root cause -- in my case I recall that during the upgrade I
> >>>> had forgotten to restart 3 OSDs, one of which was holding metadata
> >>>> pool contents, before restarting the MDS daemons, and that seems to
> >>>> have had an impact on the MDS journal corruption: when I restarted
> >>>> those OSDs, the MDS was able to start up but soon failed throwing
> >>>> lots of 'loaded dup inode' errors.
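> >>>>
> >>>> For reference, the scan commands were essentially the ones from that
> >>>> doc -- roughly the following sketch, with our names filled in
> >>>> ('recovery'/'recovery-fs' being the alternate metadata pool and
> >>>> recovery fs, 'cephfs'/'cephfs_data' the original fs and data pool):
> >>>>
> >>>> # cephfs-data-scan scan_extents --alternate-pool recovery --filesystem cephfs cephfs_data
> >>>> # cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem cephfs --force-corrupt --force-init cephfs_data
> >>>> # cephfs-data-scan scan_links --filesystem recovery-fs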
> >>>>
> >>>>
> >>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky
> >>>> <alfrenov...@gmail.com> wrote:
> >>>>
> >>>> Same problem...
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
> >>>> Overall journal integrity: DAMAGED
> >>>> Objects missing:
> >>>>   0x16c
> >>>> Corrupt regions:
> >>>>   0x5b000000-ffffffffffffffff
> >>>>
> >>>> Just after the upgrade to 13.2.2.
> >>>>
> >>>> Did you fix it?
> >>>>
> >>>>
> >>>> On 26/09/18 13:05, Sergey Malinin wrote:
> >>>>
> >>>> Hello,
> >>>> I followed the standard upgrade procedure to upgrade from 13.2.1 to
> >>>> 13.2.2. After the upgrade the MDS cluster is down; mds rank 0 and the
> >>>> purge_queue journal are damaged. Resetting the purge_queue does not
> >>>> seem to work well, as the journal still appears to be damaged.
> >>>> Can anybody help?
> >>>>
> >>>> mds log:
> >>>>
> >>>> -789> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.mds2 Updating MDS
> >>>> map to version 586 from mon.2
> >>>> -788> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map
> >>>> i am now mds.0.583
> >>>> -787> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map
> >>>> state change up:rejoin --> up:active
> >>>> -786> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 recovery_done
> >>>> -- successful recovery!
> >>>> <skip>
> >>>> -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue
> >>>> _consume: Decode error at read_pos=0x322ec6636
> >>>> -37> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2
> >>>> set_want_state: up:active -> down:damaged
> >>>> -36> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2 _send
> >>>> down:damaged seq 137
> >>>> -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient:
> >>>> _send_mon_message to mon.ceph3 at mon:6789/0
> >>>> -34> 2018-09-26 18:42:32.707 7f70f28a7700 1 -- mds:6800/e4cc09cf -->
> >>>> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 --
> >>>> 0x563b321ad480 con 0
> >>>> <skip>
> >>>> -3> 2018-09-26 18:42:32.743 7f70f98b5700 5 -- mds:6800/3838577103 >>
> >>>> mon:6789/0 conn(0x563b3213e000 :-1
> >>>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx
> >>>> mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311
> >>>> v587) v7
> >>>> -2> 2018-09-26 18:42:32.743 7f70f98b5700 1 -- mds:6800/3838577103 <==
> >>>> mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311
> >>>> v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con
> >>>> 0x563b3213e000
> >>>> -1> 2018-09-26 18:42:32.743 7f70f98b5700 5 mds.beacon.mds2
> >>>> handle_mds_beacon down:damaged seq 311 rtt 0.038261
> >>>> 0> 2018-09-26 18:42:32.743 7f70f28a7700 1 mds.mds2 respawn!
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> Overall journal integrity: DAMAGED
> >>>> Corrupt regions:
> >>>>   0x322ec65d9-ffffffffffffffff
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal reset
> >>>> old journal was 13470819801~8463
> >>>> new journal start will be 13472104448 (1276184 bytes past old end)
> >>>> writing journal head
> >>>> done
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
> >>>> Overall journal integrity: DAMAGED
> >>>> Objects missing:
> >>>>   0xc8c
> >>>> Corrupt regions:
> >>>>   0x323000000-ffffffffffffffff
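
To check which purge queue objects are actually present, you can list them
straight from the metadata pool (pool name assumed to be 'metadata' here):

# rados -p metadata ls | grep '^500\.' | sort

A gap in that sequence corresponds to a missing object such as 500.00000c8c
above.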