Update: I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146. Make sure they are not relevant in your case before proceeding to operations that modify on-disk data.
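For anyone hitting this: "journal inspect" is read-only, and the journal contents can be exported before any destructive step. A minimal sketch, assuming a single active MDS rank and /root as scratch space (the backup file names are just examples):

# cephfs-journal-tool --journal=purge_queue journal inspect
# cephfs-journal-tool --journal=purge_queue journal export /root/purge_queue.bin
# cephfs-journal-tool --journal=mdlog journal export /root/mdlog.bin

Keeping a rados-level copy of the metadata pool (or at least of the 500.* purge queue objects seen in the errors below) is an additional safety net before running "journal reset" or cephfs-data-scan.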
> On 6.10.2018, at 03:17, Sergey Malinin <h...@newmail.com> wrote:
>
> I ended up rescanning the entire fs using the alternate metadata pool approach as
> in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> The process has not completed yet because during the recovery our cluster
> encountered another problem with OSDs that I got fixed yesterday (thanks to
> Igor Fedotov @ SUSE).
> The first stage (scan_extents) completed in 84 hours (120M objects in the data
> pool on 8 HDD OSDs on 4 hosts). The second (scan_inodes) was interrupted by
> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times
> faster than the extents scan.
> As to the root cause -- in my case I recall that during the upgrade I had forgotten
> to restart 3 OSDs, one of which was holding metadata pool contents, before
> restarting the MDS daemons, and that seems to have had an impact on the MDS journal
> corruption, because when I restarted those OSDs, the MDS was able to start up but
> soon failed, throwing lots of 'loaded dup inode' errors.
>
>
>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky <alfrenov...@gmail.com> wrote:
>>
>> Same problem...
>>
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
>> Overall journal integrity: DAMAGED
>> Objects missing:
>>   0x16c
>> Corrupt regions:
>>   0x5b000000-ffffffffffffffff
>>
>> Just after the upgrade to 13.2.2.
>>
>> Did you fix it?
>>
>>
>> On 26/09/18 13:05, Sergey Malinin wrote:
>>> Hello,
>>> I followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>> After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are
>>> damaged. Resetting the purge_queue does not seem to work well, as the journal still
>>> appears to be damaged.
>>> Can anybody help?
>>>
>>> mds log:
>>>
>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
>>> <skip>
>>>    -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
>>>    -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
>>>    -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
>>>    -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
>>>    -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
>>> <skip>
>>>     -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
>>>     -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
>>>     -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
>>>      0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x322ec65d9-ffffffffffffffff
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal reset
>>> old journal was 13470819801~8463
>>> new journal start will be 13472104448 (1276184 bytes past old end)
>>> writing journal head
>>> done
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0xc8c
>>> Corrupt regions:
>>>   0x323000000-ffffffffffffffff
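For reference, the alternate-metadata-pool recovery described above boils down to roughly the following sequence. This is a condensed sketch of the mimic disaster-recovery document linked in the quote; "recovery" and "recovery-fs" are placeholder pool/filesystem names, <pg-num>, <original fs name> and <original data pool> must be filled in for your cluster, and the exact options differ between releases, so treat the docs for your version as authoritative:

# ceph fs flag set enable_multiple true --yes-i-really-mean-it
# ceph osd pool create recovery <pg-num> replicated
# ceph fs new recovery-fs recovery <original data pool> --allow-dangerous-metadata-overlay
# cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
# ceph fs reset recovery-fs --yes-i-really-mean-it
# cephfs-table-tool recovery-fs:all reset session
# cephfs-table-tool recovery-fs:all reset snap
# cephfs-table-tool recovery-fs:all reset inode
# cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original fs name> <original data pool>
# cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original fs name> --force-corrupt --force-init <original data pool>
# cephfs-data-scan scan_links --filesystem recovery-fs

The scan_extents and scan_inodes phases can be split across several parallel workers (the --worker_n/--worker_m options described in the same document), which helps with multi-day runtimes like the 84 hours reported above.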
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com