On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T <ne...@lanl.gov> wrote:
> In the last week we have made a few changes to the down filesystem in an
> attempt to fix what we thought was an inode problem:
>
> cephfs-data-scan scan_extents   # about 1 day with 64 processes
> cephfs-data-scan scan_inodes    # about 1 day with 64 processes
> cephfs-data-scan scan_links     # about 1 day
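[Editor's sketch of the three quoted phases, for readers unfamiliar with the tool. `cephfs-data-scan` shards the first two phases across workers with `--worker_n`/`--worker_m`; the data pool name `cephfs_data` is an assumption (the thread never names the pool), and with the default `DRY_RUN=echo` the script only prints each command rather than touching a cluster:]

```shell
# Hedged sketch, not a transcript of the poster's exact commands.
# "cephfs_data" is an assumed data pool name; DRY_RUN=echo prints
# each command instead of executing it against a live cluster.
DRY_RUN=${DRY_RUN:-echo}
NUM_WORKERS=64

# Phase 1: scan_extents, sharded across 64 workers via --worker_n/--worker_m.
for n in $(seq 0 $((NUM_WORKERS - 1))); do
  $DRY_RUN cephfs-data-scan scan_extents \
    --worker_n "$n" --worker_m "$NUM_WORKERS" cephfs_data &
done
wait  # every extent worker must finish before the next phase starts

# Phase 2: scan_inodes, sharded the same way.
for n in $(seq 0 $((NUM_WORKERS - 1))); do
  $DRY_RUN cephfs-data-scan scan_inodes \
    --worker_n "$n" --worker_m "$NUM_WORKERS" cephfs_data &
done
wait

# Phase 3: scan_links is not sharded; it runs as a single process.
$DRY_RUN cephfs-data-scan scan_links
```

[Each phase must complete across all workers before the next begins, which matches the roughly one-day-per-phase timing reported above.]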
Did you reset the journals or perform any other disaster recovery
commands? That process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
> ceph tell mds.a scrub start / recursive repair
>
> The repair ran for about 3 days, spewing logs to `ceph -w` about duplicated
> inodes, until it stopped. All looked well until we began bringing production
> services back online, at which point many error messages appeared, the MDS
> went back into damaged, and the fs back to degraded. At this point I removed
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
> -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
> In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time
> 2019-1...
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)

This error indicates a duplicate inode was loaded into cache. Fixing this
probably requires significant intervention, and (meta)data loss for recent
changes:

- Stop/unmount all clients. (Probably already the case if the rank is
  damaged!)
- Reset the MDS journal [1], optionally recovering any dentries first.
  (This will hopefully resolve the ESubtreeMap errors you pasted.) Note
  that some metadata may be lost through this command.
- Run `cephfs-data-scan scan_links` again. This should repair any duplicate
  inodes (by dropping the older dentries).
- Then you can try marking the rank as repaired.

Good luck!

[1] https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
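[Editor's sketch of the recovery sequence the reply recommends. The filesystem name `cephfs` and rank 0 are assumptions — the thread does not name the filesystem — and with the default `DRY_RUN=echo` the commands are printed, not executed:]

```shell
# Hedged sketch of the suggested recovery; "cephfs" and rank 0 are
# assumptions, not taken from the thread. DRY_RUN=echo prints each
# command instead of running it against a live cluster.
DRY_RUN=${DRY_RUN:-echo}

# 1. Optionally salvage dentries out of the journal before resetting it.
$DRY_RUN cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# 2. Reset the MDS journal; metadata in unflushed journal events is lost.
$DRY_RUN cephfs-journal-tool --rank=cephfs:0 journal reset

# 3. Re-run scan_links to repair duplicate inodes (drops the older dentries).
$DRY_RUN cephfs-data-scan scan_links

# 4. Mark the damaged rank repaired so a standby MDS can take it over.
$DRY_RUN ceph mds repaired cephfs:0
```

[Step 1 corresponds to the "optionally recover any dentries first" advice; every MDS for the filesystem must be stopped before running the journal tool.]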