On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T <ne...@lanl.gov> wrote:
> In the last week we have made a few changes to the down filesystem in an 
> attempt to fix what we thought was an inode problem:
>
>
> cephfs-data-scan scan_extents   # about 1 day with 64 processes
>
> cephfs-data-scan scan_inodes   # about 1 day with 64 processes
>
> cephfs-data-scan scan_links   # about 1 day

Did you reset the journals or perform any other disaster recovery
commands? This process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
> ceph tell mds.a scrub start / recursive repair
>
>
> The repair ran about 3 days, spewing logs to `ceph -w` about duplicated 
> inodes, until it stopped. All looked well until we began bringing production 
> services back online, at which point many error messages appeared, the mds 
> went back into damaged, and the fs back to degraded. At this point I removed 
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
>     -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 
> 2019-1...
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  258: FAILED ceph_assert(!p)

This error indicates a duplicate inode was loaded into the cache. Fixing
this will probably require significant intervention, and recent changes
to (meta)data will be lost:

- Stop/unmount all clients. (Probably already the case if the rank is damaged!)

- Recover any dentries you can from the journal, then reset the MDS
journal [1]. (This will hopefully resolve the ESubtreeMap errors you
pasted.) Note that some metadata may be lost in this step.

- Run `cephfs-data-scan scan_links` again. This should repair any
duplicate inodes (by dropping the older dentries).

- Then you can try marking the rank as repaired.
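The steps above can be sketched as a shell sequence. This is only a
sketch: the filesystem name `cephfs` and rank `0` are assumptions for
your cluster, and the `run` wrapper deliberately just prints each
command so you can review the whole sequence before executing anything
destructive.

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence above.
# ASSUMPTIONS: filesystem name "cephfs" and rank 0 -- adjust for your cluster.
# 'run' only echoes each command; swap the echo for "$@" once you have
# reviewed the output and all clients are stopped/unmounted.
run() { echo "+ $*"; }

# 1. Recover what dentries we can from the journal, then reset it.
run cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
run cephfs-journal-tool --rank=cephfs:0 journal reset

# 2. Re-run scan_links to drop the older of any duplicate dentries.
run cephfs-data-scan scan_links

# 3. Mark the rank repaired so the MDS can be restarted.
run ceph mds repaired 0
```

Print the plan first; only once it matches what you intend should the
wrapper be changed to actually execute the commands.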

Good luck!

[1] 
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation


--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
