I have a replicated cache pool and a metadata pool that reside on SSD drives, both with a size of 2, backed by an erasure-coded data pool.
The CephFS filesystem was in a healthy state. I pulled an SSD drive to perform an exercise in OSD failure.
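
(For reference, the layout was created roughly along these lines; the pool names, PG counts and EC profile below are illustrative placeholders rather than my exact values, and the SSD CRUSH rule for the metadata/cache pools is assumed to already exist:)

ceph osd pool create cephfs_metadata 256 256 replicated
ceph osd pool set cephfs_metadata size 2
ceph osd erasure-code-profile set ecprofile k=4 m=2
ceph osd pool create cephfs_data 512 512 erasure ecprofile
ceph osd pool create cephfs_cache 256 256 replicated
ceph osd pool set cephfs_cache size 2
# pre-luminous, CephFS needs a cache tier in front of an EC data pool
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache
ceph fs new cephfs cephfs_metadata cephfs_data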

The cluster recognized the SSD failure and re-replicated back to a healthy state, but I got a message saying "mds0: Metadata damage detected".


   cluster 62ed97d6-adf4-12e4-8fd5-3d9701b22b86
     health HEALTH_ERR
            mds0: Metadata damage detected
            mds0: Client master01.div18.swri.org failing to respond to cache pressure
     monmap e2: 3 mons at {ceph01=192.168.19.241:6789/0,ceph02=192.168.19.242:6789/0,ceph03=192.168.19.243:6789/0}
            election epoch 24, quorum 0,1,2 ceph01,darkjedi-ceph02,darkjedi-ceph03
      fsmap e25: 1/1/1 up {0=darkjedi-ceph04=up:active}, 1 up:standby
     osdmap e1327: 20 osds: 20 up, 20 in
            flags sortbitwise
      pgmap v11630: 1536 pgs, 3 pools, 100896 MB data, 442 kobjects
            201 GB used, 62915 GB / 63116 GB avail
                1536 active+clean

In the MDS log of the active MDS, I see the following:

7fad0c4b2700  0 -- 192.168.19.244:6821/17777 >> 192.168.19.243:6805/5090 
pipe(0x7fad25885400 sd=56 :33513 s=1 pgs=0 cs=0 l=1 c=0x7fad2585f980).fault
7fad14add700  0 mds.beacon.darkjedi-ceph04 handle_mds_beacon no longer laggy
7fad101d3700  0 mds.0.cache.dir(10000016c08) _fetched missing object for [dir 
10000016c08 /usr/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741952 f() n() 
hs=0+0,ss=0+0 | waiter=1 authpin=1 0x7fad25ced500]
7fad101d3700 -1 log_channel(cluster) log [ERR] : dir 10000016c08 object missing 
on disk; some files may be lost
7fad0f9d2700  0 -- 192.168.19.244:6821/17777 >> 192.168.19.242:6800/3746 
pipe(0x7fad25a4e800 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x7fad25bd5180).fault
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched fragstat size on 
single dirfrag 10000016c08, inode has f(v0 m2016-09-14 14:00:36.654244 
13=1+12), dirfrag has f(v0 m2016-09-14 14:00:36.654244 1=0+1)
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on 
single dirfrag 10000016c08, inode has n(v77 rc2016-09-14 14:00:36.654244 
b1533163206 48173=43133+5040), dirfrag has n(v77 rc2016-09-14 14:00:36.654244 
1=0+1)
7fad101d3700 -1 log_channel(cluster) log [ERR] : unmatched rstat on 
10000016c08, inode has n(v78 rc2016-09-14 14:00:36.656244 2=0+2), dirfrags have 
n(v0 rc2016-09-14 14:00:36.656244 3=0+3)
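
(If I understand the on-disk layout right, each directory fragment is stored in the metadata pool as an object named <inode-hex>.<frag-hex>, so the object the MDS is complaining about should be checkable directly with rados; "cephfs_metadata" here is just a stand-in for the metadata pool name:)

rados -p cephfs_metadata stat 10000016c08.00000000
rados -p cephfs_metadata ls | grep 10000016c08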

I'm not sure why the metadata got damaged, since it's being replicated, but I want to fix the issue and test again. However, I can't figure out the steps to repair the metadata.
I saw something about running a "damage ls", but I can't seem to find a more detailed repair document. Any pointers to get the metadata fixed? Both my MDS daemons seem to be running correctly, but that error bothers me; it shouldn't happen, I think.
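
(From what I can piece together, once "damage ls" shows the entries, the rough sequence would be something like the following, run against the active MDS over its admin socket; the daemon name is taken from the beacon line in the log above, and I'm not certain every option is supported on this release:)

# forward-scrub the affected path; the 'repair' flag may only be honoured on newer builds
ceph daemon mds.darkjedi-ceph04 scrub_path /usr recursive repair

# once the underlying objects are confirmed intact, clear the damage table entry by its id
ceph daemon mds.darkjedi-ceph04 damage rm <damage_id>

For anything worse (a dirfrag object that is genuinely gone), the offline tools from the CephFS disaster recovery docs (cephfs-journal-tool, cephfs-data-scan) look like the next step, but I'd rather not go there unless I have to.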

I tried the following command, but ceph doesn't understand it:
ceph --admin-daemon /var/run/ceph/ceph-mds. ceph03.asok damage ls
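
(In case it matters, I suspect the socket path can't contain a space and has to point at a daemon actually running on that host; something along these lines, with the daemon name again taken from the beacon line in the log:)

# run on the host where the active MDS lives
ceph daemon mds.darkjedi-ceph04 damage ls

# equivalent, using the explicit socket path with no space before the daemon name
ceph --admin-daemon /var/run/ceph/ceph-mds.darkjedi-ceph04.asok damage ls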


I then rebooted all 4 Ceph servers simultaneously (another stress test), and the cluster came back up healthy, with the MDS damage status cleared!! I then replaced the SSD, put it back into service, and let the backfill complete. The cluster was fully healthy. I pulled another SSD and repeated the process, yet I never got the damaged MDS messages again. Was this just random metadata damage from yanking a drive out? Are there any lingering effects on the metadata that I need to address?
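
(To convince myself nothing lingers, the checks I'd run are roughly these, again with the daemon name from the log and assuming scrub_path is available on this release:)

ceph health detail                                 # should be HEALTH_OK with no mds damage entry
ceph daemon mds.darkjedi-ceph04 damage ls          # should return an empty list
ceph daemon mds.darkjedi-ceph04 scrub_path / recursive   # re-walk the tree to re-verify the metadata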


- Jim

