Hi Ceph-Users,

I have been running into a few issues with CephFS metadata pool corruption
over the last few weeks. For background, please see
tracker.ceph.com/issues/17177

# ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

I am currently facing a side effect of this issue that is making it
difficult to repair an inconsistent PG in the metadata pool (pool 5), and I
could use some pointers.

The PG I am having the issue with is 5.c0:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors;
noout,sortbitwise,require_jewel_osds flag(s) set
pg 5.c0 is active+clean+inconsistent, acting [38,10,29]
1 scrub errors
noout,sortbitwise,require_jewel_osds flag(s) set
#

ceph pg 5.c0 query = http://pastebin.com/9yqrArTg

rados list-inconsistent-obj 5.c0 | python -m json.tool =
http://pastebin.com/iZV1TfxE

I have looked at the error log and it reports:

2016-12-19 16:43:36.944457 osd.38 172.27.175.12:6800/194902 10 : cluster
[ERR] 5.c0 shard 38: soid 5:035881fa:::10002639cb6.00000000:head
omap_digest 0xc54c7938 != best guess omap_digest 0xb6531260 from auth shard 10

If I attempt to repair this using 'ceph pg repair 5.c0', the cluster health
returns to OK, but if I then force a deep scrub using 'ceph pg deep-scrub
5.c0' the same error is reported with exactly the same omap_digest values.
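
For reference, this is the exact cycle I keep running (nothing beyond the
two commands above, plus checking health again afterwards):

ceph pg repair 5.c0        # health returns to HEALTH_OK shortly afterwards
ceph pg deep-scrub 5.c0    # the inconsistency comes straight back
ceph health detail         # same omap_digest mismatch on shard 38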

To understand the differences between the three OSDs, I performed the steps
below on each of OSDs 38, 10 and 29 (a scripted sketch of these steps
follows the list):

- Stop the OSD
- ceph-objectstore-tool --op list --pgid 5.c0 --data-path
/var/lib/ceph/osd/ceph-$OSDID | grep 10002639cb6 (the output is used in the
next command)
- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSDID
'["5.c0",{"oid":"10002639cb6.00000000","key":"","snapid":-2,"hash":1602296512,"max":0,"pool":5,"namespace":"","max":0}]'
list-omap
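
Roughly how I ran that on each OSD's host (the systemd unit name and the
output file naming, which I reuse for the diff below, are just how my hosts
are set up; the JSON object spec is taken straight from the --op list
output):

OSDID=38    # repeated with 10 and 29 on their respective hosts
systemctl stop ceph-osd@$OSDID
OBJ=$(ceph-objectstore-tool --op list --pgid 5.c0 \
      --data-path /var/lib/ceph/osd/ceph-$OSDID | grep 10002639cb6)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSDID \
      "$OBJ" list-omap > osd$OSDID-5.c0.txt
systemctl start ceph-osd@$OSDID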

Taking the output of the above, I ran a diff and found that osd.38 has the
following difference:

# diff osd10-5.c0.txt osd38-5.c0.txt
4405a4406
> B6492C5C-A917-A77F-5F301516EC6448F5.jpg_head
#

I assumed the above is a file name. Using a find on the file system, I
confirmed the file does not exist, so I must assume it was deleted and that
this is expected; I am therefore happy to try and correct this difference.
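
(The find itself was nothing clever; assuming the omap key is the dentry
name plus the usual '_head' suffix, and with /mnt/cephfs standing in for my
actual mount point, it was just:)

find /mnt/cephfs -name 'B6492C5C-A917-A77F-5F301516EC6448F5.jpg'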

As 'ceph pg repair 5.c0' was not working, I next tried following
http://ceph.com/planet/ceph-manually-repair-object/ to remove the object
from the file system. A deep-scrub run before the repair reports the object
as missing; after running the repair the object is copied back onto osd.38,
but a further deep-scrub then returns exactly the same omap_digest values,
with osd.38 again showing the difference (http://pastebin.com/iZV1TfxE).
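
For completeness, this is roughly what I did on osd.38, following that post
(a sketch from memory, assuming the standard FileStore layout, so the exact
paths may differ):

systemctl stop ceph-osd@38
ceph-osd -i 38 --flush-journal
OBJFILE=$(find /var/lib/ceph/osd/ceph-38/current/5.c0_head \
          -name '10002639cb6.00000000__head*')
mv "$OBJFILE" /root/     # move the object file out of the PG directory
systemctl start ceph-osd@38
ceph pg deep-scrub 5.c0  # now reports the object as missing on shard 38
ceph pg repair 5.c0      # copies the object back onto osd.38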

I assume this is because the omap data is stored inside LevelDB and not as
extended attributes on the object file:

getfattr -d
/var/lib/ceph/osd/ceph-38/current/5.60_head/DIR_0/DIR_6/DIR_C/DIR_A/100008ad724.00000000__head_CD74AC60__5
= http://pastebin.com/4Mc2mNNj

I tried to dig further into this by looking at the value of the omap key
using:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-38
'["5.c0",{"oid":"10002639cb6.00000000","key":"","snapid":-2,"hash":1602296512,"max":0,"pool":5,"namespace":"","max":0}]'
get-omap B6492C5C-A917-A77F-5F301516EC6448F5.jpg_head
output = http://pastebin.com/vVUmw9Qi

I also tried this on osd.29 using the command below and found it strange
that the value exists there, even though the key
'B6492C5C-A917-A77F-5F301516EC6448F5.jpg_head' is not listed in the output
of list-omap:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29
'["5.c0",{"oid":"10002639cb6.00000000","key":"","snapid":-2,"hash":1602296512,"max":0,"pool":5,"namespace":"","max":0}]'
get-omap B6492C5C-A917-A77F-5F301516EC6448F5.jpg_head
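
In other words, on osd.29 the key is absent from the listing but a get
still returns data (OBJ below is the same JSON object spec as above):

OBJ='["5.c0",{"oid":"10002639cb6.00000000","key":"","snapid":-2,"hash":1602296512,"max":0,"pool":5,"namespace":"","max":0}]'
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 "$OBJ" \
    list-omap | grep B6492C5C    # no match
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 "$OBJ" \
    get-omap B6492C5C-A917-A77F-5F301516EC6448F5.jpg_head    # returns a value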

I may be walking down the wrong track, but if anyone has any pointers that
could help with repairing this PG, or anything else I should be looking at
to investigate further, that would be very helpful.

Thanks