Hi All...
Just dropping a small email to share our experience on how to recover a
pg from a cephfs metadata pool.
The reason why I am sharing this information is that the general
understanding of how to recover a pg (check [1]) relies on identifying
the incorrect objects by comparing checksums between the different replicas.
That procedure cannot be applied to inconsistent pgs in the cephfs
metadata pool because all the objects there have zero size: the real core
of the information is stored as omap key/value pairs in the OSDs' leveldb.
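If you want to see this for yourself, something along these lines should
show it (the pool name 'cephfs_metadata' is just an example, use the name
of your own metadata pool; any object from the pool will do, I am using the
602.00000000 object that shows up later in this email):
# rados -p cephfs_metadata stat 602.00000000
# rados -p cephfs_metadata listomapkeys 602.00000000 | wc -l
The first command should report a size of 0 for the object, while the
second should report a fair number of omap keys.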
As a pragmatic example, some time ago we hit the following error:
2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 :
cluster [INF] 5.3d0 deep-scrub starts
2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 :
cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head
omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from
auth shard 49
2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 :
cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 :
cluster [ERR] 5.3d0 deep-scrub 1 errors
The acting osds for this pg were [78,59,49] and 78 was the primary.
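As a side note, if you need to find the acting set and the primary for a
pg, 'ceph pg map' should give you that, and 'ceph health detail' lists the
pgs currently flagged as inconsistent:
# ceph pg map 5.3d0
# ceph health detail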
The error is telling us that there is a divergence between the digest of
the omap information on shard / osd 78 with respect to shard / osd 49.
The omap_digest is a CRC32 calculated over the omap header & key/values.
Also, please note that the log gives you both a shard and an auth shard
osd id; this is important for understanding how 'pg repair' works in this case.
Another useful way to understand what is divergent is to use 'rados
list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is
available from the Jewel release onwards (see the result of that command
at the end of this email). It tells you, in a nice and clean way, which
object is problematic, what the source of the divergence is, and which osd
holds the bad copy. In our case, the tool confirms that there is an
omap_digest_mismatch and that osd 78 is the one which differs from the
other two. Please note that the information spit out by the command is the
result of the initial pg deep-scrub; if you have lived with that error for
some time and your logs have rotated, you may have to run a manual
deep-scrub on the pg for that command to report useful information again.
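For completeness, the manual deep-scrub is just the command below (it runs
asynchronously, so wait for it to finish before querying the pg again):
# ceph pg deep-scrub 5.3d0
# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat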
If you actually want to understand the source of our divergence, you can
go through [2], where we found that osd.78 was missing about ~500 keys
(we are still in the process of understanding why that happened).
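For anyone wanting to do a similar comparison, the rough idea (full
details in [2]) is to dump the omap keys of the object from each replica
with ceph-objectstore-tool and diff the resulting lists. A sketch only,
assuming FileStore paths under /var/lib/ceph/osd and with the osd stopped
while the tool runs against it:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78 --journal-path /var/lib/ceph/osd/ceph-78/journal --pgid 5.3d0 602.00000000 list-omap > /tmp/omap.osd78
Repeating the same on the osds holding the other replicas and diffing the
files should point you straight at the missing or divergent keys.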
Our fear was that, as commonly mentioned in many forums, a pg repair
would push the copies from the primary osd to its peers, leading, in our
case, to data corruption.
However, going through the code and with the help of Brad Hubbard from
RH, we understood that a pg repair triggers the copy from the auth shard
to the problematic shard. Please note that the auth shard may not be the
primary osd. In our precise case, running a 'pg repair' resulted in an
updated object on osd.78 (which is the primary osd), while the timestamps
of the same object on the peers remained unchanged. We also collected the
object's omap key list before and after the repair and checked that all
the previously missing keys were now present. Again, if you want details,
please check [2].
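For the record, the repair itself is just the standard command; once it
finishes, a fresh deep-scrub should confirm the pg goes back to a clean
state:
# ceph pg repair 5.3d0
# ceph pg deep-scrub 5.3d0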
Hope this is useful for others.
Cheers
Goncalo
[1] http://ceph.com/planet/ceph-manually-repair-object/
[2] http://tracker.ceph.com/issues/17177#change-78032
# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
[
  {
    "object": {
      "name": "602.00000000",
      "nspace": "",
      "locator": "",
      "snap": "head"
    },
    "missing": false,
    "stat_err": false,
    "read_err": false,
    "data_digest_mismatch": false,
    "omap_digest_mismatch": true,
    "size_mismatch": false,
    "attr_mismatch": false,
    "shards": [
      {
        "osd": 49,
        "missing": false,
        "read_error": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": false,
        "size_mismatch": false,
        "data_digest_mismatch_oi": false,
        "omap_digest_mismatch_oi": false,
        "size_mismatch_oi": false,
        "size": 0,
        "omap_digest": "0xaa3fd281",
        "data_digest": "0xffffffff"
      },
      {
        "osd": 59,
        "missing": false,
        "read_error": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": false,
        "size_mismatch": false,
        "data_digest_mismatch_oi": false,
        "omap_digest_mismatch_oi": false,
        "size_mismatch_oi": false,
        "size": 0,
        "omap_digest": "0xaa3fd281",
        "data_digest": "0xffffffff"
      },
      {
        "osd": 78,
        "missing": false,
        "read_error": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": true,
        "size_mismatch": false,
        "data_digest_mismatch_oi": false,
        "omap_digest_mismatch_oi": false,
        "size_mismatch_oi": false,
        "size": 0,
        "omap_digest": "0x7600bd9e",
        "data_digest": "0xffffffff"
      }
    ]
  }
]
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937