On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
> We have two separate RGW clusters running Luminous (12.2.8) that have started 
> seeing an increase in PGs going active+clean+inconsistent because of an 
> omap_digest mismatch.  Both clusters are using FileStore, and the inconsistent 
> PGs are all in the .rgw.buckets.index pool, which was moved from HDDs to SSDs 
> within the last few months.
> We've been repairing them by first making sure the OSD with the odd 
> omap_digest is not the primary (setting its primary-affinity to 0 if needed), 
> doing the repair, and then setting the primary-affinity back to 1.
> For example PG 7.3 went inconsistent earlier today:
> # rados list-inconsistent-obj 7.3 -f json-pretty | jq -r '.inconsistents[] | .errors, .shards'
> [
>   "omap_digest_mismatch"
> ]
> [
>   {
>     "osd": 504,
>     "primary": true,
>     "errors": [],
>     "size": 0,
>     "omap_digest": "0x4c10ee76",
>     "data_digest": "0xffffffff"
>   },
>   {
>     "osd": 525,
>     "primary": false,
>     "errors": [],
>     "size": 0,
>     "omap_digest": "0x26a1241b",
>     "data_digest": "0xffffffff"
>   },
>   {
>     "osd": 556,
>     "primary": false,
>     "errors": [],
>     "size": 0,
>     "omap_digest": "0x26a1241b",
>     "data_digest": "0xffffffff"
>   }
> ]
> Since the odd omap_digest is on osd.504 and osd.504 is the primary, we would 
> set the primary-affinity to 0 with:
> # ceph osd primary-affinity osd.504 0
> Do the repair:
> # ceph pg repair 7.3
> And then once the repair is complete we would set the primary-affinity back 
> to 1 on osd.504:
> # ceph osd primary-affinity osd.504 1
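
The "find the odd omap_digest" step above could be scripted. What follows is only a rough sketch, not something from the Ceph tooling: it assumes the JSON has the shape implied by the jq filter and output quoted above (a top-level "inconsistents" array whose entries carry a "shards" list), and the function name find_odd_shards is made up for illustration. It reports the shards whose digest is in the minority and skips any object where there is no clear majority.

```python
import json
from collections import Counter


def find_odd_shards(report):
    """Return (osd, is_primary, omap_digest) for every shard whose
    omap_digest disagrees with the majority digest of its object.

    Hypothetical helper: the input shape is assumed from the
    `rados list-inconsistent-obj` output quoted in this thread.
    """
    odd = []
    for obj in report.get("inconsistents", []):
        shards = obj.get("shards", [])
        counts = Counter(s["omap_digest"] for s in shards)
        if len(counts) < 2:
            continue  # all shards agree on the omap digest
        ranked = counts.most_common()
        if ranked[0][1] == ranked[1][1]:
            continue  # tie: no clear majority, leave it to a human
        majority = ranked[0][0]
        for s in shards:
            if s["omap_digest"] != majority:
                odd.append((s["osd"], s["primary"], s["omap_digest"]))
    return odd


# Shard data copied from the PG 7.3 example above.
sample = json.loads("""
{"inconsistents": [{"errors": ["omap_digest_mismatch"], "shards": [
  {"osd": 504, "primary": true,  "omap_digest": "0x4c10ee76"},
  {"osd": 525, "primary": false, "omap_digest": "0x26a1241b"},
  {"osd": 556, "primary": false, "omap_digest": "0x26a1241b"}
]}]}
""")

for osd, is_primary, digest in find_odd_shards(sample):
    note = "primary, lower its affinity before repair" if is_primary else "replica"
    print(f"osd.{osd} has odd omap_digest {digest} ({note})")
```

For the PG 7.3 data this flags osd.504 as the odd shard and notes that it is the primary, which matches the manual reading above. With only two disagreeing shards the sketch reports nothing, since it cannot tell which copy is the good one.
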
> There doesn't appear to be any correlation between the OSDs which would point 
> to a hardware issue, and since it's happening on two different clusters I'm 
> wondering if there's a race condition that has been fixed in a later version?
> Also, what exactly is the omap digest?  From what I can tell it appears to be 
> some kind of checksum for the omap data.  Is that correct?

Yeah; it's just a CRC over the omap key-value data that's checked
during deep scrub, in the same way the data digest is a CRC over the
object data.
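
To illustrate the idea (and only the idea), here is a toy sketch of a checksum over key-value pairs. It is not Ceph's implementation: it uses zlib.crc32 from the Python standard library as a stand-in checksum, and the seed, the key ordering, and the length-prefixed encoding are all assumptions made for the example. The function name omap_digest_sketch is hypothetical.

```python
import struct
import zlib


def omap_digest_sketch(omap, seed=0):
    """Toy digest over a dict of bytes -> bytes.

    Assumptions (not Ceph's actual algorithm): CRC-32 via zlib,
    keys visited in sorted order, and each key and value fed in
    with a 4-byte little-endian length prefix.
    """
    crc = seed
    for key in sorted(omap):
        value = omap[key]
        for chunk in (key, value):
            crc = zlib.crc32(struct.pack("<I", len(chunk)), crc)
            crc = zlib.crc32(chunk, crc)
    return crc


# Two replicas holding the same entries produce the same digest,
# regardless of insertion order.
replica_a = {b"obj1": b"1", b"obj2": b"1"}
replica_b = {b"obj2": b"1", b"obj1": b"1"}
# A third replica where one value has diverged.
replica_c = {b"obj1": b"2", b"obj2": b"1"}

print(hex(omap_digest_sketch(replica_a)))
print(omap_digest_sketch(replica_a) == omap_digest_sketch(replica_b))
print(omap_digest_sketch(replica_a) == omap_digest_sketch(replica_c))
```

The point is just that replicas whose omap contents are identical yield the same digest, while a single differing value changes it, which is what deep scrub compares across the shards.
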

I've not noticed any issues around this in Luminous, but I probably
wouldn't have, so I will have to leave it to others to say whether
there have been fixes since 12.2.8.
ceph-users mailing list
