> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
>
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>>
>>> There doesn't appear to be any correlation between the OSDs which would
>>> point to a hardware issue, and since it's happening on two different
>>> clusters I'm wondering if there's a race condition that has been fixed
>>> in a later version?
>>>
>>> Also, what exactly is the omap digest? From what I can tell it appears
>>> to be some kind of checksum for the omap data. Is that correct?
>>
>> Yeah; it's just a crc over the omap key-value data that's checked
>> during deep scrub. Same as the data digest.
>>
>> I've not noticed any issues around this in Luminous but I probably
>> wouldn't have, so will have to leave it up to others if there are
>> fixes in since 12.2.8.
>
> Thanks for adding some clarity to that Greg!
>
> For some added information, this is what the logs reported earlier today:
>
> 2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
>
> I then tried deep scrubbing it again to see if the data was fine, but the
> digest calculation was just having problems. It came back with the same
> problem with new digest values:
>
> 2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
>
> Which makes sense, but doesn’t explain why the omap data is getting out of
> sync across multiple OSDs and clusters…
>
> I’ll see what I can figure out tomorrow, but if anyone else has some hints
> I would love to hear them.
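
To make sure I was reading the errors correctly, here's the mental model I've
been working from: a minimal Python sketch of "a crc over the omap key-value
data". The real OSD code uses crc32c and I haven't checked its exact seeding
or ordering, so the seed and numbers below are only illustrative and won't
match what the logs print.

#!/usr/bin/env python3
# Toy illustration of an omap digest: a rolling CRC over an object's omap
# key/value pairs, computed independently on each replica during deep scrub
# and then compared.  zlib.crc32 (plain crc32) is used only because it's in
# the standard library; the real OSD uses crc32c.
import zlib

def toy_omap_digest(omap_items):
    # omap_items: list of (key, value) byte-string pairs in omap order
    crc = 0xffffffff  # assumed starting value, purely for illustration
    for key, value in omap_items:
        crc = zlib.crc32(key, crc)
        crc = zlib.crc32(value, crc)
    return crc & 0xffffffff

# A replica that kept one extra stale entry ends up with a completely
# different digest than its peers:
clean = [(b"obj_a", b"meta_a"), (b"obj_b", b"meta_b")]
stale = clean + [(b"obj_c", b"meta_c")]
print(hex(toy_omap_digest(clean)))
print(hex(toy_omap_digest(stale)))
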
I’ve dug into this more today and it appears that the omap data contains an
extra entry on the OSDs with the mismatched omap digests. I then searched the
RGW logs and found that a DELETE happened shortly after the OSD booted, but
the omap data wasn’t updated on that OSD, so it became mismatched.

Here’s a timeline of the events which caused PG 7.9 to become inconsistent:

2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call happened
2019-04-08 12:06:14 - omap_digest mismatch detected (pg 7.9 is active+clean+inconsistent, acting [492,546,523])

Here’s the timeline for PG 7.2b:

2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call happened
2019-04-08 12:42:21 - omap_digest mismatch detected (pg 7.2b is active+clean+inconsistent, acting [488,511,541])

Bryan
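
P.S. For anyone who wants to poke at the same data, something like the sketch
below will print the per-shard omap_digest values for an inconsistent PG,
which makes the disagreeing OSD easy to spot. It's just a thin wrapper around
'rados list-inconsistent-obj'; I'm assuming the JSON layout that command gives
on our Luminous clusters (an "inconsistents" list with per-shard "omap_digest"
fields), so treat it as a starting point rather than a polished tool. From
there, comparing the actual omap keys between the flagged shard and a healthy
one is what showed the extra entry described above.

#!/usr/bin/env python3
# Print the per-shard omap_digest for every inconsistent object in a PG so
# the shard that disagrees with its peers stands out.
import json
import subprocess
import sys

def show_shard_digests(pgid):
    out = subprocess.check_output(
        ["rados", "list-inconsistent-obj", pgid, "--format=json"])
    report = json.loads(out)
    for obj in report.get("inconsistents", []):
        print(obj["object"]["name"])
        for shard in obj.get("shards", []):
            print("  osd.%s  omap_digest=%s  errors=%s" % (
                shard.get("osd"),
                shard.get("omap_digest", "n/a"),
                ",".join(shard.get("errors", [])) or "none"))

if __name__ == "__main__":
    show_shard_digests(sys.argv[1] if len(sys.argv) > 1 else "7.9")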