> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
>
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>>
>>> There doesn't appear to be any correlation between the OSDs which would
>>> point to a hardware issue, and since it's happening on two different
>>> clusters I'm wondering if there's a race condition that has been fixed
>>> in a later version?
>>>
>>> Also, what exactly is the omap digest? From what I can tell it appears
>>> to be some kind of checksum for the omap data. Is that correct?
>>
>> Yeah; it's just a crc over the omap key-value data that's checked
>> during deep scrub. Same as the data digest.
>>
>> I've not noticed any issues around this in Luminous but I probably
>> wouldn't have, so will have to leave it up to others if there are
>> fixes in since 12.2.8.
>
> Thanks for adding some clarity to that Greg!
>
> For some added information, this is what the logs reported earlier today:
>
> 2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
>
> I then tried deep scrubbing it again to see if the data was fine, but the
> digest calculation was just having problems. It came back with the same
> problem with new digest values:
>
> 2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
>
> Which makes sense, but doesn’t explain why the omap data is getting out of
> sync across multiple OSDs and clusters…
>
> I’ll see what I can figure out tomorrow, but if anyone else has some hints
> I would love to hear them.
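
To make sure I was reading the errors correctly, here's the mental model I've
been working from: a minimal Python sketch of "a crc over the omap key-value
data". The real OSD code uses crc32c and I haven't checked its exact seeding
or ordering, so the seed and numbers below are only illustrative and won't
match what the logs print.

#!/usr/bin/env python3
# Toy illustration of an omap digest: a rolling CRC over an object's omap
# key/value pairs, computed independently on each replica during deep scrub
# and then compared.  zlib.crc32 (plain crc32) is used only because it's in
# the standard library; the real OSD uses crc32c.
import zlib

def toy_omap_digest(omap_items):
    # omap_items: list of (key, value) byte-string pairs in omap order
    crc = 0xffffffff  # assumed starting value, purely for illustration
    for key, value in omap_items:
        crc = zlib.crc32(key, crc)
        crc = zlib.crc32(value, crc)
    return crc & 0xffffffff

# A replica that kept one extra stale entry ends up with a completely
# different digest than its peers:
clean = [(b"obj_a", b"meta_a"), (b"obj_b", b"meta_b")]
stale = clean + [(b"obj_c", b"meta_c")]
print(hex(toy_omap_digest(clean)))
print(hex(toy_omap_digest(stale)))
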
I’ve dug into this more today and it appears that the omap data contains an
extra entry on the OSDs with the mismatched omap digests. I then searched the
RGW logs and found that a DELETE happened shortly after the OSD booted, but
the omap data wasn’t updated on that OSD, so it became mismatched.

Here’s a timeline of the events which caused PG 7.9 to become inconsistent:

2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call happened
2019-04-08 12:06:14 - omap_digest mismatch detected (pg 7.9 is active+clean+inconsistent, acting [492,546,523])

Here’s the timeline for PG 7.2b:

2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call happened
2019-04-08 12:42:21 - omap_digest mismatch detected (pg 7.2b is active+clean+inconsistent, acting [488,511,541])

Bryan
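
P.S. For anyone who wants to poke at the same data, something like the sketch
below will print the per-shard omap_digest values for an inconsistent PG,
which makes the disagreeing OSD easy to spot. It's just a thin wrapper around
'rados list-inconsistent-obj'; I'm assuming the JSON layout that command gives
on our Luminous clusters (an "inconsistents" list with per-shard "omap_digest"
fields), so treat it as a starting point rather than a polished tool. From
there, comparing the actual omap keys between the flagged shard and a healthy
one is what showed the extra entry described above.

#!/usr/bin/env python3
# Print the per-shard omap_digest for every inconsistent object in a PG so
# the shard that disagrees with its peers stands out.
import json
import subprocess
import sys

def show_shard_digests(pgid):
    out = subprocess.check_output(
        ["rados", "list-inconsistent-obj", pgid, "--format=json"])
    report = json.loads(out)
    for obj in report.get("inconsistents", []):
        print(obj["object"]["name"])
        for shard in obj.get("shards", []):
            print("  osd.%s  omap_digest=%s  errors=%s" % (
                shard.get("osd"),
                shard.get("omap_digest", "n/a"),
                ",".join(shard.get("errors", [])) or "none"))

if __name__ == "__main__":
    show_shard_digests(sys.argv[1] if len(sys.argv) > 1 else "7.9")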