Can you open a tracker for this, Dan, and provide scrub logs with
debug_osd=20 along with the rados list-inconsistent-obj output?
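
For reference, a rough sketch of how those could be captured (the pgid
and OSD id here are placeholders taken from Reed's example below; use
the PG reported inconsistent on your cluster and its primary OSD):

  # raise scrub logging on the primary (repeat for the other acting OSDs)
  ceph tell osd.19 injectargs '--debug_osd 20'
  # re-run the deep scrub and wait for it to finish
  ceph pg deep-scrub 17.2b9
  # then capture the inconsistency report
  rados list-inconsistent-obj 17.2b9 --format=json-pretty > inconsistent-17.2b9.json

The scrub detail should then be in the primary's log, typically
/var/log/ceph/ceph-osd.19.log.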

On Mon, Jun 3, 2019 at 10:44 PM Dan van der Ster <d...@vanderster.com> wrote:
>
> Hi Reed and Brad,
>
> Did you ever learn more about this problem?
> We currently have a few inconsistencies appearing with the same
> environment (CephFS, v13.2.5) and the same symptoms.
>
> PG Repair doesn't fix the inconsistency, nor does Brad's omap
> workaround earlier in the thread.
> In our case, we can fix it by cp'ing the file to a new inode, deleting
> the inconsistent file, and then scrubbing the PG.
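>
> Roughly, a sketch of that workaround (using Reed's object name below
> as the example; the mountpoint and path are placeholders, and the
> inode lookup relies on CephFS data objects being named
> <hex inode>.<stripe index>, so 10008536718.00000000 maps to inode
> 0x10008536718):
>
>   # find the file that backs the inconsistent object
>   find /mnt/cephfs -inum $((0x10008536718))
>   # copy it to a new inode, drop the bad one, put the copy in place
>   cp -a /mnt/cephfs/path/to/file /mnt/cephfs/path/to/file.new
>   rm /mnt/cephfs/path/to/file
>   mv /mnt/cephfs/path/to/file.new /mnt/cephfs/path/to/file
>   # then re-scrub the affected PG
>   ceph pg deep-scrub <pgid>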
>
> -- Dan
>
>
> On Fri, May 3, 2019 at 3:18 PM Reed Dier <reed.d...@focusvq.com> wrote:
> >
> > Just to follow up for the sake of the mailing list,
> >
> > I had not had a chance to attempt your steps yet, but things appear to
> > have worked themselves out.
> >
> > Both scrub errors cleared without intervention. I'm not sure whether it
> > was that object being touched in CephFS that triggered the update of the
> > size info, or whether something else cleared it.
> >
> > I didn't see anything related to the clearing in the mon, mgr, or OSD logs.
> >
> > So I'm not entirely sure what fixed it, but it has resolved on its own.
> >
> > Thanks,
> >
> > Reed
> >
> > On Apr 30, 2019, at 8:01 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> >
> > On Wed, May 1, 2019 at 10:54 AM Brad Hubbard <bhubb...@redhat.com> wrote:
> >
> >
> > Which size is correct?
> >
> >
> > Sorry, accidental discharge =D
> >
> > If the object info size is *incorrect*, try forcing a write to the OI
> > with something like the following:
> >
> > 1. rados -p [name_of_pool_17] setomapval 10008536718.00000000 temporary-key anything
> > 2. ceph pg deep-scrub 17.2b9
> > 3. Wait for the scrub to finish
> > 4. rados -p [name_of_pool_17] rmomapkey 10008536718.00000000 temporary-key
> >
> > If the object info size is *correct*, you could try just doing a rados
> > get followed by a rados put of the object to see if the size is
> > updated correctly.
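> >
> > For example, something like this (pool name is a placeholder, as above):
> >
> > 1. rados -p [name_of_pool_17] get 10008536718.00000000 /tmp/10008536718
> > 2. rados -p [name_of_pool_17] put 10008536718.00000000 /tmp/10008536718
> > 3. ceph pg deep-scrub 17.2b9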
> >
> > It's more likely the object info size is wrong IMHO.
> >
> >
> > On Tue, Apr 30, 2019 at 1:06 AM Reed Dier <reed.d...@focusvq.com> wrote:
> >
> >
> > Hi list,
> >
> > I woke up this morning to two PGs reporting scrub errors, in a way that I
> > haven't seen before.
> >
> > $ ceph versions
> > {
> >    "mon": {
> >        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
> >    },
> >    "mgr": {
> >        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
> >    },
> >    "osd": {
> >        "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156
> >    },
> >    "mds": {
> >        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2
> >    },
> >    "overall": {
> >        "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156,
> >        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 8
> >    }
> > }
> >
> >
> > OSD_SCRUB_ERRORS 8 scrub errors
> > PG_DAMAGED Possible data damage: 2 pgs inconsistent
> >    pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> >    pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
> >
> >
> > Here is what $ rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
> >
> > {
> >    "epoch": 134582,
> >    "inconsistents": [
> >        {
> >            "object": {
> >                "name": "10008536718.00000000",
> >                "nspace": "",
> >                "locator": "",
> >                "snap": "head",
> >                "version": 0
> >            },
> >            "errors": [],
> >            "union_shard_errors": [
> >                "obj_size_info_mismatch"
> >            ],
> >            "shards": [
> >                {
> >                    "osd": 7,
> >                    "primary": false,
> >                    "errors": [
> >                        "obj_size_info_mismatch"
> >                    ],
> >                    "size": 5883,
> >                    "object_info": {
> >                        "oid": {
> >                            "oid": "10008536718.00000000",
> >                            "key": "",
> >                            "snapid": -2,
> >                            "hash": 1752643257,
> >                            "max": 0,
> >                            "pool": 17,
> >                            "namespace": ""
> >                        },
> >                        "version": "134599'448331",
> >                        "prior_version": "134599'448330",
> >                        "last_reqid": "client.1580931080.0:671854",
> >                        "user_version": 448331,
> >                        "size": 3505,
> >                        "mtime": "2019-04-28 15:32:20.003519",
> >                        "local_mtime": "2019-04-28 15:32:25.991015",
> >                        "lost": 0,
> >                        "flags": [
> >                            "dirty",
> >                            "data_digest",
> >                            "omap_digest"
> >                        ],
> >                        "truncate_seq": 899,
> >                        "truncate_size": 0,
> >                        "data_digest": "0xf99a3bd3",
> >                        "omap_digest": "0xffffffff",
> >                        "expected_object_size": 0,
> >                        "expected_write_size": 0,
> >                        "alloc_hint_flags": 0,
> >                        "manifest": {
> >                            "type": 0
> >                        },
> >                        "watchers": {}
> >                    }
> >                },
> >                {
> >                    "osd": 16,
> >                    "primary": false,
> >                    "errors": [
> >                        "obj_size_info_mismatch"
> >                    ],
> >                    "size": 5883,
> >                    "object_info": {
> >                        "oid": {
> >                            "oid": "10008536718.00000000",
> >                            "key": "",
> >                            "snapid": -2,
> >                            "hash": 1752643257,
> >                            "max": 0,
> >                            "pool": 17,
> >                            "namespace": ""
> >                        },
> >                        "version": "134599'448331",
> >                        "prior_version": "134599'448330",
> >                        "last_reqid": "client.1580931080.0:671854",
> >                        "user_version": 448331,
> >                        "size": 3505,
> >                        "mtime": "2019-04-28 15:32:20.003519",
> >                        "local_mtime": "2019-04-28 15:32:25.991015",
> >                        "lost": 0,
> >                        "flags": [
> >                            "dirty",
> >                            "data_digest",
> >                            "omap_digest"
> >                        ],
> >                        "truncate_seq": 899,
> >                        "truncate_size": 0,
> >                        "data_digest": "0xf99a3bd3",
> >                        "omap_digest": "0xffffffff",
> >                        "expected_object_size": 0,
> >                        "expected_write_size": 0,
> >                        "alloc_hint_flags": 0,
> >                        "manifest": {
> >                            "type": 0
> >                        },
> >                        "watchers": {}
> >                    }
> >                },
> >                {
> >                    "osd": 19,
> >                    "primary": true,
> >                    "errors": [
> >                        "obj_size_info_mismatch"
> >                    ],
> >                    "size": 5883,
> >                    "object_info": {
> >                        "oid": {
> >                            "oid": "10008536718.00000000",
> >                            "key": "",
> >                            "snapid": -2,
> >                            "hash": 1752643257,
> >                            "max": 0,
> >                            "pool": 17,
> >                            "namespace": ""
> >                        },
> >                        "version": "134599'448331",
> >                        "prior_version": "134599'448330",
> >                        "last_reqid": "client.1580931080.0:671854",
> >                        "user_version": 448331,
> >                        "size": 3505,
> >                        "mtime": "2019-04-28 15:32:20.003519",
> >                        "local_mtime": "2019-04-28 15:32:25.991015",
> >                        "lost": 0,
> >                        "flags": [
> >                            "dirty",
> >                            "data_digest",
> >                            "omap_digest"
> >                        ],
> >                        "truncate_seq": 899,
> >                        "truncate_size": 0,
> >                        "data_digest": "0xf99a3bd3",
> >                        "omap_digest": "0xffffffff",
> >                        "expected_object_size": 0,
> >                        "expected_write_size": 0,
> >                        "alloc_hint_flags": 0,
> >                        "manifest": {
> >                            "type": 0
> >                        },
> >                        "watchers": {}
> >                    }
> >                }
> >            ]
> >        }
> >    ]
> > }
> >
> >
> > To snip that down to the parts that appear to matter:
> >
> > "errors": [],
> >        "union_shard_errors": [
> >            "obj_size_info_mismatch"
> >            ],
> >            "shards": [
> >                {
> >                    "errors": [
> >                        "obj_size_info_mismatch"
> >                    ],
> >                    "size": 5883,
> >                    "object_info": {
> >                       "size": 3505, }
> >
> >
> > It looks like the size info does, in fact, mismatch (5883 != 3505).
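> >
> > As a sanity check on which of the two sizes the read path actually
> > serves, one option (the pool name is a placeholder here) would be to
> > fetch the object and stat the result:
> >
> > $ rados -p <name_of_pool_17> get 10008536718.00000000 /tmp/10008536718
> > $ stat -c %s /tmp/10008536718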
> >
> > So I attempted a deep-scrub again, and the issue persists across both PGs.
> >
> > 2019-04-29 09:08:27.729 7fe4f5bee700  0 log_channel(cluster) log [DBG] : 17.2b9 deep-scrub starts
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 19 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 7 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 16 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 soid 17:9d6cee16:::10008536718.00000000:head : failed to pick suitable object info
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : deep-scrub 17.2b9 17:9d6cee16:::10008536718.00000000:head : on disk size (5883) does not match object info size (3505) adjusted for ondisk to (3505)
> > 2019-04-29 09:27:46.840 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 deep-scrub 4 errors
> >
> >
> > Pool 17 is a CephFS data pool, if that makes any difference.
> > And the two MDSs listed in versions are active:standby, not active:active.
> >
> > My question is whether I should run a `ceph pg repair <pgid>` to attempt
> > a fix of these objects, or take another approach, since the object size
> > mismatch persists across all 3 copies of the PG(s).
> > I know that ceph pg repair can be dangerous in certain circumstances, so I
> > want to feel confident in the operation before undertaking the repair.
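> >
> > For clarity, the operation I'm considering would be:
> >
> > $ ceph pg repair 17.72
> > $ ceph pg repair 17.2b9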
> >
> > I did look at all the underlying disks for these PGs for issues or errors,
> > and none bubbled to the top, so I don't believe it to be a hardware issue
> > in this case.
> >
> > Appreciate any help.
> >
> > Thanks,
> >
> > Reed
> >
> >
> >
> >
> > --
> > Cheers,
> > Brad
> >
> >



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
