Re: [ceph-users] Inconsistent PG could not be repaired

Arvydas Opulskis Thu, 16 Aug 2018 06:07:25 -0700

Hi Thomas,

thanks for suggestion, but changing other objects or even object itself
doesn't helped out.


But I finally solved the problem:

1. Backed up problematic S3 object
2. Deleted it from S3
3. Stopped OSD
4. Flushed journal
5. Removed object directly from OSD
6. Started OSD
7. Repeated 3-6 steps on other OSD
8. Did deep-scrub on problematic PG (inconsistency went away)
9. Checked S3 bucket with --fix option
10. Put S3 object back via S3
11. Did deep-scrub, checked for object in OSD, etc., to be sure it exist
and can be accessed

Thanks, guys, for ideas!

Arvydas



On Tue, Aug 14, 2018 at 10:24 PM, Thomas White <tho...@thomaswhite.se>
wrote:

> Hi Arvydas,
>
>
>
> The error seems to suggest this is not an issue with your object data, but
> the expected object digest data. I am unable to access where I stored my
> very hacky diagnosis process for this, but our eventual fix was to locate
> the bucket or files affected and then rename an object within it, forcing a
> recalculation of the digest. Depending on the size of the pool perhaps it
> would be possible to randomly rename a few files to cause this
> recalculation to occur to see if this remedies it?
>
>
>
> Kind Regards,
>
>
>
> Tom
>
>
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On Behalf Of *Arvydas
> Opulskis
> *Sent:* 14 August 2018 12:33
> *To:* Brent Kennedy <bkenn...@cfl.rr.com>
> *Cc:* Ceph Users <ceph-users@lists.ceph.com>
>
> *Subject:* Re: [ceph-users] Inconsistent PG could not be repaired
>
>
>
> Thanks for suggestion about restarting OSD's, but this doesn't work either.
>
>
>
> Anyway, I managed to fix second unrepairing PG by getting object from OSD
> and saving it again via rados, but still no luck with first one.
>
> I think, I found main problem why this doesn't work. Seems, object is not
> overwritten, even rados command returns no errors. I tried to delete
> object, but it still stays in pool untouched. There is an example of what I
> see:
>
>
>
> # rados -p .rgw.buckets ls | grep -i "sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
> default.142609570.87_20180203.020047/repositories/docker-
> local/yyy/company.yyy.api.assets/1.2.4/sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
> # rados -p .rgw.buckets get default.142609570.87_20180203.
> 020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d testfile
> error getting .rgw.buckets/default.142609570.87_20180203.020047/
> repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d: (2) No
> such file or directory
>
> # rados -p .rgw.buckets rm default.142609570.87_20180203.
> 020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
> # rados -p .rgw.buckets ls | grep -i "sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
> default.142609570.87_20180203.020047/repositories/docker-
> local/yyy/company.yyy.api.assets/1.2.4/sha256__
> ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
>
>
> I've never seen this in our Ceph clusters before. Should I report a bug
> about it? If any of you guys need more diagnostic info - let me know.
>
>
>
> Thanks,
>
> Arvydas
>
>
>
> On Tue, Aug 7, 2018 at 5:49 PM, Brent Kennedy <bkenn...@cfl.rr.com> wrote:
>
> Last time I had an inconsistent PG that could not be repaired using the
> repair command, I looked at which OSDs hosted the PG, then restarted them
> one by one(usually stopping, waiting a few seconds, then starting them back
> up ).  You could also stop them, flush the journal, then start them back
> up.
>
>
>
> If that didn’t work, it meant there was data loss and I had to use the
> ceph-objectstore-tool repair tool to export the objects from a location
> that had the latest data and import into the one that had no data.  The
> ceph-objectstore-tool is not a simple thing though and should not be used
> lightly.  When I say data loss, I mean that ceph thinks the last place
> written has the data, that place being the OSD that doesn’t actually have
> the data(meaning it failed to write there).
>
>
>
> If you want to go that route, let me know, I wrote a how to on it.  Should
> be the last resort though.  I also don’t know your setup, so I would hate
> to recommend something so drastic.
>
>
>
> -Brent
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Arvydas Opulskis
> *Sent:* Monday, August 6, 2018 4:12 AM
> *To:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Inconsistent PG could not be repaired
>
>
>
> Hi again,
>
>
>
> after two weeks I've got another inconsistent PG in same cluster. OSD's
> are different from first PG, object can not be GET as well:
>
>
> # rados list-inconsistent-obj 26.821 --format=json-pretty
>
> {
>
>     "epoch": 178472,
>
>     "inconsistents": [
>
>         {
>
>             "object": {
>
>                 "name": "default.122888368.52__shadow_
> .3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",
>
>                 "nspace": "",
>
>                 "locator": "",
>
>                 "snap": "head",
>
>                 "version": 118920
>
>             },
>
>             "errors": [],
>
>             "union_shard_errors": [
>
>                 "data_digest_mismatch_oi"
>
>             ],
>
>             "selected_object_info": "26:8411bae4:::default.
> 122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920
> client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv
> 118920 dd cd142aaa od ffffffff alloc_hint [0 0])",
>
>             "shards": [
>
>                 {
>
>                     "osd": 20,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 4194304,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x6b102e59"
>
>                 },
>
>                 {
>
>                     "osd": 44,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 4194304,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x6b102e59"
>
>                 }
>
>             ]
>
>         }
>
>     ]
>
> }
>
> # rados -p .rgw.buckets get default.122888368.52__shadow_.
> 3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file
>
> error getting .rgw.buckets/default.122888368.52__shadow_.
> 3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: (5) Input/output error
>
>
>
>
>
> Still struggling how to solve it. Any ideas, guys?
>
>
>
> Thank you
>
>
>
>
>
>
>
> On Tue, Jul 24, 2018 at 10:27 AM, Arvydas Opulskis <zebedie...@gmail.com>
> wrote:
>
> Hello, Cephers,
>
>
>
> after trying different repair approaches I am out of ideas how to repair
> inconsistent PG. I hope, someones sharp eye will notice what I overlooked.
>
>
>
> Some info about cluster:
>
> Centos 7.4
>
> Jewel 10.2.10
>
> Pool size 2 (yes, I know it's a very bad choice)
>
> Pool with inconsistent PG: .rgw.buckets
>
>
>
> After routine deep-scrub I've found PG 26.c3f in inconsistent status.
> While running "ceph pg repair 26.c3f" command and monitoring "ceph -w" log,
> I noticed these errors:
>
> 2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid
> 26:fc32a1f1:::default.142609570.87_20180206.093111%
> 2frepositories%2fnuget-local%2fApplication%2fCompany.
> Application.Api%2fCompany.Application.Api.1.1.1.nupkg.
> artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b !=
> data_digest 0x49a34c1f from auth oi 26:e261561a:::default.
> 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-
> segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-
> 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
> client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
> dd 49a34c1f od ffffffff alloc_hint [0 0])
>
>
>
> 2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid
> 26:fc32a1f1:::default.142609570.87_20180206.093111%
> 2frepositories%2fnuget-local%2fApplication%2fCompany.
> Application.Api%2fCompany.Application.Api.1.1.1.nupkg.
> artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b !=
> data_digest 0x49a34c1f from auth oi 26:e261561a:::default.
> 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-
> segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-
> 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
> client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
> dd 49a34c1f od ffffffff alloc_hint [0 0])
>
>
>
> 2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid 26:fc32a1f1:::default.
> 142609570.87_20180206.093111%2frepositories%2fnuget-local%
> 2fApplication%2fCompany.Application.Api%2fCompany.
> Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head:
> failed to pick suitable auth object
>
>
>
> ...and same errors about another object on same PG.
>
>
>
> Repair failed, so I checked inconsistencies "rados list-inconsistent-obj
> 26.c3f --format=json-pretty":
>
>
>
> {
>
>     "epoch": 178403,
>
>     "inconsistents": [
>
>         {
>
>             "object": {
>
>                 "name": "default.142609570.87_
> 20180203.020047\/repositories\/docker-local\/yyy\/company.
> yyy.api.assets\/1.2.4\/sha256__ce41e5246ead8bddd2a2b5bbb863db
> 250f328be9dc5c3041481d778a32f8130d",
>
>                 "nspace": "",
>
>                 "locator": "",
>
>                 "snap": "head",
>
>                 "version": 217749
>
>             },
>
>             "errors": [],
>
>             "union_shard_errors": [
>
>                 "data_digest_mismatch_oi"
>
>             ],
>
>             "selected_object_info": "26:f4ce1748:::default.
> 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-
> segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-
> 08T03%3a45%3a15+00%3a00.sha1:head(167944'217749
> client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749
> dd 422f251b od ffffffff alloc_hint [0 0])",
>
>             "shards": [
>
>                 {
>
>                     "osd": 30,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 40,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x551c282f"
>
>                 },
>
>                 {
>
>                     "osd": 36,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 40,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x551c282f"
>
>                 }
>
>             ]
>
>         },
>
>         {
>
>             "object": {
>
>                 "name": "default.142609570.87_
> 20180206.093111\/repositories\/nuget-local\/Application\/
> Company.Application.Api\/Company.Application.Api.1.1.1.
> nupkg.artifactory-metadata\/properties.xml",
>
>                 "nspace": "",
>
>                 "locator": "",
>
>                 "snap": "head",
>
>                 "version": 216051
>
>             },
>
>             "errors": [],
>
>             "union_shard_errors": [
>
>                 "data_digest_mismatch_oi"
>
>             ],
>
>             "selected_object_info": "26:e261561a:::default.
> 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-
> segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-
> 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
> client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
> dd 49a34c1f od ffffffff alloc_hint [0 0])",
>
>             "shards": [
>
>                 {
>
>                     "osd": 30,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 40,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x540e4f8b"
>
>                 },
>
>                 {
>
>                     "osd": 36,
>
>                     "errors": [
>
>                         "data_digest_mismatch_oi"
>
>                     ],
>
>                     "size": 40,
>
>                     "omap_digest": "0xffffffff",
>
>                     "data_digest": "0x540e4f8b"
>
>                 }
>
>             ]
>
>         }
>
>     ]
>
> }
>
>
>
>
>
> After some reading, I understand, I needed rados get/put trick to solve
> this problem. I couldn't do rados get, because I was getting "no such file"
> error, even objects were listed by "rados ls" command, so I got them
> directly from OSD. After putting them back to rados (rados commands doesn't
> returned any errors) and doing deep-scrub on same PG, problem still
> existed. The only thing changed - when I try to get object via rados now I
> get "(5) Input/output error".
>
>
>
> I tried force object size to 40 (it's real size of both objects) by adding
> "-o 40" option to "rados put" command, but with no luck.
>
>
>
> Guys, maybe you have other ideas what to try? Why overwriting object
> doesn't solve this problem?
>
>
>
> Thanks a lot!
>
>
>
> Arvydas
>
>
>
>
>
>
>
>
>
>
>
>
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Inconsistent PG could not be repaired

Reply via email to