Hi Thomas,

thanks for the suggestion, but changing other objects, or even the object itself, didn't help.
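The rename itself was nothing fancy, by the way: just an S3-level move and move back, something along these lines with s3cmd (bucket and key here are only placeholders):

# s3cmd mv s3://my-bucket/some/object s3://my-bucket/some/object.renamed
# s3cmd mv s3://my-bucket/some/object.renamed s3://my-bucket/some/object

The inconsistency stayed the same afterwards.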
But I finally solved the problem:

1. Backed up the problematic S3 object
2. Deleted it from S3
3. Stopped the OSD
4. Flushed the journal
5. Removed the object directly from the OSD
6. Started the OSD
7. Repeated steps 3-6 on the other OSD
8. Ran a deep-scrub on the problematic PG (the inconsistency went away)
9. Checked the S3 bucket with the --fix option
10. Put the S3 object back via S3
11. Ran a deep-scrub again, checked for the object on the OSDs, etc., to be sure it exists and can be accessed

(Rough commands for steps 3-7 and 9 are sketched at the very bottom of this mail, below the quoted thread.)

Thanks, guys, for the ideas!

Arvydas

On Tue, Aug 14, 2018 at 10:24 PM, Thomas White <tho...@thomaswhite.se> wrote:

> Hi Arvydas,
>
> The error seems to suggest this is not an issue with your object data, but the expected object digest data. I am unable to access where I stored my very hacky diagnosis process for this, but our eventual fix was to locate the bucket or files affected and then rename an object within it, forcing a recalculation of the digest. Depending on the size of the pool, perhaps it would be possible to randomly rename a few files to cause this recalculation to occur and see if this remedies it?
>
> Kind Regards,
>
> Tom
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On Behalf Of* Arvydas Opulskis
> *Sent:* 14 August 2018 12:33
> *To:* Brent Kennedy <bkenn...@cfl.rr.com>
> *Cc:* Ceph Users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Inconsistent PG could not be repaired
>
> Thanks for the suggestion about restarting the OSDs, but that doesn't work either.
>
> Anyway, I managed to fix the second unrepairable PG by getting the object from the OSD and saving it again via rados, but still no luck with the first one.
>
> I think I found the main reason why this doesn't work. It seems the object is not overwritten, even though the rados command returns no errors. I tried to delete the object, but it still stays in the pool untouched. Here is an example of what I see:
>
> # rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
> default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
> # rados -p .rgw.buckets get default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d testfile
> error getting .rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d: (2) No such file or directory
>
> # rados -p .rgw.buckets rm default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
> # rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
> default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
>
> I've never seen this in our Ceph clusters before. Should I report a bug about it? If any of you guys need more diagnostic info, let me know.
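> By the way, if anyone wants to poke at the OSDs directly: the placement of that object can be double-checked with something along these lines (pool and object name are the ones from above), which prints the PG id and the acting OSD set:
>
> # ceph osd map .rgw.buckets default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d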
> Thanks,
> Arvydas
>
> On Tue, Aug 7, 2018 at 5:49 PM, Brent Kennedy <bkenn...@cfl.rr.com> wrote:
>
> Last time I had an inconsistent PG that could not be repaired using the repair command, I looked at which OSDs hosted the PG, then restarted them one by one (usually stopping, waiting a few seconds, then starting them back up). You could also stop them, flush the journal, then start them back up.
>
> If that didn't work, it meant there was data loss and I had to use the ceph-objectstore-tool repair tool to export the objects from a location that had the latest data and import them into the one that had no data. The ceph-objectstore-tool is not a simple thing though and should not be used lightly. When I say data loss, I mean that Ceph thinks the last place written has the data, that place being the OSD that doesn't actually have the data (meaning it failed to write there).
>
> If you want to go that route, let me know, I wrote a how-to on it. It should be the last resort though. I also don't know your setup, so I would hate to recommend something so drastic.
>
> -Brent
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of* Arvydas Opulskis
> *Sent:* Monday, August 6, 2018 4:12 AM
> *To:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Inconsistent PG could not be repaired
>
> Hi again,
>
> after two weeks I've got another inconsistent PG in the same cluster. The OSDs are different from the first PG, and the object cannot be fetched with GET either:
>
> # rados list-inconsistent-obj 26.821 --format=json-pretty
> {
>     "epoch": 178472,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 118920
>             },
>             "errors": [],
>             "union_shard_errors": [
>                 "data_digest_mismatch_oi"
>             ],
>             "selected_object_info": "26:8411bae4:::default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920 client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv 118920 dd cd142aaa od ffffffff alloc_hint [0 0])",
>             "shards": [
>                 {
>                     "osd": 20,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x6b102e59"
>                 },
>                 {
>                     "osd": 44,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x6b102e59"
>                 }
>             ]
>         }
>     ]
> }
>
> # rados -p .rgw.buckets get default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file
> error getting .rgw.buckets/default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: (5) Input/output error
>
> Still struggling with how to solve it. Any ideas, guys?
>
> Thank you
>
> On Tue, Jul 24, 2018 at 10:27 AM, Arvydas Opulskis <zebedie...@gmail.com> wrote:
>
> Hello, Cephers,
>
> after trying different repair approaches I am out of ideas on how to repair this inconsistent PG. I hope someone's sharp eye will notice what I overlooked.
>
> Some info about the cluster:
> CentOS 7.4
> Jewel 10.2.10
> Pool size 2 (yes, I know it's a very bad choice)
> Pool with inconsistent PG: .rgw.buckets
>
> After a routine deep-scrub I found PG 26.c3f in inconsistent status.
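> For reference, inconsistent PGs show up in "ceph health detail" and can also be listed per pool, for example:
>
> # rados list-inconsistent-pg .rgw.buckets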
> While running "ceph pg repair 26.c3f" command and monitoring "ceph -w" log, > I noticed these errors: > > 2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid > 26:fc32a1f1:::default.142609570.87_20180206.093111% > 2frepositories%2fnuget-local%2fApplication%2fCompany. > Application.Api%2fCompany.Application.Api.1.1.1.nupkg. > artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != > data_digest 0x49a34c1f from auth oi 26:e261561a:::default. > 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data- > segmentation.application.131.xxx-jvm.cpu.load%2f2018-05- > 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 > client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 > dd 49a34c1f od ffffffff alloc_hint [0 0]) > > > > 2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid > 26:fc32a1f1:::default.142609570.87_20180206.093111% > 2frepositories%2fnuget-local%2fApplication%2fCompany. > Application.Api%2fCompany.Application.Api.1.1.1.nupkg. > artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != > data_digest 0x49a34c1f from auth oi 26:e261561a:::default. > 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data- > segmentation.application.131.xxx-jvm.cpu.load%2f2018-05- > 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 > client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 > dd 49a34c1f od ffffffff alloc_hint [0 0]) > > > > 2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid 26:fc32a1f1:::default. > 142609570.87_20180206.093111%2frepositories%2fnuget-local% > 2fApplication%2fCompany.Application.Api%2fCompany. > Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head: > failed to pick suitable auth object > > > > ...and same errors about another object on same PG. > > > > Repair failed, so I checked inconsistencies "rados list-inconsistent-obj > 26.c3f --format=json-pretty": > > > > { > > "epoch": 178403, > > "inconsistents": [ > > { > > "object": { > > "name": "default.142609570.87_ > 20180203.020047\/repositories\/docker-local\/yyy\/company. > yyy.api.assets\/1.2.4\/sha256__ce41e5246ead8bddd2a2b5bbb863db > 250f328be9dc5c3041481d778a32f8130d", > > "nspace": "", > > "locator": "", > > "snap": "head", > > "version": 217749 > > }, > > "errors": [], > > "union_shard_errors": [ > > "data_digest_mismatch_oi" > > ], > > "selected_object_info": "26:f4ce1748:::default. > 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data- > segmentation.application.131.xxx-jvm.cpu.load%2f2018-05- > 08T03%3a45%3a15+00%3a00.sha1:head(167944'217749 > client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749 > dd 422f251b od ffffffff alloc_hint [0 0])", > > "shards": [ > > { > > "osd": 30, > > "errors": [ > > "data_digest_mismatch_oi" > > ], > > "size": 40, > > "omap_digest": "0xffffffff", > > "data_digest": "0x551c282f" > > }, > > { > > "osd": 36, > > "errors": [ > > "data_digest_mismatch_oi" > > ], > > "size": 40, > > "omap_digest": "0xffffffff", > > "data_digest": "0x551c282f" > > } > > ] > > }, > > { > > "object": { > > "name": "default.142609570.87_ > 20180206.093111\/repositories\/nuget-local\/Application\/ > Company.Application.Api\/Company.Application.Api.1.1.1. > nupkg.artifactory-metadata\/properties.xml", > > "nspace": "", > > "locator": "", > > "snap": "head", > > "version": 216051 > > }, > > "errors": [], > > "union_shard_errors": [ > > "data_digest_mismatch_oi" > > ], > > "selected_object_info": "26:e261561a:::default. 
> 168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data- > segmentation.application.131.xxx-jvm.cpu.load%2f2018-05- > 05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 > client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 > dd 49a34c1f od ffffffff alloc_hint [0 0])", > > "shards": [ > > { > > "osd": 30, > > "errors": [ > > "data_digest_mismatch_oi" > > ], > > "size": 40, > > "omap_digest": "0xffffffff", > > "data_digest": "0x540e4f8b" > > }, > > { > > "osd": 36, > > "errors": [ > > "data_digest_mismatch_oi" > > ], > > "size": 40, > > "omap_digest": "0xffffffff", > > "data_digest": "0x540e4f8b" > > } > > ] > > } > > ] > > } > > > > > > After some reading, I understand, I needed rados get/put trick to solve > this problem. I couldn't do rados get, because I was getting "no such file" > error, even objects were listed by "rados ls" command, so I got them > directly from OSD. After putting them back to rados (rados commands doesn't > returned any errors) and doing deep-scrub on same PG, problem still > existed. The only thing changed - when I try to get object via rados now I > get "(5) Input/output error". > > > > I tried force object size to 40 (it's real size of both objects) by adding > "-o 40" option to "rados put" command, but with no luck. > > > > Guys, maybe you have other ideas what to try? Why overwriting object > doesn't solve this problem? > > > > Thanks a lot! > > > > Arvydas > > > > > > > > > > > > >
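P.S. In case it is useful to someone hitting the same thing: the commands behind steps 3-7 and 9 above were roughly the following. Treat this only as a sketch: the OSD ids, the filestore data/journal paths and the bucket name are examples from our setup or plain placeholders, and the exact object spec for ceph-objectstore-tool should be taken from its --op list output.

# systemctl stop ceph-osd@36
# ceph-osd -i 36 --flush-journal
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 --journal-path /var/lib/ceph/osd/ceph-36/journal --pgid 26.c3f --op list | grep sha256__ce41e5
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 --journal-path /var/lib/ceph/osd/ceph-36/journal '<object JSON from the list output>' remove
# systemctl start ceph-osd@36

Then the same on the second OSD (30 in our case), followed by:

# ceph pg deep-scrub 26.c3f
# radosgw-admin bucket check --bucket=<bucket name> --check-objects --fix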