I would only tar the PG you have missing objects from; trying to inject
older objects into a PG that is already correct cannot do any good.
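For example, a minimal sketch, assuming pg 1.28 is the one you are
missing (paths are examples only). The other entries under current/
(commit_op_seq, meta, nosnap, omap) belong to the old OSD itself and
are not something you want to inject:

  cd /var/lib/ceph/osd/ceph-4/current
  tar --xattrs --preserve-permissions -zcvf ~/pg1.28.tar.gz ./1.28_head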
Scrub errors are kind of the issue with only 2 replicas: when you have 2
different objects, how do you know which one is correct and which one is
bad? As you have read on
http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
you need to:
- find the pg :: rados list-inconsistent-pg [pool]
- find the problem :: rados list-inconsistent-obj 0.6
--format=json-pretty ; this gives you the object name, look for hints
as to which copy is the bad one
- find the object :: manually check the objects: check the object
metadata, run md5sum on all copies and compare. Check the objects on the
non-running OSDs and compare there as well; anything that helps you
determine which copy is OK and which is bad.
- fix the problem :: assuming you find the bad object, stop the OSD
holding the bad copy, remove the object manually, restart the OSD, and
issue the repair command.
If the rados commands do not give you the info you need, do it all
manually as described on http://ceph.com/geen-categorie/ceph-manually-repair-object/
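Roughly, the manual route looks something like this (the pool name, osd
ids and object name below are only examples taken from your log; verify
everything carefully before deleting anything):

  rados list-inconsistent-pg data
  rados list-inconsistent-obj 0.29 --format=json-pretty

  # compare the copies on every osd holding pg 0.29
  find /var/lib/ceph/osd/ceph-6/current/0.29_head -name '*200014ce4c3*' -exec md5sum {} +
  find /var/lib/ceph/osd/ceph-7/current/0.29_head -name '*200014ce4c3*' -exec md5sum {} +

  # once you are confident which copy is bad
  systemctl stop ceph-osd@6        # or your init system's equivalent
  rm /var/lib/ceph/osd/ceph-6/current/0.29_head/<bad object file>
  systemctl start ceph-osd@6
  ceph pg repair 0.29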
good luck
Ronny Aasen
On 20.09.2017 22:17, hjcho616 wrote:
Thanks Ronny.
I decided to try to tar everything under the current directory. Is this
the correct command for it? Is there any directory we do not want on the
new drive? commit_op_seq, meta, nosnap, omap?
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .
As far as inconsistent PGs go... I am running into these errors. I
tried moving one copy of a pg to another location, but it just says the
moved shard is missing. Tried setting 'noout' and taking one of them
down; it seems to work on something, but then it is back to the same
error. Currently trying to move to a different osd... making sure the
drive is not faulty; I have a few of them... but the errors still
persist. I've been kicking off ceph pg repair PG#, hoping it would fix
them. =P Any other suggestions?
2017-09-20 09:39:48.481400 7f163c5fa700 0 log_channel(cluster) log
[INF] : 0.29 repair starts
2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log
[ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.0000028f:head
data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi
0:97126ead:::200014ce4c3.0000028f:head(19366'539375
client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv
539375 dd 979f2ed4 od ffffffff alloc_hint [0 0])
2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log
[ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.0000028f:head
data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi
0:97126ead:::200014ce4c3.0000028f:head(19366'539375
client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv
539375 dd 979f2ed4 od ffffffff alloc_hint [0 0])
2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log
[ERR] : 0.29 soid 0:97126ead:::200014ce4c3.0000028f:head: failed to
pick suitable auth object
2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log
[ERR] : 0.29 shard 6: soid 0:97d5c15a:::100000101b4.00006892:head
data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi
0:97d5c15a:::100000101b4.00006892:head(12962'65557 osd.4.0:42234
dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od ffffffff
alloc_hint [0 0])
2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log
[ERR] : 0.29 shard 7: soid 0:97d5c15a:::100000101b4.00006892:head
data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi
0:97d5c15a:::100000101b4.00006892:head(12962'65557 osd.4.0:42234
dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od ffffffff
alloc_hint [0 0])
2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log
[ERR] : 0.29 soid 0:97d5c15a:::100000101b4.00006892:head: failed to
pick suitable auth object
2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log
[ERR] : 0.29 repair 4 errors, 0 fixed
Latest health...
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck
inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed;
mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag
is not set
Regards,
Hong
On Wednesday, September 20, 2017 11:53 AM, Ronny Aasen
<ronny+ceph-us...@aasen.cx> wrote:
On 20.09.2017 16:49, hjcho616 wrote:
Anyone? Can this pg be saved? If not, what are my options?
Regards,
Hong
On Saturday, September 16, 2017 1:55 AM, hjcho616
<hjcho...@yahoo.com> wrote:
Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive;
1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min
30); mds rank 0 has failed; mds cluster is degraded; noout flag(s)
set; no legacy OSD present but 'sortbitwise' flag is not set
Now PG1.28.. Looking at all old osds, dead or alive, the only one with
a DIR_* directory for it is osd.4. This appears to be the metadata
pool! 21M of metadata can be quite a bit of stuff.. so I would like to
rescue this! But I am not able to start this OSD, and exporting through
ceph-objectstore-tool appears to crash, even with
--skip-journal-replay and --skip-mount-omap (different failure). As I
mentioned in an earlier email, that 'exception thrown' message is bogus...
# ceph-objectstore-tool --op export --pgid 1.28 --data-path
/var/lib/ceph/osd/ceph-4 --journal-path
/var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'
[SNIP]
What can I do to save that PG1.28? Please let me know if you need
more information. So close!... =)
Regards,
Hong
12 inconsistent PGs and 109 scrub errors are something you should fix
first of all.
Also, you can consider using the paid services of one of the many Ceph
support companies that specialize in these kinds of situations.
--
That being said, here are some suggestions...
When it comes to lost object recovery you have come about as far as I
have ever experienced, so everything after this point is just
assumptions and wild guesswork about what you can try. I hope others
shout out if I tell you wildly wrong things.
If you have found data for pg 1.28 on the broken OSD, and have checked
all other working and non-working drives for that PG, then you need to
try and extract the PG from the broken drive. As always in recovery
cases, take a dd clone of the drive and work from the cloned image, to
avoid more damage to the drive and to allow you to try multiple times.
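For example, something along these lines (device name and mount points
are made up; ddrescue copes better with read errors on a failing disk
than plain dd):

  ddrescue -r3 /dev/sdX /mnt/rescue/osd4.img /mnt/rescue/osd4.map
  mount -o loop,ro /mnt/rescue/osd4.img /mnt/broken-osd   # add norecovery/nouuid if xfs complains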
You should add a temporary injection drive large enough for that PG,
and set its crush weight to 0 so it always drains. Make sure it is up
and registered properly in Ceph.
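For example (the osd id is whatever ceph assigns to the new injection osd):

  ceph osd crush reweight osd.<id> 0
  ceph osd tree | grep osd.<id>        # confirm it is up and has crush weight 0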
The idea is to copy the PG manually from the broken OSD to the
injection drive, since the export/import fails, making sure you get all
xattrs included. One can either copy the whole PG, or just the "missing"
objects; if there are few objects I would go for that, if there are
many I would take the whole PG. You won't get data from leveldb, so I
am not at all sure this would work, but it is worth a shot (there is a
rough command sketch after the steps below).
- stop your injection osd, verify it is down and the process is not running.
- from the mountpoint of your broken osd, go into the current directory
and tar up pg 1.28; make sure you use -p and --xattrs when you create
the archive.
- if tar errors out on unreadable files, just rm those (since you are
working on a copy of your rescue image, you can always try again)
- copy the tar file to the injection drive and extract it while sitting
in its current directory (remember --xattrs)
- set debug options for the injection drive in ceph.conf
- start the injection drive, and follow along in the log file.
Hopefully it should scan, locate the pg, and replicate the pg 1.28
objects off to the current primary drive for pg 1.28, and since it has
crush weight 0 it should drain out.
- if that works, verify the injection drive is drained, stop it and
remove it from ceph. Zap the drive.
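A rough sketch of the copy itself, with made-up mount points and osd id:

  # on the mounted clone of the broken osd
  cd /mnt/broken-osd/current
  tar --xattrs -cpzf /tmp/pg1.28.tar.gz ./1.28_head

  # on the injection osd (while it is stopped)
  systemctl stop ceph-osd@<id>
  cd /var/lib/ceph/osd/ceph-<id>/current
  tar --xattrs -xpzf /tmp/pg1.28.tar.gz
  chown -R ceph:ceph 1.28_head      # only if your osds run as the ceph user
  systemctl start ceph-osd@<id>
  tail -f /var/log/ceph/ceph-osd.<id>.log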
This is all, as I said, guesstimates, so your mileage may vary.
good luck
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com