On 20.09.2017 16:49, hjcho616 wrote:
Anyone? Can this PG be saved? If not, what are my options?
Regards,
Hong
On Saturday, September 16, 2017 1:55 AM, hjcho616 <hjcho...@yahoo.com>
wrote:
Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1
pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min
30); mds rank 0 has failed; mds cluster is degraded; noout flag(s)
set; no legacy OSD present but 'sortbitwise' flag is not set
Now for PG 1.28... I have looked at all the old OSDs, dead or alive. The
only one with a DIR_* directory for it is osd.4. This appears to be the
metadata pool! 21M of metadata can be quite a bit of stuff, so I would
like to rescue this! But I am not able to start this OSD, and exporting
through ceph-objectstore-tool appears to crash, even with
--skip-journal-replay and --skip-mount-omap (which fail differently). As
I mentioned in an earlier email, that exception-thrown message is bogus...
# ceph-objectstore-tool --op export --pgid 1.28 \
      --data-path /var/lib/ceph/osd/ceph-4 \
      --journal-path /var/lib/ceph/osd/ceph-4/journal \
      --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'
[SNIP]
What can I do to save that PG1.28? Please let me know if you need
more information. So close!... =)
Regards,
Hong
Fixing the 12 inconsistent PGs and the 109 scrub errors is something you
should do first of all.
You can also consider using the paid services of one of the many Ceph
support companies that specialize in this kind of situation.
--
That being said, here are some suggestions...
When it comes to lost object recovery you have come about as far as I
have ever experienced, so everything from here on is just assumptions and
wild guesswork about what you can try. I hope others shout out if I tell
you wildly wrong things.
If you have found data for pg 1.28 on the broken OSD, and have checked
all the other working and non-working drives for that PG, then you need
to try to extract the PG from the broken drive. As always in recovery
cases, take a dd clone of the drive and work from the cloned image, to
avoid more damage to the drive and to allow you to try multiple times.
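Something along these lines (device and paths are examples only, adjust
to your setup):

dd if=/dev/sdX of=/mnt/rescue/osd4.img bs=4M conv=noerror,sync
# or, friendlier to a dying drive, GNU ddrescue:
# ddrescue -d /dev/sdX /mnt/rescue/osd4.img /mnt/rescue/osd4.map
# if you cloned the osd data partition, loop-mount the image and work on that:
mount -o loop /mnt/rescue/osd4.img /mnt/osd4-clone

conv=noerror,sync keeps dd going past read errors (padding bad blocks
with zeros) instead of aborting.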
You should add a temporary injection drive large enough for that PG, and
set its crush weight to 0 so it always drains. Make sure it is up and
registered properly in Ceph.
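For example (osd.99 is just a placeholder for whatever id the new OSD
gets):

ceph osd crush reweight osd.99 0    # weight 0: data only flows out, never in
ceph osd tree                       # verify it shows up, is up, with weight 0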
The idea is to copy the PG manually from the broken OSD to the injection
drive, since the export/import fails, making sure you get all xattrs
included. One can either copy the whole PG or just the "missing"
objects; if there are few objects I would go for the objects, if there
are many I would take the whole PG. You won't get the data that lives in
leveldb (the omap), so I am not at all sure this would work, but it is
worth a shot.
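You can sanity-check that the xattrs survived on the clone first; the
path here is illustrative:

getfattr -d -e hex /mnt/osd4-clone/current/1.28_head/<some object file>
# expect at least a user.ceph._ entry; if the xattrs are missing,
# the osd will not accept the objects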
- Stop your injection OSD; verify it is down and the process is not running.
- From the mountpoint of your broken OSD, go into the current directory
and tar up pg 1.28. Make sure you use -p and --xattrs when you create
the archive (see the sketch after this list).
- If tar errors out on unreadable files, just rm those (since you are
working on a copy of your rescue image, you can always try again).
- Copy the tar file to the injection drive and extract it while sitting
in its current directory (remember --xattrs).
- Set debug options for the injection OSD in ceph.conf.
- Start the injection OSD and follow along in the log file. Hopefully it
should scan, locate the PG, and replicate the pg 1.28 objects off to the
current primary drive for pg 1.28; and since it has crush weight 0 it
should drain out.
- If that works, verify the injection drive is drained, stop it, and
remove it from Ceph. Zap the drive.
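Roughly, as a sketch only (osd ids, mountpoints and debug values are
assumptions, adjust to your cluster):

systemctl stop ceph-osd@99              # or your init system's equivalent
cd /mnt/osd4-clone/current
tar -cpf /tmp/pg1.28.tar --xattrs 1.28_head
cd /var/lib/ceph/osd/ceph-99/current
tar -xpf /tmp/pg1.28.tar --xattrs
chown -R ceph:ceph 1.28_head            # jewel+ osds run as the ceph user

# in ceph.conf, under [osd.99], something like:
#   debug osd = 20
#   debug filestore = 20

systemctl start ceph-osd@99
tail -f /var/log/ceph/ceph-osd.99.log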
This is all, as I said, guesstimates, so your mileage may vary.
good luck
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com