OK, that's good (as far as it goes, being a manual process).
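For the archives, I understand that manual process to be roughly the following (the pool name, object name, pg id 2.5f, osd.0 and the filestore paths are all illustrative):

    # which PG holds the object, and which OSDs serve it?
    ceph osd map rbd rbd_data.1234abcd.0000000000000000
    # on the primary's host, locate and remove the corrupt file
    # (filestore layout: objects live under current/<pgid>_head)
    find /var/lib/ceph/osd/ceph-0/current/2.5f_head/ -name '*1234abcd*' -delete
    # copy the object back from a surviving replica
    ceph pg repair 2.5f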
So then, back to what I think was Mihály's original issue:

> pg repair or deep-scrub can not fix this issue. But if I
> understand correctly, osd has to known it can not retrieve
> object from osd.0 and need to be replicate an another osd
> because there is no 3 working replicas now.

Given that a bad checksum and/or read error tells ceph that an object is corrupt, it would seem a natural step to have ceph automatically use another good-checksum copy, and even rewrite the corrupt object, either in normal operation or under a scrub or repair. Is there a reason this isn't done, apart from lack of tuits?

Cheers,

Chris

On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
>
> No, you wouldn’t need to re-replicate the whole disk for a single bad sector.
> The way to deal with that, if the object is on the primary, is to remove the
> file manually from the OSD’s filesystem and perform a repair of the PG that
> holds that object. This will copy the object back from one of the replicas.
>
> David
>
> On Nov 17, 2013, at 10:46 PM, Chris Dunlop <ch...@onthe.net.au> wrote:
>
>> Hi David,
>>
>> On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
>>>
>>> Replication does not occur until the OSD is “out.” This creates a new
>>> mapping in the cluster of where the PGs should be, and thus data begins to
>>> move and/or create sufficient copies. This scheme lets you control how and
>>> when you want the replication to occur. If you have plenty of space and
>>> you aren’t going to replace the drive immediately, just mark the OSD “down”
>>> AND “out.” If you are going to replace the drive immediately, set the
>>> “noout” flag, take the OSD “down”, and replace the drive. Assuming it is
>>> mounted in the same place as the bad drive, bring the OSD back up. This
>>> will replicate exactly the same PGs the bad drive held back to the
>>> replacement drive. As was stated before, don’t forget to “ceph osd unset
>>> noout”.
>>>
>>> Keep in mind that in the case of a machine that has a hardware failure and
>>> takes OSD(s) down, there is an automatic timeout which will mark them “out”
>>> for unattended operation. Unless you are monitoring the cluster 24/7, you
>>> should have enough disk space available to handle failures.
>>>
>>> Related info in:
>>>
>>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>
>>> David Zafman
>>> Senior Developer
>>> http://www.inktank.com
>>
>>
>> Are you saying, if a disk suffers from a bad sector in an object
>> for which it's primary, and for which good data exists on other
>> replica PGs, there's no way for ceph to recover other than by
>> (re-)replicating the whole disk?
>>
>> I.e., even if the disk is able to remap the bad sector using a
>> spare, so the disk is ok (albeit missing a sector's worth of
>> object data), the only way to recover is to basically blow away
>> all the data on that disk and start again, replicating
>> everything back to the disk (or to other disks)?
>>
>> Cheers,
>>
>> Chris.
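P.S. For anyone finding this thread later, the drive-replacement procedure David describes above comes down to roughly these commands (osd.0 is illustrative, and the service invocations assume an upstart-based install):

    # replacing the drive immediately: stop CRUSH from rebalancing first
    ceph osd set noout
    stop ceph-osd id=0
    # ... swap the drive, mkfs, mount it at /var/lib/ceph/osd/ceph-0 ...
    start ceph-osd id=0
    # once the OSD is back up and its PGs have recovered
    ceph osd unset noout

    # not replacing immediately: mark the OSD out and let the cluster
    # re-replicate its PGs elsewhere
    ceph osd out 0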