OK, that's good (as far as it goes, being a manual process).
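
For the archives, that manual process is roughly this (pool, object
and pg names purely illustrative):

    # find which pg holds the object, and its acting osds
    ceph osd map rbd myobject
    # on the primary, remove the bad copy from the osd's filesystem
    # (presumably with the osd stopped), under something like
    #   /var/lib/ceph/osd/ceph-0/current/<pgid>_head/
    # then bring the osd back and repair the pg from a replica
    ceph pg repair 2.3f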

So then, back to what I think was Mihály's original issue:

> pg repair or deep-scrub can not fix this issue. But if I
> understand correctly, the osd has to know it can not retrieve
> the object from osd.0, and it needs to be replicated to another
> osd because there are no longer 3 working replicas.

Given that a bad checksum and/or read error tells ceph an object
is corrupt, it would seem a natural step for ceph to then
automatically use another copy with a good checksum, and even
rewrite the corrupt object, either in normal operation or during
a scrub or repair.
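
That is, instead of an admin having to notice the inconsistency
and repair it by hand, e.g. (pg id purely illustrative):

    ceph pg deep-scrub 2.3f    # detect the inconsistency
    ceph pg repair 2.3f        # rewrite the bad copy from a replica

ceph could do the equivalent itself as soon as it sees the bad
checksum.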

Is there a reason this isn't done, apart from a lack of round
tuits (i.e. developer time)?

Cheers,

Chris


On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
> 
> No, you wouldn’t need to re-replicate the whole disk for a single bad
> sector.  The way to deal with that, if the object is on the primary, is
> to remove the file manually from the OSD’s filesystem and then perform
> a repair of the PG that holds that object.  This will copy the object
> back from one of the replicas.
> 
> David
> 
> On Nov 17, 2013, at 10:46 PM, Chris Dunlop <ch...@onthe.net.au> wrote:
> 
>> Hi David,
>> 
>> On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
>>> 
>>> Replication does not occur until the OSD is “out.”  This creates a new 
>>> mapping in the cluster of where the PGs should be, and thus data begins to 
>>> move and/or create sufficient copies.  This scheme lets you control how and 
>>> when you want the replication to occur.  If you have plenty of space and 
>>> you aren’t going to replace the drive immediately, just mark the OSD “down” 
>>> AND “out.”  If you are going to replace the drive immediately, set the 
>>> “noout” flag, take the OSD “down”, and replace the drive.  Assuming it is 
>>> mounted in the same place as the bad drive, bring the OSD back up.  This 
>>> will replicate exactly the same PGs the bad drive held back to the 
>>> replacement drive.  As was stated before, don’t forget to “ceph osd unset 
>>> noout”.
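>>> 
>>> For example, with a hypothetical osd.12 and sysvinit (adjust for
>>> your setup):
>>> 
>>>     ceph osd set noout
>>>     service ceph stop osd.12    # take the OSD “down”
>>>     # ... replace the drive, mount it in the same place ...
>>>     service ceph start osd.12
>>>     ceph osd unset noout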
>>> 
>>> Keep in mind that in the case of a machine that has a hardware failure and 
>>> takes OSD(s) down, there is an automatic timeout which will mark them “out” 
>>> for unattended operation.  Unless you are monitoring the cluster 24/7, you 
>>> should have enough disk space available to handle failures.
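>>> 
>>> If memory serves, that timeout is the “mon osd down out interval”
>>> setting (300 seconds by default), e.g. in ceph.conf:
>>> 
>>>     [mon]
>>>     mon osd down out interval = 600    ; example: wait 10 minutes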
>>> 
>>> Related info in:
>>> 
>>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>> 
>>> David Zafman
>>> Senior Developer
>>> http://www.inktank.com
>> 
>> 
>> Are you saying that, if a disk suffers a bad sector in an object
>> for which it's the primary, and for which good data exists on other
>> replica PGs, there's no way for ceph to recover other than by
>> (re-)replicating the whole disk?
>> 
>> I.e., even if the disk is able to remap the bad sector using a
>> spare, so the disk is ok (albeit missing a sector's worth of
>> object data), the only way to recover is to basically blow away
>> all the data on that disk and start again, replicating
>> everything back to the disk (or to other disks)?
>> 
>> Cheers,
>> 
>> Chris.