I can also confirm that after upgrading to Firefly, both of our clusters (test 
and live) went from 0 scrub errors each over about 6 months to about 9-12 
per week...
This also makes me kind of nervous, since as far as I know all that "ceph pg 
repair" does is copy the primary object to all replicas, no matter which 
object is the correct one.
Of course the described method of manual checking works (for pools with more 
than 2 replicas), but doing this in a large cluster nearly every week is 
horribly time-consuming and error-prone.
It would be great to get an explanation for the increased numbers of scrub 
errors since firefly. Were they just not detected correctly in previous 
versions? Or is there maybe something wrong with the new code?

Actually, our company is currently preventing our projects from moving to Ceph 
because of this problem.

Regards,
Christian
________________________________
From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Travis 
Rhoden [trho...@gmail.com]
Sent: Thursday, 10 July 2014 16:24
To: Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

And actually just to follow-up, it does seem like there are some additional 
smarts beyond just using the primary to overwrite the secondaries...  Since I 
captured md5 sums before and after the repair, I can say that in this 
particular instance, the secondary copy was used to overwrite the primary.  So, 
I'm just trusting Ceph to do the right thing, and so far it seems to, but the 
comments here about needing to determine the correct object and place it on the 
primary PG make me wonder if I've been missing something.

 - Travis


On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <trho...@gmail.com> wrote:
I can also say that after a recent upgrade to Firefly, I have experienced a 
massive uptick in scrub errors.  The cluster was on Cuttlefish for about a 
year, and had maybe one or two scrub errors.  After upgrading to Firefly, we've 
probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a day for a 
few weeks until the whole cluster was rescrubbed, it seemed).

What I cannot determine, however, is which copy of an object is busted.  For 
example, just today I ran into a scrub error.  The object has two copies and is 
an 8MB piece of an RBD, and the copies have identical timestamps and identical 
xattr names and values.  But they definitely have different MD5 sums. How can I 
know which one is correct?
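
For reference, the checksum comparison described here can be scripted. The following is a minimal Python sketch, not a Ceph tool. It assumes the copies of the object have already been made readable as ordinary files (the paths passed in are hypothetical), and it can only point at a likely-good copy when a strict majority of copies share the same MD5 digest.

```python
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Map each MD5 digest to the list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(path)
    return dict(groups)


def majority_copies(paths):
    """Return the paths whose digest is held by a strict majority, else None."""
    for members in group_by_checksum(paths).values():
        if len(members) * 2 > len(paths):
            return members
    return None
```

Called with two copies that differ, `majority_copies` returns `None`: checksums alone show that the copies disagree, not which one is correct. With three or more copies, a majority is a hint rather than proof.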

I've been just kicking off pg repair each time, which seems to just use the 
primary copy to overwrite the others.  Haven't run into any issues with that so 
far, but it does make me nervous.

 - Travis


On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <g...@inktank.com> wrote:
It's not very intuitive or easy to look at right now (there are plans
from the recent developer summit to improve things), but the central
log should have output about exactly what objects are busted. You'll
then want to compare the copies manually to determine which ones are
good or bad, get the good copy on the primary (make sure you preserve
xattrs), and run repair.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
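
The reminder to preserve xattrs suggests comparing them between copies before and after moving anything. A minimal Python sketch, assuming Linux (where `os.listxattr` and `os.getxattr` are available) and assuming both copies are readable as ordinary files at paths you supply. The function names are made up for illustration and are not part of Ceph.

```python
import os


def read_xattrs(path):
    """Return {name: value} for all extended attributes on a file (Linux only)."""
    return {name: os.getxattr(path, name) for name in os.listxattr(path)}


def diff_xattrs(first, second):
    """Compare two xattr dicts; report names missing on either side or differing."""
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "different_value": sorted(
            name for name in set(first) & set(second) if first[name] != second[name]
        ),
    }
```

Passing the results of `read_xattrs` for two copies to `diff_xattrs` gives three empty lists when the attribute names and values match exactly.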


On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsm...@adams.edu> wrote:
> Greetings,
>
> I upgraded to firefly last week and I suddenly received this error:
>
> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>
> ceph health detail shows the following:
>
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 3.c6 is active+clean+inconsistent, acting [2,5]
> 1 scrub errors
>
> The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want
> to know is what are the risks of data loss if I run that command in this
> state and how can I mitigate them?
>
> --
> Randall Smith
> Computing Services
> Adams State University
> http://www.adams.edu/
> 719-587-7741
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
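
The `ceph health detail` output quoted in the original message already names the affected PG and its acting set, so it can be picked apart mechanically. A minimal Python sketch, assuming the output has the same line format as the sample in this thread (other Ceph versions may print it differently) and assuming the first OSD listed in the acting set is the primary. The helper name `inconsistent_pgs` is hypothetical.

```python
import re

# Matches lines such as: "pg 3.c6 is active+clean+inconsistent, acting [2,5]"
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]*)\]")


def inconsistent_pgs(health_detail):
    """Return (pg_id, acting_osds) for every PG whose state includes 'inconsistent'."""
    found = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line.strip())
        if not match:
            continue
        pg_id, state, acting = match.groups()
        if "inconsistent" in state.split("+"):
            osds = [int(osd) for osd in acting.split(",") if osd]
            found.append((pg_id, osds))
    return found


# Sample text copied from the message in this thread.
sample = """HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 3.c6 is active+clean+inconsistent, acting [2,5]
1 scrub errors"""

for pg_id, osds in inconsistent_pgs(sample):
    # Assumption: the first OSD in the acting set is the primary.
    print(f"pg {pg_id}: primary osd.{osds[0]}, replicas {osds[1:]}")
```

For the sample above this prints `pg 3.c6: primary osd.2, replicas [5]`, which tells you on which two OSDs to look for the copies of the object.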
