Hello,
First of all, I would recommend that you use ceph pg repair wherever
you can.
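Something like this should list the inconsistent pgs and let you kick
off a repair (the pool name and pg ids below are placeholders):

    ceph health detail | grep inconsistent
    rados list-inconsistent-pg <pool>
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>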
When you have size=3 the cluster can compare three replicas, so it is
easier for it to spot which two are good and which one is bad.
When you use size=2 the case is harder in oh-so-many ways:
- According to the documentation it is harder to determine which object
is the faulty one.
- If an OSD dies, the increased load (caused by the missing OSD) and the
extra IO from the recovery process hit the remaining OSDs much harder,
increasing the chance that another OSD dies (because of a disk failure
caused by the sudden spike of extra load), and then you lose your data.
- If there is bitrot in the one remaining replica, then you do not have
any valid copy of your data.
So, to summarize, the experts say that it is MUCH safer to have
size=3 min_size=2 (I'm far from an expert, I'm just quoting :))
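You can check what your pools are currently set to with something like
(pool name is a placeholder):

    ceph osd pool get <pool> size
    ceph osd pool get <pool> min_size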
So, back to the task at hand:
If you have repaired all the pgs that you could with ceph pg repair,
there is a manual recovery process (unfortunately written for filestore):
http://ceph.com/geen-categorie/ceph-manually-repair-object/
The good news is that there is a FUSE client for bluestore too, so you
can mount it by hand and repair it as described in the linked document.
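I have not done this on bluestore myself, but I believe the rough idea
is to stop the OSD, mount its object store via ceph-objectstore-tool's
fuse support, and then clean up the bad copy as in the article (osd id,
mountpoint and pg id are placeholders, so treat this as a sketch):

    ceph osd set noout
    systemctl stop ceph-osd@<id>
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --op fuse --mountpoint /mnt/osd-fuse
    # find and move away / remove the bad object copy, as in the article
    fusermount -u /mnt/osd-fuse
    systemctl start ceph-osd@<id>
    ceph osd unset noout
    ceph pg repair <pgid>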
I think you could run ceph osd pool set [pool] size 3 to increase the
copy count, but before that you should be certain that you have enough
free space and that you will not hit the per-OSD pg count limits.
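Something along these lines (pool name is a placeholder, and do check
the free space output before committing to it):

    ceph df
    ceph osd df
    ceph osd pool set <pool> size 3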
[DISCLAIMER]:
I have never done this, and I too have questions about this topic:
[Questions to the list]
How is it possible that the cluster cannot repair itself with ceph pg
repair?
Are there no good copies remaining?
Can it not decide which copy is valid or up to date?
If so, why not, when there is a checksum and an mtime for everything?
In this inconsistent state, which object does the cluster serve when it
doesn't know which one is the valid one?
Isn't there a way to do a more "online" repair?
A way to examine and remove objects while the OSD is running?
Or better yet, to tell the cluster which copy should be used when
repairing?
There is a command, ceph pg force-recovery, but I cannot find
documentation for it.
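As far as I can tell it is invoked simply like this (pg id is a
placeholder), but since I could not find documentation, treat it as a
guess:

    ceph pg force-recovery <pgid>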
Kind regards,
Denes Dolhay.
On 10/28/2017 01:05 PM, Mario Giammarco wrote:
Hello,
we recently upgraded two clusters to Ceph luminous with bluestore and
we discovered that we have many more pgs in state
active+clean+inconsistent. (Possible data damage, xx pgs inconsistent)
This is probably due to the checksums in bluestore, which detect more errors.
We have some pools with replica 2 and some with replica 3.
I have read past forum threads and I have seen that Ceph does not
automatically repair inconsistent pgs.
Even manual repair sometimes fails.
I would like to understand if I am losing my data:
- with replica 2 I hope that ceph chooses the right replica by looking
at checksums
- with replica 3 I hope that there are no problems at all
How can I tell ceph to simply create the second replica in another place?
Because I suppose that with replica 2 and inconsistent pgs I have only
one copy of data.
Thank you in advance for any help.
Mario
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com