We have a smallish cluster that has been expanding, with a 50% increase in 
the number of OSDs (16 -> 24).

This has caused some issues with data redundancy and cluster performance as we 
have increased the PG count and added OSDs.

8x nodes with 3x drives each, connected over 2x10G.

My problem is that I have PGs that have become grossly undersized 
(size=3, min_size=2), in some cases down to just 1 copy, which has created a 
deadlock: client I/O backs up behind any PG that no longer has enough copies 
to satisfy min_size.
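
A minimal sketch of what I mean by finding the worst PGs, assuming the ceph CLI 
with an admin keyring and `ceph pg dump pgs --format json` (the JSON layout 
differs a little between releases, so both a bare list and a "pg_stats" wrapper 
are handled):

    #!/usr/bin/env python3
    # Rough sketch: list undersized PGs, fewest surviving copies first.
    import json
    import subprocess

    out = subprocess.check_output(["ceph", "pg", "dump", "pgs", "--format", "json"])
    data = json.loads(out)
    pgs = data if isinstance(data, list) else data.get("pg_stats", [])

    # A PG is in trouble when its acting set is smaller than size=3;
    # an acting set of length 1 means a single surviving copy.
    worst = sorted(
        (len(pg["acting"]), pg["pgid"], pg["state"])
        for pg in pgs
        if "undersized" in pg["state"]
    )
    for copies, pgid, state in worst:
        print(f"{pgid}  copies={copies}  state={state}")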

It has been backfilling and recovering at a steady pace, but it seems that all 
backfills are weighted equally, so the most serious PGs can land at the front 
of the queue or at the very end, with no apparent rhyme or reason.
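
Along the same lines, a rough sketch of splitting the degraded PGs into "being 
worked on" vs. "still waiting", to see what the queue is actually servicing 
(same assumptions as above; older releases spell backfill_wait as 
wait_backfill, so both are checked):

    #!/usr/bin/env python3
    # Rough sketch: bucket PGs by whether backfill/recovery is active or queued.
    import json
    import subprocess

    out = subprocess.check_output(["ceph", "pg", "dump", "pgs", "--format", "json"])
    data = json.loads(out)
    pgs = data if isinstance(data, list) else data.get("pg_stats", [])

    active, waiting = [], []
    for pg in pgs:
        state = pg["state"]
        if "backfilling" in state or "recovering" in state:
            active.append(pg["pgid"])
        elif any(s in state for s in ("backfill_wait", "wait_backfill", "recovery_wait")):
            waiting.append(pg["pgid"])

    print(f"active:  {len(active)} PGs: {active}")
    print(f"waiting: {len(waiting)} PGs: {waiting}")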

This has been exacerbated by a failing, but not yet failed, OSD, which I have 
marked out but left up, in an attempt to let it move its data off gracefully 
without taking on new I/O.
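
For reference, the "out but still up" approach amounts to something like this 
sketch (osd.12 is a hypothetical ID, substitute the failing drive; dropping 
primary-affinity to 0 is an extra step that keeps it from acting as primary 
for new client I/O while it can still serve as a backfill source):

    #!/usr/bin/env python3
    # Rough sketch: drain a failing-but-alive OSD without taking it down.
    import subprocess

    FAILING_OSD = "12"  # hypothetical OSD id

    # Mark it out so CRUSH stops mapping data to it and backfill moves its
    # data elsewhere, while the daemon stays up as a backfill source.
    subprocess.check_call(["ceph", "osd", "out", FAILING_OSD])

    # Stop it from being elected primary, so new client I/O is served by
    # the healthy replicas.
    subprocess.check_call(["ceph", "osd", "primary-affinity",
                           f"osd.{FAILING_OSD}", "0"])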

I guess my question would be: is there a way to get the most important/critical 
recovery/backfill operations completed ahead of the less important/critical 
ones? i.e., tackle the 1-copy PGs that are blocking I/O ahead of the less-used 
PGs that already have 2 copies and are only backfilling their 3rd.
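
For instance, if newer releases have something along the lines of 
`ceph pg force-recovery` / `ceph pg force-backfill` (which I believe landed in 
Luminous), I would want to do roughly this (a sketch only, assuming those 
commands and the same pg dump JSON handling as above):

    #!/usr/bin/env python3
    # Rough sketch: push the single-copy PGs to the front of the queue.
    import json
    import subprocess

    out = subprocess.check_output(["ceph", "pg", "dump", "pgs", "--format", "json"])
    data = json.loads(out)
    pgs = data if isinstance(data, list) else data.get("pg_stats", [])

    single_copy = [pg["pgid"] for pg in pgs
                   if "undersized" in pg["state"] and len(pg["acting"]) == 1]

    if single_copy:
        # force-backfill would be the analogous call for PGs waiting on backfill.
        subprocess.check_call(["ceph", "pg", "force-recovery"] + single_copy)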

Thanks,

Reed