Yes, if you set it back to 5, then every time you lose an OSD you will have to drop it to 4 and let the rebuild take place before putting it back to 5.
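For reference, that toggle is just the pool's min_size setting. A minimal sketch, assuming the pool name "media" from your health output:

---
# allow I/O again with only k=4 chunks available
ceph osd pool set media min_size 4

# watch the backfill until the affected PGs are active+clean again
ceph -s

# then restore the safer value
ceph osd pool set media min_size 5
---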
I guess it is all down to how important 100% uptime is to you: keeping min_size at 5 and manually monitoring the backfill / fixing the OSD / replacing the OSD after dropping to 4, versus leaving it at 4 so recovery happens automatically but a further OSD loss becomes a risk. If you have the space I'd suggest going to 4 + 2 and migrating your data; this would remove the ongoing issue and give you some extra protection against OSD loss (a rough sketch of the commands is at the bottom of this mail).

On Wed, Dec 12, 2018 at 11:43 AM David Young <funkypeng...@protonmail.com> wrote:

> (accidentally forgot to reply to the list)
>
> Thank you, setting min_size to 4 allowed I/O again, and the 39 incomplete
> PGs are now:
>
> 39 active+undersized+degraded+remapped+backfilling
>
> Once backfilling is done, I'll increase min_size to 5 again.
>
> Am I likely to encounter this issue whenever I lose an OSD (I/O freezes
> and manually reducing min_size is required), and is there anything I
> should be doing differently?
>
> Thanks again!
> D
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, December 12, 2018 3:31 PM, Ashley Merrick <singap...@amerrick.co.uk> wrote:
>
> With EC, min_size is set to K + 1.
>
> Generally EC is used with an M of 2 or more; the reason min_size is set
> to K + 1 is that otherwise you are in a state where a further OSD loss
> will leave some PGs without at least K chunks available, as you only
> have 1 extra M.
>
> As per the error, you can get your pool back online by setting min_size
> to 4.
>
> However, this would only be a temporary fix while you get the OSD back
> online / rebuilt, so you can then go back to your 4 + 1 state.
>
> ,Ash
>
> On Wed, 12 Dec 2018 at 10:27 AM, David Young <funkypeng...@protonmail.com> wrote:
>
>> Hi all,
>>
>> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1.
>>
>> I lost osd38, and now I have 39 incomplete PGs.
>>
>> ---
>> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs incomplete
>>     pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>     pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>     pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>     pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>     pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> <snip>
>> ---
>>
>> My EC profile is below:
>>
>> ---
>> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
>> crush-device-class=
>> crush-failure-domain=osd
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=4
>> m=1
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>> ---
>>
>> When I query one of the incomplete PGs, I see this:
>>
>> ---
>> "recovery_state": [
>>     {
>>         "name": "Started/Primary/Peering/Incomplete",
>>         "enter_time": "2018-12-11 20:46:11.645796",
>>         "comment": "not enough complete instances of this PG"
>>     },
>> ---
>>
>> And this:
>>
>> ---
>> "probing_osds": [
>>     "0(4)",
>>     "7(2)",
>>     "9(1)",
>>     "11(4)",
>>     "22(3)",
>>     "29(2)",
>>     "36(0)"
>> ],
>> "down_osds_we_would_probe": [
>>     38
>> ],
>> "peering_blocked_by": []
>> },
>> ---
>>
>> I have set this in /etc/ceph/ceph.conf to no effect:
>> osd_find_best_info_ignore_history_les = true
>>
>> As a result of the incomplete PGs, I/O is currently frozen to at least part of my cephfs.
>>
>> I expected to be able to tolerate the loss of an OSD without issue; is there anything I can do to restore these incomplete PGs?
>>
>> When I bring back a new osd38, I see:
>>
>> ---
>> "probing_osds": [
>>     "4(2)",
>>     "11(3)",
>>     "22(1)",
>>     "24(1)",
>>     "26(2)",
>>     "36(4)",
>>     "38(1)",
>>     "39(0)"
>> ],
>> "down_osds_we_would_probe": [],
>> "peering_blocked_by": []
>> },
>> {
>>     "name": "Started",
>>     "enter_time": "2018-12-11 21:06:35.307379"
>> }
>> ---
>>
>> But my recovery state is still:
>>
>> ---
>> "recovery_state": [
>>     {
>>         "name": "Started/Primary/Peering/Incomplete",
>>         "enter_time": "2018-12-11 21:06:35.320292",
>>         "comment": "not enough complete instances of this PG"
>>     },
>> ---
>>
>> Any ideas?
>>
>> Thanks!
>> D
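As mentioned above, if you go the 4 + 2 route it means a new pool, since the EC profile of an existing pool can't be changed. A rough sketch of the steps; the profile name, pool name and PG count below are only examples, size them for your cluster:

---
# new profile with two coding chunks
ceph osd erasure-code-profile set ec-42-profile \
    k=4 m=2 crush-failure-domain=osd plugin=jerasure technique=reed_sol_van

# new pool using that profile
ceph osd pool create media-ec42 256 256 erasure ec-42-profile

# allow_ec_overwrites is needed if the EC pool holds CephFS / RBD data
ceph osd pool set media-ec42 allow_ec_overwrites true

# then copy the data across (e.g. add the new pool as a CephFS data pool
# and move the files into it) and retire the old 4+1 pool once it is empty.
---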