On Fri, 8 Feb 2019 at 11:31, Scheurer François <francois.scheu...@everyware.ch> wrote:
> Dear Eugen Block
> Dear Alan Johnson
>
> Thank you for your answers.
>
> So we will use EC 3+2 on 6 nodes.
> Currently with only 4 OSDs per node, then 8 and later 20.
>
> > Just to add, that a more general formula is that the number of nodes
> > should be greater than or equal to k+m+m so N>=k+m+m for full recovery
>
> Understood.
> EC k+m assumes the case of losing m nodes and that would require m
> 'spare' nodes to recover, so k+m+m in total.
> But the loss of a single node should allow a full recovery, shouldn't it?
>
> Having 3+2 on 6 nodes should be able to:
> - survive the loss of max 2 nodes simultaneously

Yes and no. Technically you can survive a 2-node failure, but EC requires K+1 nodes to allow writes, so all I/O freezes when you lose the second node, until all affected PGs are recovered to at least K+1. So yes, you survive, but no, you can't use the cluster for a while during this. If you want to keep using your cluster at all times, you can only tolerate 1 node failure.

> - survive the loss of max 3 nodes, if the recovery has enough time to
>   complete between failures

I think this kind of scenario shouldn't even be considered.

> - recover the loss of max 1 node

Only if there's enough free disk space left to hold all the data.

Kind regards,
Caspar

> > If the pools are empty I also wouldn't expect that, is restarting one OSD
> > also that slow or is it just when you reboot the whole cluster?
>
> It also happens after rebooting a single node.
>
> In the mon logs we see a lot of such messages:
>
> 2019-02-06 23:07:46.003473 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is reporting failure:1
> 2019-02-06 23:07:46.003486 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 reported failed by osd.1 10.38.67.72:6800/75206
> 2019-02-06 23:07:57.948959 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is reporting failure:0
> 2019-02-06 23:07:57.948971 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 failure report canceled by osd.1 10.38.67.72:6800/75206
> 2019-02-06 23:08:54.632356 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.0 10.38.65.72:6800/72872 from osd.17 10.38.66.71:6803/76983 is reporting failure:1
> 2019-02-06 23:08:54.632374 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.0 10.38.65.72:6800/72872 reported failed by osd.17 10.38.66.71:6803/76983
> 2019-02-06 23:10:21.333513 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is reporting failure:1
> 2019-02-06 23:10:21.333527 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 reported failed by osd.18 10.38.67.72:6806/79121
> 2019-02-06 23:10:57.660468 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is reporting failure:0
> 2019-02-06 23:10:57.660481 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 failure report canceled by osd.18 10.38.67.72:6806/79121
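
To see at a glance which OSDs are being reported and then un-reported, the cluster-log lines quoted above can be tallied per OSD. The following is only a rough sketch: the regular expression is inferred from the sample lines in this thread rather than from any documented log format, and `REPORT_RE`, `tally_failure_reports` and `SAMPLE` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumption: the field layout is inferred from the "[DBG]" lines quoted in
# this thread only; other Ceph releases may format these messages differently.
REPORT_RE = re.compile(
    r"(?P<target>osd\.\d+) \S+ "
    r"(?P<event>reported failed|failure report canceled) "
    r"by (?P<reporter>osd\.\d+)"
)


def tally_failure_reports(lines):
    """Count failure reports and cancellations per reported (target) OSD."""
    reported = Counter()
    canceled = Counter()
    for line in lines:
        match = REPORT_RE.search(line)
        if match is None:
            continue  # skips the paired "prepare_failure" lines
        if match["event"] == "reported failed":
            reported[match["target"]] += 1
        else:
            canceled[match["target"]] += 1
    return reported, canceled


# Lines copied from the excerpt above (one prepare_failure line, five DBG lines).
SAMPLE = [
    "2019-02-06 23:07:46.003473 7f14d8ed6700 1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is reporting failure:1",
    "2019-02-06 23:07:46.003486 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 reported failed by osd.1 10.38.67.72:6800/75206",
    "2019-02-06 23:07:57.948971 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 failure report canceled by osd.1 10.38.67.72:6800/75206",
    "2019-02-06 23:08:54.632374 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.0 10.38.65.72:6800/72872 reported failed by osd.17 10.38.66.71:6803/76983",
    "2019-02-06 23:10:21.333527 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 reported failed by osd.18 10.38.67.72:6806/79121",
    "2019-02-06 23:10:57.660481 7f14d8ed6700 0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 failure report canceled by osd.18 10.38.67.72:6806/79121",
]

reported, canceled = tally_failure_reports(SAMPLE)
for osd in sorted(reported):
    # A report without a matching cancellation is still outstanding in the excerpt.
    outstanding = reported[osd] - canceled[osd]
    print(f"{osd}: reported={reported[osd]} canceled={canceled[osd]} outstanding={outstanding}")
```

Reports that are canceled again within seconds, as for osd.17 and osd.23 here, suggest the OSDs are reachable again shortly after being reported, which may help narrow down whether the slow peering is related to heartbeats during the reboot.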
>
> Best Regards
> Francois Scheurer
>
> ________________________________________
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Alan Johnson <al...@supermicro.com>
> Sent: Thursday, February 7, 2019 8:11 PM
> To: Eugen Block; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] best practices for EC pools
>
> Just to add, that a more general formula is that the number of nodes
> should be greater than or equal to k+m+m so N>=k+m+m for full recovery
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen Block
> Sent: Thursday, February 7, 2019 8:47 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] best practices for EC pools
>
> Hi Francois,
>
> > Is that correct that recovery will be forbidden by the crush rule if a
> > node is down?
>
> Yes, that is correct. failure-domain=host means no two chunks of the same
> PG can be on the same host. So if your PG is divided into 6 chunks, they're
> all on different hosts, and no recovery is possible at this point (for the
> EC pool).
>
> > After rebooting all nodes we noticed that the recovery was slow, maybe
> > half an hour, but all pools are currently empty (new install).
> > This is odd...
>
> If the pools are empty I also wouldn't expect that. Is restarting one OSD
> also that slow, or is it just when you reboot the whole cluster?
>
> > Which k&m values are preferred on 6 nodes?
>
> It depends on the failures you expect and how many concurrent failures you
> need to cover.
> I think I would keep failure-domain=host (with only 4 OSDs per host).
> As for the k and m values, 3+2 would make sense, I guess. That profile
> would leave one host for recovery and two OSDs of one PG acting set could
> fail without data loss, so as resilient as the 4+2 profile. This is one
> approach, so please don't read this as *the* solution for your environment.
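
The host-count arithmetic discussed in this thread (N>=k+m+m for full recovery, K+1 for writes, one chunk per host with failure-domain=host) can be written down as a small model. This is a minimal sketch under those stated rules only, not a description of how Ceph itself decides anything; `EcLayout` and its method names are hypothetical, and it looks at the worst case where every failed host held a chunk of the PG in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcLayout:
    k: int      # data chunks per PG
    m: int      # coding chunks per PG
    hosts: int  # hosts available, one chunk per host (failure-domain=host)

    def chunks_left(self, failed_hosts: int) -> int:
        """Worst case: each failed host held one chunk of the PG."""
        return max(self.k + self.m - failed_hosts, 0)

    def data_survives(self, failed_hosts: int) -> bool:
        """At least k chunks are needed to reconstruct the data."""
        return self.chunks_left(failed_hosts) >= self.k

    def writes_allowed(self, failed_hosts: int) -> bool:
        """Assumption taken from the thread: I/O needs K+1 chunks available."""
        return self.chunks_left(failed_hosts) >= self.k + 1

    def can_fully_recover(self, failed_hosts: int) -> bool:
        """All k+m chunks can be rebuilt only if k+m distinct hosts remain."""
        hosts_left = self.hosts - failed_hosts
        return self.data_survives(failed_hosts) and hosts_left >= self.k + self.m


for layout in (EcLayout(k=4, m=2, hosts=6), EcLayout(k=3, m=2, hosts=6)):
    for failed in (1, 2):
        print(
            f"{layout.k}+{layout.m} on {layout.hosts} hosts, {failed} down: "
            f"data_survives={layout.data_survives(failed)} "
            f"writes_allowed={layout.writes_allowed(failed)} "
            f"can_fully_recover={layout.can_fully_recover(failed)}"
        )
```

Under this model, 4+2 on 6 hosts cannot rebuild the missing chunk after a single host failure, while 3+2 on 6 hosts can, and with 3+2 a second simultaneous failure keeps the data but blocks writes. Free disk space is not modelled here.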
>
> Regards,
> Eugen
>
> Quoting Scheurer François <francois.scheu...@everyware.ch>:
>
> > Dear All
> >
> > We created an erasure coded pool with k=4 m=2 with failure-domain=host
> > but have only 6 osd nodes.
> > Is that correct that recovery will be forbidden by the crush rule if a
> > node is down?
> >
> > After rebooting all nodes we noticed that the recovery was slow, maybe
> > half an hour, but all pools are currently empty (new install).
> > This is odd...
> >
> > Can it be related to the k+m being equal to the number of nodes? (4+2=6)
> > step set_choose_tries 100 was already in the EC crush rule.
> >
> > rule ewos1-prod_cinder_ec {
> >     id 2
> >     type erasure
> >     min_size 3
> >     max_size 6
> >     step set_chooseleaf_tries 5
> >     step set_choose_tries 100
> >     step take default class nvme
> >     step chooseleaf indep 0 type host
> >     step emit
> > }
> >
> > ceph osd erasure-code-profile set ec42 k=4 m=2 crush-root=default crush-failure-domain=host crush-device-class=nvme
> > ceph osd pool create ewos1-prod_cinder_ec 256 256 erasure ec42
> >
> > ceph version 12.2.10-543-gfc6f0c7299 (fc6f0c7299e3442e8a0ab83260849a6249ce7b5f) luminous (stable)
> >
> >   cluster:
> >     id:     b5e30221-a214-353c-b66b-8c37b4349123
> >     health: HEALTH_WARN
> >             noout flag(s) set
> >             Reduced data availability: 125 pgs inactive, 32 pgs peering
> >
> >   services:
> >     mon: 3 daemons, quorum ewos1-osd1-prod,ewos1-osd3-prod,ewos1-osd5-prod
> >     mgr: ewos1-osd5-prod(active), standbys: ewos1-osd3-prod, ewos1-osd1-prod
> >     osd: 24 osds: 24 up, 24 in
> >          flags noout
> >
> >   data:
> >     pools:   4 pools, 1600 pgs
> >     objects: 0 objects, 0B
> >     usage:   24.3GiB used, 43.6TiB / 43.7TiB avail
> >     pgs:     7.812% pgs not active
> >              1475 active+clean
> >              93   activating
> >              32   peering
> >
> > Which k&m values are preferred on 6 nodes?
> > BTW, we plan to use this EC pool as a second rbd pool in Openstack,
> > with the main first rbd pool being replicated size=3; it is nvme ssd
> > only.
> >
> > Thanks for your help!
> >
> > Best Regards
> > Francois Scheurer
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com