It is inherently dangerous to accept client IO - particularly writes - while at only k chunks, just as it is dangerous to accept IO with a single replica in replicated mode. Recovery at k is not inherently dangerous, but the recovery logic was apparently originally written to require min_size rather than k. Looking at the PR, the actual code change is fairly small, ~30 lines, but it is a critical change and has several pages of testing code associated with it. It also requires explicitly setting "osd_allow_recovery_below_min_size", just in case. So clearly it is being treated with caution.
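
For anyone who ends up in the same situation before that lands: the existing (risky) knob is the pool's min_size itself, as mentioned below. A rough sketch, untested here and with "ecpool" as a placeholder pool name:

    # check the pool's current min_size (default is k+1, i.e. 7 for k=6,m=2)
    ceph osd pool get ecpool min_size
    # temporarily allow IO at exactly k chunks - same risk profile as 1 replica
    ceph osd pool set ecpool min_size 6
    # once recovery has finished, raise it back
    ceph osd pool set ecpool min_size 7

Once the Octopus change merges, I would expect recovery below min_size to be toggled with something along the lines of:

    ceph config set osd osd_allow_recovery_below_min_size true

but treat that last command as a guess until the PR is actually in a release.
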
On Wed, Jul 24, 2019 at 2:28 PM Jean-Philippe Méthot <jp.met...@planethoster.info> wrote:
>
> Thank you, that does make sense. I was completely unaware that min_size was k+1 and not k. Had I known that, I would have designed this pool differently.
>
> Regarding that feature for Octopus, I'm guessing it shouldn't be dangerous for data integrity to recover at less than min_size?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
> Le 24 juill. 2019 à 13:49, Nathan Fish <lordci...@gmail.com> a écrit :
>
> 2/3 monitors is enough to maintain quorum, as is any majority.
>
> However, EC pools have a default min_size of k+1 chunks.
> This can be adjusted to k, but that has its own dangers.
> I assume you are using failure domain = "host"?
> As you had k=6,m=2 and lost 2 failure domains, you had only k chunks left, resulting in all IO stopping.
>
> Currently, EC pools that have k chunks but fewer than min_size do not rebuild.
> This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619
>
> k=6,m=2 is therefore somewhat slim for a 10-host cluster.
> I do not currently use EC, as I have only 3 failure domains, so others here may know better than me, but I might have done k=6,m=3.
> That would allow rebuilding back to OK after 1 host failure, and remaining available in a WARN state with 2 hosts down.
> k=4,m=4 would be very safe, but potentially too expensive.
>
>
> On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot <jp.met...@planethoster.info> wrote:
>
> Hi,
>
> I'm running in production a Ceph cluster with 3 monitors and 10 OSD nodes. This cluster is used to host OpenStack VM RBD volumes. My pools are set to use a k=6,m=2 erasure code profile with a 3-copy metadata pool in front. The cluster runs well, but we recently had a short outage which triggered unexpected behaviour in the cluster.
>
> I've always been under the impression that Ceph would continue working properly even if nodes went down. I tested this several months ago with this configuration and it worked fine as long as only 2 nodes went down. However, this time, the first monitor as well as two OSD nodes went down. As a result, OpenStack VMs were able to mount their RBD volumes but unable to read from them, even after the cluster had recovered, with the following message: Reduced data availability: 599 pgs inactive, 599 pgs incomplete.
>
> I believe the cluster should have continued to work properly despite the outage, so what could have prevented it from functioning? Is it because there were only two monitors remaining? Or is it that reduced data availability message? In that case, is my erasure coding configuration fine for that number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com