On 05/05/17 21:32, Alejandro Comisario wrote:
> Thanks David!
> Anyone else? Any more thoughts?
>
> On Wed, May 3, 2017 at 3:38 PM, David Turner <drakonst...@gmail.com
> <mailto:drakonst...@gmail.com>> wrote:
>
>     Those are both things that people have done, and both work.
>     Neither is optimal, but both options work fine.  The best option
>     is definitely to just get a third node now, since you aren't going
>     to gain any usable space by adding it later: the usable space of a
>     2-node size-2 cluster and a 3-node size-3 cluster is identical.
>
>     If getting a third node is not possible, I would recommend a size
>     2, min_size 2 configuration.  You will block writes if either of
>     your nodes or any copy of your data is down, but you will not get
>     into the inconsistent state that min_size 1 can lead to (and you
>     can always set a pool's min_size to 1 on the fly to perform
>     maintenance).  If you instead go with the option of using a
>     failure domain of OSD rather than host, with size 3, then a single
>     node going down will still block writes to your cluster.  The only
>     thing you gain from that is having 3 physical copies of the data
>     until you get a third node, and you pay for it with a lot of
>     backfilling when you later change the CRUSH rule back to host.
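For reference, the size 2 / min_size 2 setup above is just a couple of
pool settings; a minimal sketch, with "mypool" as a placeholder pool
name:

    # Run size 2 / min_size 2 until the third node arrives
    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 2

    # Temporarily tolerate a single copy during planned maintenance,
    # then set it back afterwards
    ceph osd pool set mypool min_size 1
    # ... do the maintenance ...
    ceph osd pool set mypool min_size 2
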
>
>     A more complex option, which I think would be a better solution
>     than either of your 2 options, is to create 2 hosts in your CRUSH
>     map for each physical host and split each server's OSDs evenly
>     between them.  That way a given node can hold 2 copies of a piece
>     of data, but never all 3: you get your 3 copies with a guarantee
>     that they are never all on the same physical host.  Assuming a
>     min_size of 2, you will still block writes if you restart either
>     node.
>
Smart idea.
Or, if you have the space, size 4 with min_size 2, and then you can
still lose a node. You might think that costs more space, but in a way
it doesn't, once you count the free space you have to reserve for
recovery. With size 3 on the doubled hosts, if one physical node dies,
the survivor has to backfill until it holds 2 copies of everything, and
at that point it uses the same space as the size 4 pool would. If the
size 4 pool loses a node, it has nowhere to recover to, so it simply
stays at 2 copies, which is exactly where the size 3 pool would have
ended up after recovery. So it's as if it were pre-recovered. The
trade-off is that you'll probably see a bit more write latency with
this setup.
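In command form that's just the following (again a sketch with a
placeholder pool name):

    # 4 copies across 2 physical nodes means 2 copies per node, the
    # same footprint the size 3 pool would grow into after recovering
    # from a node failure
    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2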

>     If modifying the hosts in your crush map doesn't sound daunting,
>     then I would recommend going that route... For most people that is
>     more complexity than they'd like to take on, in which case size 2,
>     min_size 2 is the way to go until you get a third node.
>      #my2cents
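
To make the split-host idea a bit more concrete, here is a rough sketch
of the commands (host names, OSD ids and weights are placeholders; keep
your real OSD weights, and set "osd crush update on start = false" in
ceph.conf so the OSDs don't move themselves back under their real
hostname on restart):

    # Two CRUSH "hosts" per physical server
    ceph osd crush add-bucket node1-a host
    ceph osd crush add-bucket node1-b host
    ceph osd crush move node1-a root=default
    ceph osd crush move node1-b root=default

    # Split that server's OSDs evenly between the two buckets
    ceph osd crush create-or-move osd.0 1.0 root=default host=node1-a
    ceph osd crush create-or-move osd.1 1.0 root=default host=node1-b
    # ... repeat for the remaining OSDs and for the second server

With 4 "hosts" in the map, a replicated rule with a host failure domain
can place 3 copies without ever putting them all on one physical
machine.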
>
>     On Wed, May 3, 2017 at 12:41 PM Maximiliano Venesio
>     <mass...@nubeliu.com <mailto:mass...@nubeliu.com>> wrote:
>
>         Guys hi.
>
>         I have a Jewel cluster composed of two storage servers, which
>         are configured in the crush map as different buckets to store
>         data.
>
>         I have to configure two new pools on this cluster, with the
>         certainty that I'll have to add more servers in the short term.
>
>         Taking into account that the recommended replication size for
>         every pool is 3, I'm thinking of two possible scenarios.
>
>         1) Set the replica size to 2 now, and in the future change the
>         replica size to 3 on a running pool.
>         Is that possible? Could I run into serious issues with the
>         rebalancing of the PGs when changing the pool size on the fly?
>
>         2) Set the replica size to 3 and change the ruleset to
>         replicate by OSD instead of HOST now, and in the future change
>         this rule in the ruleset to replicate by host again on a
>         running pool.
>         Is that possible? Could I run into serious issues with the
>         rebalancing of the PGs when changing the ruleset on a running
>         pool?
>
>         Which do you think is the best option ?
>
>
>         Thanks in advance.
>
>
>         Maximiliano Venesio
>         Chief Cloud Architect | NUBELIU
>         E-mail: massimo@nubeliu.com  Cell: +54 9 11 3770 1853
>         _
>         www.nubeliu.com <http://www.nubeliu.com>
>
>
>
>
>
>
> -- 
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejan...@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com <http://www.nubeliu.com/>
>
>
>


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
