Hi,

Thanks to your answers I now understand this part of Ceph better. I made the change to the crushmap that Maxime suggested, and after that the results are what I expected from the beginning:
# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0 7.27100 1.00000  7445G  1830G  5614G  24.59 0.98 238
 3 7.27100 1.00000  7445G  1700G  5744G  22.84 0.91 229
 4 7.27100 1.00000  7445G  1731G  5713G  23.26 0.93 233
 1 1.81299 1.00000  1856G   661G  1195G  35.63 1.43  87
 5 1.81299 1.00000  1856G   544G  1311G  29.34 1.17  73
 6 1.81299 1.00000  1856G   519G  1337G  27.98 1.12  71
 2 2.72198 1.00000  2787G   766G  2021G  27.50 1.10 116
 7 2.72198 1.00000  2787G   651G  2136G  23.36 0.93 103
 8 2.72198 1.00000  2787G   661G  2126G  23.72 0.95  98
              TOTAL 36267G  9067G 27200G  25.00
MIN/MAX VAR: 0.91/1.43  STDDEV: 4.20
#

(A quick sketch of the expected weight-proportional split is appended at the end of this mail.)

I understand that the Ceph default of "type host" is safer than "type osd", but as I said before, this cluster is for testing purposes only.

Thanks for all your answers :)

2017-06-06 9:20 GMT+02:00 Maxime Guyot <max...@root314.com>:

> Hi Félix,
>
> Changing the failure domain to OSD is probably the easiest option if this
> is a test cluster. I think the commands would go like:
> - ceph osd getcrushmap -o map.bin
> - crushtool -d map.bin -o map.txt
> - sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
> - crushtool -c map.txt -o map.bin
> - ceph osd setcrushmap -i map.bin
>
> Moving HDDs around so that each server holds ~8TB would be a good option if this
> is a capacity-focused use case. It will allow you to reboot 1 server at a time
> without radosgw downtime. You would target 26/3 = 8.66TB per node, so:
> - node1: 1x8TB
> - node2: 1x8TB + 1x2TB
> - node3: 2x6TB + 1x2TB
>
> If you are more concerned about performance, then set the weights to 1 on
> all HDDs and forget about the wasted capacity.
>
> Cheers,
> Maxime
>
>
> On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <christian.wuer...@gmail.com>
> wrote:
>
>> Yet another option is to change the failure domain to OSD instead of host
>> (this avoids having to move disks around and will probably meet your initial
>> expectations).
>> It means your cluster will become unavailable when you lose a host until
>> you fix it, though. OTOH you probably don't have too much leeway anyway with
>> just 3 hosts, so it might be an acceptable trade-off. It also means you can
>> just add new OSDs to the servers wherever they fit.
>>
>> On Tue, Jun 6, 2017 at 1:51 AM, David Turner <drakonst...@gmail.com>
>> wrote:
>>
>>> If you want to resolve your issue without purchasing another node, you
>>> should move one disk of each size into each server. This process will be
>>> quite painful, as you'll need to actually move the disks in the crush map to
>>> be under a different host and then all of your data will move around, but
>>> then CRUSH will be able to use the weights and distribute the data between
>>> the 2TB, 3TB, and 8TB drives much more evenly.
>>>
>>> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <l...@dachary.org> wrote:
>>>
>>>>
>>>>
>>>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> We have a small cluster for radosgw use only. It has three nodes, with 3
>>>> >                                                        ^^^^^ ^^^^^
>>>> >> osds each. Each node has different disk sizes:
>>>> >>
>>>> >
>>>> > There's your answer, staring you right in the face.
>>>> >
>>>> > Your default replication size is 3, your default failure domain is host.
>>>> >
>>>> > Ceph cannot distribute data according to the weight, since each copy needs
>>>> > to be on a different node (one replica per node) to comply with the replica size.
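A quick sketch of that constraint in plain Python (just arithmetic, not Ceph tooling; disk sizes taken from the nodes in this thread): with 3 replicas, a host failure domain and only 3 hosts, every object needs one copy on each host, so the smallest host caps the usable capacity.

# Rough sketch, not Ceph tooling: usable capacity with size=3 and a
# host failure domain, when the number of hosts equals the replica count.
hosts_tb = {"node01": 3 * 8, "node02": 3 * 2, "node03": 3 * 3}  # raw TB per host

replicas = 3
usable = min(hosts_tb.values())                    # each host must hold one copy
raw_consumed = usable * replicas                   # raw space actually used
stranded = sum(hosts_tb.values()) - raw_consumed   # raw space that can never fill

print(f"usable data: ~{usable} TB, raw consumed: ~{raw_consumed} TB, "
      f"stranded: ~{stranded} TB")
# -> usable data: ~6 TB, raw consumed: ~18 TB, stranded: ~21 TB

With an osd failure domain (or a fourth host), CRUSH can instead spread copies in proportion to the weights, which is the behaviour discussed below.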
>>>>
>>>> Another way to look at it is to imagine a situation where 10TB worth of data
>>>> is stored on node01, which has 8x3 = 24TB. Since you asked for 3 replicas, this
>>>> data must also be replicated to node02, but ... there is only 2x3 = 6TB
>>>> available. So the maximum you can store is 6TB, and the remaining disk space on
>>>> node01 and node03 will never be used.
>>>>
>>>> python-crush analyze will display a message about that situation and show
>>>> which buckets are overweighted.
>>>>
>>>> Cheers
>>>>
>>>> >
>>>> > If your cluster had 4 or more nodes, you'd see what you expected.
>>>> > And you most likely wouldn't be happy about the performance, with your 8TB HDDs
>>>> > seeing 4 times more I/O than the 2TB ones and thus becoming the
>>>> > bottleneck of your cluster.
>>>> >
>>>> > Christian
>>>> >
>>>> >> node01 : 3x8TB
>>>> >> node02 : 3x2TB
>>>> >> node03 : 3x3TB
>>>> >>
>>>> >> I thought that the weight handles the amount of data that every osd receives.
>>>> >> In this case, for example, the node with the 8TB disks should receive more
>>>> >> than the rest, right? All of them receive the same amount of data, and the
>>>> >> smaller disks (2TB) reach 100% before the bigger ones. Am I doing
>>>> >> something wrong?
>>>> >>
>>>> >> The cluster is jewel LTS 10.2.7.
>>>> >>
>>>> >> # ceph osd df
>>>> >> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>>>> >>  0 7.27060 1.00000  7445G  1012G  6432G 13.60 0.57 133
>>>> >>  3 7.27060 1.00000  7445G  1081G  6363G 14.52 0.61 163
>>>> >>  4 7.27060 1.00000  7445G   787G  6657G 10.58 0.44 120
>>>> >>  1 1.81310 1.00000  1856G  1047G   809G 56.41 2.37 143
>>>> >>  5 1.81310 1.00000  1856G   956G   899G 51.53 2.16 143
>>>> >>  6 1.81310 1.00000  1856G   877G   979G 47.24 1.98 130
>>>> >>  2 2.72229 1.00000  2787G  1010G  1776G 36.25 1.52 140
>>>> >>  7 2.72229 1.00000  2787G   831G  1955G 29.83 1.25 130
>>>> >>  8 2.72229 1.00000  2787G  1038G  1748G 37.27 1.56 146
>>>> >>               TOTAL 36267G  8643G 27624G 23.83
>>>> >> MIN/MAX VAR: 0.44/2.37  STDDEV: 18.60
>>>> >> #
>>>> >>
>>>> >> # ceph osd tree
>>>> >> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> >> -1 35.41795 root default
>>>> >> -2 21.81180     host node01
>>>> >>  0  7.27060         osd.0        up  1.00000          1.00000
>>>> >>  3  7.27060         osd.3        up  1.00000          1.00000
>>>> >>  4  7.27060         osd.4        up  1.00000          1.00000
>>>> >> -3  5.43929     host node02
>>>> >>  1  1.81310         osd.1        up  1.00000          1.00000
>>>> >>  5  1.81310         osd.5        up  1.00000          1.00000
>>>> >>  6  1.81310         osd.6        up  1.00000          1.00000
>>>> >> -4  8.16687     host node03
>>>> >>  2  2.72229         osd.2        up  1.00000          1.00000
>>>> >>  7  2.72229         osd.7        up  1.00000          1.00000
>>>> >>  8  2.72229         osd.8        up  1.00000          1.00000
>>>> >> #
>>>> >>
>>>> >> # ceph -s
>>>> >>     cluster 49ba9695-7199-4c21-9199-ac321e60065e
>>>> >>      health HEALTH_OK
>>>> >>      monmap e1: 3 mons at {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>>>> >>             election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>>>> >>      osdmap e265: 9 osds: 9 up, 9 in
>>>> >>             flags sortbitwise,require_jewel_osds
>>>> >>       pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>>>> >>             8643 GB used, 27624 GB / 36267 GB avail
>>>> >>                  416 active+clean
>>>> >> #
>>>> >>
>>>> >> # ceph osd pool ls
>>>> >> .rgw.root
>>>> >> default.rgw.control
>>>> >> default.rgw.data.root
>>>> >> default.rgw.gc
>>>> >> default.rgw.log
>>>> >> default.rgw.users.uid
>>>> >> default.rgw.users.keys
>>>> >> default.rgw.buckets.index
>>>> >> default.rgw.buckets.non-ec
>>>> >> default.rgw.buckets.data
>>>> >> default.rgw.users.email
>>>> >> #
>>>> >>
>>>> >> # ceph df
>>>> >> GLOBAL:
>>>> >>     SIZE    AVAIL   RAW USED  %RAW USED
>>>> >>     36267G  27624G  8643G     23.83
>>>> >> POOLS:
>>>> >>     NAME                        ID  USED   %USED  MAX AVAIL  OBJECTS
>>>> >>     .rgw.root                    1  1588       0      5269G        4
>>>> >>     default.rgw.control          2     0       0      5269G        8
>>>> >>     default.rgw.data.root        3  8761       0      5269G       28
>>>> >>     default.rgw.gc               4     0       0      5269G       32
>>>> >>     default.rgw.log              5     0       0      5269G      127
>>>> >>     default.rgw.users.uid        6  4887       0      5269G       28
>>>> >>     default.rgw.users.keys       7   144       0      5269G       16
>>>> >>     default.rgw.buckets.index    9     0       0      5269G       14
>>>> >>     default.rgw.buckets.non-ec  10     0       0      5269G        3
>>>> >>     default.rgw.buckets.data    11  2879G  35.34      5269G   746848
>>>> >>     default.rgw.users.email     12    13       0      5269G        1
>>>> >> #
>>>> >>
>>>> >
>>>> >
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre

--
Félix Barbeira.
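A rough sketch of the weight-proportional split one would expect once the failure domain is osd (plain Python arithmetic, not Ceph tooling; weights approximated from the "ceph osd tree" output above). Per-PG placement is pseudo-random, so individual OSDs, especially the small ones holding few PGs, can still deviate noticeably from these figures, which is roughly what the VAR column of "ceph osd df" reflects.

# Rough sketch, not Ceph tooling: with an osd failure domain, each OSD's
# expected share of the data is proportional to its CRUSH weight.
weights = {
    "osd.0": 7.271, "osd.3": 7.271, "osd.4": 7.271,  # node01, 8TB disks
    "osd.1": 1.813, "osd.5": 1.813, "osd.6": 1.813,  # node02, 2TB disks
    "osd.2": 2.722, "osd.7": 2.722, "osd.8": 2.722,  # node03, 3TB disks
}
total_weight = sum(weights.values())
data_tb = 9.0  # roughly the ~9067 GB used in the "ceph osd df" output above

for osd, w in sorted(weights.items()):
    share = w / total_weight
    print(f"{osd}: weight {w:.3f} -> expected ~{share:.1%} of data (~{share * data_tb:.2f} TB)")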
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com