Hi,

Thanks to your answers I now understand this part of Ceph better. I made the change to the crushmap that Maxime suggested, and after that the results are what I expected from the beginning:
# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0 7.27100 1.00000  7445G  1830G  5614G  24.59 0.98 238
 3 7.27100 1.00000  7445G  1700G  5744G  22.84 0.91 229
 4 7.27100 1.00000  7445G  1731G  5713G  23.26 0.93 233
 1 1.81299 1.00000  1856G   661G  1195G  35.63 1.43  87
 5 1.81299 1.00000  1856G   544G  1311G  29.34 1.17  73
 6 1.81299 1.00000  1856G   519G  1337G  27.98 1.12  71
 2 2.72198 1.00000  2787G   766G  2021G  27.50 1.10 116
 7 2.72198 1.00000  2787G   651G  2136G  23.36 0.93 103
 8 2.72198 1.00000  2787G   661G  2126G  23.72 0.95  98
              TOTAL 36267G  9067G 27200G  25.00
MIN/MAX VAR: 0.91/1.43  STDDEV: 4.20
#

(A quick sketch of the expected weight-proportional split is appended at the end of this mail.)

I understand that the Ceph default of "type host" is safer than "type osd", but as I said before, this cluster is for testing purposes only.

Thanks for all your answers :)

2017-06-06 9:20 GMT+02:00 Maxime Guyot <max...@root314.com>:

> Hi Félix,
>
> Changing the failure domain to OSD is probably the easiest option if this
> is a test cluster. I think the commands would go like:
> - ceph osd getcrushmap -o map.bin
> - crushtool -d map.bin -o map.txt
> - sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
> - crushtool -c map.txt -o map.bin
> - ceph osd setcrushmap -i map.bin
>
> Moving HDDs around so that each server holds ~8TB would be a good option if this
> is a capacity-focused use case. It will allow you to reboot 1 server at a time
> without radosgw downtime. You would target 26/3 = 8.66TB per node, so:
> - node1: 1x8TB
> - node2: 1x8TB + 1x2TB
> - node3: 2x6TB + 1x2TB
>
> If you are more concerned about performance, then set the weights to 1 on
> all HDDs and forget about the wasted capacity.
>
> Cheers,
> Maxime
>
>
> On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <christian.wuer...@gmail.com>
> wrote:
>
>> Yet another option is to change the failure domain to OSD instead of host
>> (this avoids having to move disks around and will probably meet your initial
>> expectations).
>> It means your cluster will become unavailable when you lose a host until
>> you fix it, though. OTOH you probably don't have too much leeway anyway with
>> just 3 hosts, so it might be an acceptable trade-off. It also means you can
>> just add new OSDs to the servers wherever they fit.
>>
>> On Tue, Jun 6, 2017 at 1:51 AM, David Turner <drakonst...@gmail.com>
>> wrote:
>>
>>> If you want to resolve your issue without purchasing another node, you
>>> should move one disk of each size into each server. This process will be
>>> quite painful, as you'll need to actually move the disks in the crush map to
>>> be under a different host and then all of your data will move around, but
>>> then CRUSH will be able to use the weights and distribute the data between
>>> the 2TB, 3TB, and 8TB drives much more evenly.
>>>
>>> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <l...@dachary.org> wrote:
>>>
>>>>
>>>>
>>>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> We have a small cluster for radosgw use only. It has three nodes, with 3
>>>> >                                                        ^^^^^ ^^^^^
>>>> >> osds each. Each node has different disk sizes:
>>>> >>
>>>> >
>>>> > There's your answer, staring you right in the face.
>>>> >
>>>> > Your default replication size is 3, your default failure domain is host.
>>>> >
>>>> > Ceph cannot distribute data according to the weight, since each copy needs
>>>> > to be on a different node (one replica per node) to comply with the replica size.
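A quick sketch of that constraint in plain Python (just arithmetic, not Ceph tooling; disk sizes taken from the nodes in this thread): with 3 replicas, a host failure domain and only 3 hosts, every object needs one copy on each host, so the smallest host caps the usable capacity.

# Rough sketch, not Ceph tooling: usable capacity with size=3 and a
# host failure domain, when the number of hosts equals the replica count.
hosts_tb = {"node01": 3 * 8, "node02": 3 * 2, "node03": 3 * 3}  # raw TB per host

replicas = 3
usable = min(hosts_tb.values())                    # each host must hold one copy
raw_consumed = usable * replicas                   # raw space actually used
stranded = sum(hosts_tb.values()) - raw_consumed   # raw space that can never fill

print(f"usable data: ~{usable} TB, raw consumed: ~{raw_consumed} TB, "
      f"stranded: ~{stranded} TB")
# -> usable data: ~6 TB, raw consumed: ~18 TB, stranded: ~21 TB

With an osd failure domain (or a fourth host), CRUSH can instead spread copies in proportion to the weights, which is the behaviour discussed below.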
>>>>
>>>> Another way to look at it is to imagine a situation where 10TB worth of data
>>>> is stored on node01, which has 8x3 = 24TB. Since you asked for 3 replicas, this
>>>> data must also be replicated to node02, but ... there is only 2x3 = 6TB
>>>> available. So the maximum you can store is 6TB, and the remaining disk space on
>>>> node01 and node03 will never be used.
>>>>
>>>> python-crush analyze will display a message about that situation and show
>>>> which buckets are overweighted.
>>>>
>>>> Cheers
>>>>
>>>> >
>>>> > If your cluster had 4 or more nodes, you'd see what you expected.
>>>> > And you most likely wouldn't be happy about the performance, with your 8TB HDDs
>>>> > seeing 4 times more I/O than the 2TB ones and thus becoming the
>>>> > bottleneck of your cluster.
>>>> >
>>>> > Christian
>>>> >
>>>> >> node01 : 3x8TB
>>>> >> node02 : 3x2TB
>>>> >> node03 : 3x3TB
>>>> >>
>>>> >> I thought that the weight handles the amount of data that every osd receives.
>>>> >> In this case, for example, the node with the 8TB disks should receive more
>>>> >> than the rest, right? All of them receive the same amount of data, and the
>>>> >> smaller disks (2TB) reach 100% before the bigger ones. Am I doing
>>>> >> something wrong?
>>>> >>
>>>> >> The cluster is jewel LTS 10.2.7.
>>>> >>
>>>> >> # ceph osd df
>>>> >> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>>>> >>  0 7.27060 1.00000  7445G  1012G  6432G 13.60 0.57 133
>>>> >>  3 7.27060 1.00000  7445G  1081G  6363G 14.52 0.61 163
>>>> >>  4 7.27060 1.00000  7445G   787G  6657G 10.58 0.44 120
>>>> >>  1 1.81310 1.00000  1856G  1047G   809G 56.41 2.37 143
>>>> >>  5 1.81310 1.00000  1856G   956G   899G 51.53 2.16 143
>>>> >>  6 1.81310 1.00000  1856G   877G   979G 47.24 1.98 130
>>>> >>  2 2.72229 1.00000  2787G  1010G  1776G 36.25 1.52 140
>>>> >>  7 2.72229 1.00000  2787G   831G  1955G 29.83 1.25 130
>>>> >>  8 2.72229 1.00000  2787G  1038G  1748G 37.27 1.56 146
>>>> >>               TOTAL 36267G  8643G 27624G 23.83
>>>> >> MIN/MAX VAR: 0.44/2.37  STDDEV: 18.60
>>>> >> #
>>>> >>
>>>> >> # ceph osd tree
>>>> >> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> >> -1 35.41795 root default
>>>> >> -2 21.81180     host node01
>>>> >>  0  7.27060         osd.0        up  1.00000          1.00000
>>>> >>  3  7.27060         osd.3        up  1.00000          1.00000
>>>> >>  4  7.27060         osd.4        up  1.00000          1.00000
>>>> >> -3  5.43929     host node02
>>>> >>  1  1.81310         osd.1        up  1.00000          1.00000
>>>> >>  5  1.81310         osd.5        up  1.00000          1.00000
>>>> >>  6  1.81310         osd.6        up  1.00000          1.00000
>>>> >> -4  8.16687     host node03
>>>> >>  2  2.72229         osd.2        up  1.00000          1.00000
>>>> >>  7  2.72229         osd.7        up  1.00000          1.00000
>>>> >>  8  2.72229         osd.8        up  1.00000          1.00000
>>>> >> #
>>>> >>
>>>> >> # ceph -s
>>>> >>     cluster 49ba9695-7199-4c21-9199-ac321e60065e
>>>> >>      health HEALTH_OK
>>>> >>      monmap e1: 3 mons at {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>>>> >>             election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>>>> >>      osdmap e265: 9 osds: 9 up, 9 in
>>>> >>             flags sortbitwise,require_jewel_osds
>>>> >>       pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>>>> >>             8643 GB used, 27624 GB / 36267 GB avail
>>>> >>                  416 active+clean
>>>> >> #
>>>> >>
>>>> >> # ceph osd pool ls
>>>> >> .rgw.root
>>>> >> default.rgw.control
>>>> >> default.rgw.data.root
>>>> >> default.rgw.gc
>>>> >> default.rgw.log
>>>> >> default.rgw.users.uid
>>>> >> default.rgw.users.keys
>>>> >> default.rgw.buckets.index
>>>> >> default.rgw.buckets.non-ec
>>>> >> default.rgw.buckets.data
>>>> >> default.rgw.users.email
>>>> >> #
>>>> >>
>>>> >> # ceph df
>>>> >> GLOBAL:
>>>> >>     SIZE    AVAIL   RAW USED  %RAW USED
>>>> >>     36267G  27624G  8643G     23.83
>>>> >> POOLS:
>>>> >>     NAME                        ID  USED   %USED  MAX AVAIL  OBJECTS
>>>> >>     .rgw.root                    1  1588       0      5269G        4
>>>> >>     default.rgw.control          2     0       0      5269G        8
>>>> >>     default.rgw.data.root        3  8761       0      5269G       28
>>>> >>     default.rgw.gc               4     0       0      5269G       32
>>>> >>     default.rgw.log              5     0       0      5269G      127
>>>> >>     default.rgw.users.uid        6  4887       0      5269G       28
>>>> >>     default.rgw.users.keys       7   144       0      5269G       16
>>>> >>     default.rgw.buckets.index    9     0       0      5269G       14
>>>> >>     default.rgw.buckets.non-ec  10     0       0      5269G        3
>>>> >>     default.rgw.buckets.data    11  2879G  35.34      5269G   746848
>>>> >>     default.rgw.users.email     12    13       0      5269G        1
>>>> >> #
>>>> >>
>>>> >
>>>> >
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre

--
Félix Barbeira.
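A rough sketch of the weight-proportional split one would expect once the failure domain is osd (plain Python arithmetic, not Ceph tooling; weights approximated from the "ceph osd tree" output above). Per-PG placement is pseudo-random, so individual OSDs, especially the small ones holding few PGs, can still deviate noticeably from these figures, which is roughly what the VAR column of "ceph osd df" reflects.

# Rough sketch, not Ceph tooling: with an osd failure domain, each OSD's
# expected share of the data is proportional to its CRUSH weight.
weights = {
    "osd.0": 7.271, "osd.3": 7.271, "osd.4": 7.271,  # node01, 8TB disks
    "osd.1": 1.813, "osd.5": 1.813, "osd.6": 1.813,  # node02, 2TB disks
    "osd.2": 2.722, "osd.7": 2.722, "osd.8": 2.722,  # node03, 3TB disks
}
total_weight = sum(weights.values())
data_tb = 9.0  # roughly the ~9067 GB used in the "ceph osd df" output above

for osd, w in sorted(weights.items()):
    share = w / total_weight
    print(f"{osd}: weight {w:.3f} -> expected ~{share:.1%} of data (~{share * data_tb:.2f} TB)")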
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com