Each of my 3 clients, one per building, is picking the same primary OSD, and reads are slow on at least two of them. Instead of reading from their local OSD they read mostly from the primary.
*What I need is something like primary-affinity for each client connection*

ID  CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
-1        0.08189 root default
-3        0.02730     host vm1
 0  hdd   0.02730         osd.0 up     1.00000  1.00000
-10       0.02730     host vm2
 1  hdd   0.02730         osd.1 up     1.00000  0.50000
-5        0.02730     host vm3
 2  hdd   0.02730         osd.2 up     1.00000  0.50000

v

On Tue, Nov 13, 2018 at 4:25 PM Jean-Charles Lopez <jelo...@redhat.com> wrote:

> Hi Vlad,
>
> No need for a specific CRUSH map configuration. I'd suggest you use the
> primary-affinity setting on the OSD so that only the OSDs that are close to
> your read point are selected as primary.
>
> See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information.
>
> Just set the primary affinity of all the OSDs in building 2 to 0.
>
> Only the OSDs in building 1 should then be used as primary OSDs.
>
> BR
> JC
>
> On Nov 13, 2018, at 12:19, Vlad Kopylov <vladk...@gmail.com> wrote:
>
> Or is it possible to mount one OSD directly for read file access?
>
> v
>
> On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov <vladk...@gmail.com> wrote:
>
>> Maybe it is possible if done via a gateway NFS export?
>> Do the gateway settings allow read OSD selection?
>>
>> v
>>
>> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <martin.ver...@croit.io>
>> wrote:
>>
>>> Hello Vlad,
>>>
>>> if you want to read from the same data, then it is not possible (as far
>>> as I know).
>>>
>>> --
>>> Martin Verges
>>> Managing director
>>>
>>> Mobile: +49 174 9335695
>>> E-Mail: martin.ver...@croit.io
>>> Chat: https://t.me/MartinVerges
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>>
>>> Web: https://croit.io
>>> YouTube: https://goo.gl/PGE1Bx
>>>
>>> On Sat., Nov 10, 2018, 03:47 Vlad Kopylov <vladk...@gmail.com>
>>> wrote:
>>>
>>>> Maybe I missed something, but the FS explicitly selects the pools to put
>>>> files and metadata in, like I did below.
>>>> So if I create new pools, the data in them will be different. If I apply
>>>> the rule dc1_primary to the cfs_data pool, and a client from dc3 connects
>>>> to fs t01, it will start using dc1 hosts.
>>>>
>>>> ceph osd pool create cfs_data 100
>>>> ceph osd pool create cfs_meta 100
>>>> ceph fs new t01 cfs_data cfs_meta
>>>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o name=admin,secretfile=/home/mciadmin/admin.secret
>>>>
>>>> rule dc1_primary {
>>>>     id 1
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take dc1
>>>>     step chooseleaf firstn 1 type host
>>>>     step emit
>>>>     step take dc2
>>>>     step chooseleaf firstn -2 type host
>>>>     step emit
>>>>     step take dc3
>>>>     step chooseleaf firstn -2 type host
>>>>     step emit
>>>> }
>>>>
>>>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <vladk...@gmail.com> wrote:
>>>>
>>>>> Just to confirm - will it still place 3 copies, one in each datacenter?
>>>>> I thought this map was to select where to write to; I guess it does the
>>>>> write replication on the back end.
>>>>>
>>>>> I also thought pools are completely separate and clients would not see
>>>>> each other's data?
>>>>>
>>>>> Thank you Martin!
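For reference, loading a rule like dc1_primary above follows the standard CRUSH edit cycle. This is only a sketch: it assumes a running test cluster, the rule id/name and the cfs_data pool from this thread, and the Luminous-or-later `crush_rule` pool option.

```shell
# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... add the dc1_primary rule to crushmap.txt by hand ...

# Recompile and dry-run the rule before injecting it
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --show-mappings --rule 1 --num-rep 3

# Inject the new map and point the data pool at the rule
ceph osd setcrushmap -i crushmap.new
ceph osd pool set cfs_data crush_rule dc1_primary
```

The `--test --show-mappings` step prints the OSD sets the rule would pick, so a mistake (e.g. too few hosts under dc1) shows up before the map goes live.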
>>>>>
>>>>> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <martin.ver...@croit.io>
>>>>> wrote:
>>>>>
>>>>>> Hello Vlad,
>>>>>>
>>>>>> you can generate something like this:
>>>>>>
>>>>>> rule dc1_primary_dc2_secondary {
>>>>>>     id 1
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take dc1
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take dc2
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take dc3
>>>>>>     step chooseleaf firstn -2 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> rule dc2_primary_dc1_secondary {
>>>>>>     id 2
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take dc2
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take dc1
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take dc3
>>>>>>     step chooseleaf firstn -2 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> After you have added such crush rules, you can configure the pools:
>>>>>>
>>>>>> ~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
>>>>>> ~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
>>>>>>
>>>>>> Now you place your workload from dc1 on the dc1 pool, and the workload
>>>>>> from dc2 on the dc2 pool. You could also use HDDs with SSD journals (if
>>>>>> your workload isn't that write intensive) and save some money in dc3,
>>>>>> as your clients would always read from an SSD and write to hybrid storage.
>>>>>>
>>>>>> Btw. all this could be done with a few simple clicks through our web
>>>>>> frontend. Even if you want to export it via CephFS / NFS / ... it is
>>>>>> possible to set it on a per-folder level. Feel free to take a look at
>>>>>> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could be.
>>>>>>
>>>>>> --
>>>>>> Martin Verges
>>>>>> Managing director
>>>>>>
>>>>>> Mobile: +49 174 9335695
>>>>>> E-Mail: martin.ver...@croit.io
>>>>>> Chat: https://t.me/MartinVerges
>>>>>>
>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>
>>>>>> Web: https://croit.io
>>>>>> YouTube: https://goo.gl/PGE1Bx
>>>>>>
>>>>>> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov <vladk...@gmail.com>:
>>>>>> > Please disregard the pg status; one of the test VMs was down for some
>>>>>> > time and it is healing.
>>>>>> > The question is only how to make it read from the proper datacenter,
>>>>>> > if you have an example.
>>>>>> >
>>>>>> > Thanks
>>>>>> >
>>>>>> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <vladk...@gmail.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> Martin, thank you for the tip.
>>>>>> >> Googling ceph crush rule examples doesn't give much on rules, just
>>>>>> >> static placement of buckets.
>>>>>> >> This all seems to be about placing data, not about giving a client in
>>>>>> >> a specific datacenter the proper read OSD.
>>>>>> >>
>>>>>> >> Maybe something is wrong with the placement groups?
>>>>>> >>
>>>>>> >> I added datacenters dc1 dc2 dc3.
>>>>>> >> The current replicated_rule is:
>>>>>> >>
>>>>>> >> rule replicated_rule {
>>>>>> >>     id 0
>>>>>> >>     type replicated
>>>>>> >>     min_size 1
>>>>>> >>     max_size 10
>>>>>> >>     step take default
>>>>>> >>     step chooseleaf firstn 0 type host
>>>>>> >>     step emit
>>>>>> >> }
>>>>>> >>
>>>>>> >> # buckets
>>>>>> >> host ceph1 {
>>>>>> >>     id -3            # do not change unnecessarily
>>>>>> >>     id -2 class ssd  # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item osd.0 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc1 {
>>>>>> >>     id -9            # do not change unnecessarily
>>>>>> >>     id -4 class ssd  # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item ceph1 weight 1.000
>>>>>> >> }
>>>>>> >> host ceph2 {
>>>>>> >>     id -5            # do not change unnecessarily
>>>>>> >>     id -6 class ssd  # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item osd.1 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc2 {
>>>>>> >>     id -10           # do not change unnecessarily
>>>>>> >>     id -8 class ssd  # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item ceph2 weight 1.000
>>>>>> >> }
>>>>>> >> host ceph3 {
>>>>>> >>     id -7            # do not change unnecessarily
>>>>>> >>     id -12 class ssd # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item osd.2 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc3 {
>>>>>> >>     id -11           # do not change unnecessarily
>>>>>> >>     id -13 class ssd # do not change unnecessarily
>>>>>> >>     # weight 1.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item ceph3 weight 1.000
>>>>>> >> }
>>>>>> >> root default {
>>>>>> >>     id -1            # do not change unnecessarily
>>>>>> >>     id -14 class ssd # do not change unnecessarily
>>>>>> >>     # weight 3.000
>>>>>> >>     alg straw2
>>>>>> >>     hash 0           # rjenkins1
>>>>>> >>     item dc1 weight 1.000
>>>>>> >>     item dc2 weight 1.000
>>>>>> >>     item dc3 weight 1.000
>>>>>> >> }
>>>>>> >>
>>>>>> >> # ceph pg dump
>>>>>> >> dumped all
>>>>>> >> version 29433
>>>>>> >> stamp 2018-11-09 11:23:44.510872
>>>>>> >> last_osdmap_epoch 0
>>>>>> >> last_pg_scan 0
>>>>>> >> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
>>>>>> >> 1.5f 0 0 0 0 0 0 0 0 active+clean 2018-11-09 04:35:32.320607 0'0 544:1317 [0,2,1] 0 [0,2,1] 0 0'0 2018-11-09 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
>>>>>> >> 2.5c 143 0 143 0 0 19490267 461 461 active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1] 2 [2,1] 2 290'380 2018-11-07 18:58:43.043719 64'120 2018-11-05 14:21:49.256324 0
>>>>>> >> .....
>>>>>> >> sum 15239 0 2053 2659 0 2157615019 58286 58286
>>>>>> >> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
>>>>>> >> 2        3.7 GiB 28 GiB 32 GiB [0,1]    200    73
>>>>>> >> 1        3.7 GiB 28 GiB 32 GiB [0,2]    200    58
>>>>>> >> 0        3.7 GiB 28 GiB 32 GiB [1,2]    173    69
>>>>>> >> sum      11 GiB  85 GiB 96 GiB
>>>>>> >>
>>>>>> >> # ceph pg map 2.5c
>>>>>> >> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>>>>>> >>
>>>>>> >> # ceph pg map 1.5f
>>>>>> >> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>>>>>> >>
>>>>>> >> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <martin.ver...@croit.io>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Hello Vlad,
>>>>>> >>>
>>>>>> >>> Ceph clients connect to the primary OSD of each PG. If you create a
>>>>>> >>> crush rule for building1 and one for building2 that takes an OSD from
>>>>>> >>> the same building as the first one, your reads from the pool will
>>>>>> >>> always go to the same building (if the cluster is healthy) and only
>>>>>> >>> write requests get replicated to the other building.
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Martin Verges
>>>>>> >>> Managing director
>>>>>> >>>
>>>>>> >>> Mobile: +49 174 9335695
>>>>>> >>> E-Mail: martin.ver...@croit.io
>>>>>> >>> Chat: https://t.me/MartinVerges
>>>>>> >>>
>>>>>> >>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>> >>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>> >>> Com. register: Amtsgericht Munich HRB 231263
>>>>>> >>>
>>>>>> >>> Web: https://croit.io
>>>>>> >>> YouTube: https://goo.gl/PGE1Bx
>>>>>> >>>
>>>>>> >>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <vladk...@gmail.com>:
>>>>>> >>> > I am trying to test replicated ceph with servers in different
>>>>>> >>> > buildings, and I have a read problem.
>>>>>> >>> > Reads from one building go to an OSD in another building and vice
>>>>>> >>> > versa, making reads slower than writes! Reads become as slow as the
>>>>>> >>> > slowest node.
>>>>>> >>> >
>>>>>> >>> > Is there a way to
>>>>>> >>> > - disable parallel reads (so a client reads only from the same OSD
>>>>>> >>> >   node where the mon is);
>>>>>> >>> > - or give each client a read restriction per OSD;
>>>>>> >>> > - or strictly specify the read OSD on mount;
>>>>>> >>> > - or have a node read-delay cap (for example, if a node's latency is
>>>>>> >>> >   larger than 2 ms, do not use that node for reads while other
>>>>>> >>> >   replicas are available);
>>>>>> >>> > - or the ability to place clients on the CRUSH map, so it understands
>>>>>> >>> >   that an OSD in the same datacenter as the client has preference,
>>>>>> >>> >   and pulls data from it/them.
>>>>>> >>> >
>>>>>> >>> > Mounting with the kernel client, latest Mimic.
>>>>>> >>> >
>>>>>> >>> > Thank you!
>>>>>> >>> >
>>>>>> >>> > Vlad
>>>>>> >>> >
>>>>>> >>> > _______________________________________________
>>>>>> >>> > ceph-users mailing list
>>>>>> >>> > ceph-users@lists.ceph.com
>>>>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
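JC's primary-affinity suggestion boils down to a couple of commands. This is a sketch: the OSD ids stand in for whichever OSDs sit in the remote buildings (here osd.1 and osd.2, matching the tree at the top of the thread), and it assumes a running cluster.

```shell
# Never pick the remote buildings' OSDs as primary
ceph osd primary-affinity osd.1 0
ceph osd primary-affinity osd.2 0

# Verify: the PRI-AFF column should now show 0 for osd.1 and osd.2
ceph osd tree

# Check where a given pg's primary now lives
# (the first osd in the acting set is the primary)
ceph pg map 2.5c
```

Note that primary affinity is a cluster-wide property of the OSD, not of the client connection, which is exactly why it only helps the clients in one building at a time.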