Hi Vlad,

No need for a specific CRUSH map configuration. I’d suggest you use the 
primary-affinity setting on the OSDs so that only the OSDs that are close to 
your read point are selected as primary.

See https://ceph.com/geen-categorie/ceph-primary-affinity/ for more information.

Just set the primary affinity of all the OSDs in building 2 to 0.

Only the OSDs in building 1 should then be used as primary OSDs.
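
For example, assuming building 2 holds osd.3, osd.4, and osd.5 (hypothetical
IDs - adjust them to your cluster):

ceph osd primary-affinity osd.3 0
ceph osd primary-affinity osd.4 0
ceph osd primary-affinity osd.5 0

Primary affinity is a weight between 0 and 1; an OSD with affinity 0 is not
chosen as primary while a replica with a higher affinity is available. You
can check the result in the PRI-AFF column of 'ceph osd tree'.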

BR
JC

> On Nov 13, 2018, at 12:19, Vlad Kopylov <vladk...@gmail.com> wrote:
> 
> Or is it possible to mount one OSD directly for read file access?
> 
> v
> 
> On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov <vladk...@gmail.com> wrote:
> Maybe it is possible if done via an NFS gateway export?
> Do the gateway settings allow read osd selection?
> 
> v
> 
> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <martin.ver...@croit.io> wrote:
> Hello Vlad,
> 
> If you want to read from the same data, then it is not possible (as far as I 
> know).
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> 
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
> On Sat., Nov 10, 2018 at 03:47, Vlad Kopylov <vladk...@gmail.com> wrote:
> Maybe I missed something, but CephFS explicitly selects the pools to put 
> files and metadata in, like I did below.
> So if I create new pools, the data in them will be different. If I apply the 
> rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs 
> t01, it will start using dc1 hosts.
> 
> 
> ceph osd pool create cfs_data 100
> ceph osd pool create cfs_meta 100
> ceph fs new t01 cfs_data cfs_meta
> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o name=admin,secretfile=/home/mciadmin/admin.secret
> 
> rule dc1_primary {
>         id 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take dc1
>         step chooseleaf firstn 1 type host
>         step emit
>         step take dc2
>         step chooseleaf firstn -2 type host
>         step emit
>         step take dc3
>         step chooseleaf firstn -2 type host
>         step emit
> }
> 
> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <vladk...@gmail.com> wrote:
> Just to confirm - it will still populate 3 copies, one in each datacenter?
> I thought this map was to select where to write to; I guess it does the write 
> replication on the back end.
> 
> I thought pools are completely separate and clients would not see each 
> other's data?
> 
> Thank you Martin!
> 
> 
> 
> 
> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <martin.ver...@croit.io> wrote:
> Hello Vlad,
> 
> you can generate something like this:
> 
> rule dc1_primary_dc2_secondary {
>         id 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take dc1
>         step chooseleaf firstn 1 type host
>         step emit
>         step take dc2
>         step chooseleaf firstn 1 type host
>         step emit
>         step take dc3
>         step chooseleaf firstn -2 type host
>         step emit
> }
> 
> rule dc2_primary_dc1_secondary {
>         id 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take dc2
>         step chooseleaf firstn 1 type host
>         step emit
>         step take dc1
>         step chooseleaf firstn 1 type host
>         step emit
>         step take dc3
>         step chooseleaf firstn -2 type host
>         step emit
> }
> 
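> A typical way to get such rules into the cluster (a sketch; the file
> names are just examples) is to decompile, edit, and re-inject the crush
> map:
> 
> ~ $ ceph osd getcrushmap -o crushmap.bin
> ~ $ crushtool -d crushmap.bin -o crushmap.txt
> ~ $ # append the rules above to crushmap.txt, then recompile and inject:
> ~ $ crushtool -c crushmap.txt -o crushmap.new
> ~ $ ceph osd setcrushmap -i crushmap.new
> 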
> After you have added these crush rules, you can configure the pools:
> 
> ~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
> ~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
> (On Luminous and newer the pool setting is crush_rule and takes the rule name.)
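> 
> To verify the assignment and see where the PGs of a pool end up (pool
> names are placeholders), you can run:
> 
> ~ $ ceph osd pool get <pool_for_dc1> crush_rule
> ~ $ ceph pg ls-by-pool <pool_for_dc1>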
> 
> Now you place your workload from dc1 in the dc1 pool, and the workload
> from dc2 in the dc2 pool. You could also use HDDs with SSD journals (if
> your workload isn't that write intensive) and save some money in dc3,
> as your clients would always read from an SSD and write to the hybrid setup.
> 
> Btw, all of this could be done with a few simple clicks through our web
> frontend. Even if you want to export it via CephFS / NFS / etc., it is
> possible to set it on a per-folder level. Feel free to take a look at
> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could be.
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> 
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
> 
> 
> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov <vladk...@gmail.com>:
> > Please disregard the pg status - one of the test VMs was down for some time
> > and it is healing.
> > The question is only how to make it read from the proper datacenter.
> >
> > If you have an example.
> >
> > Thanks
> >
> >
> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <vladk...@gmail.com> wrote:
> >>
> >> Martin, thank you for the tip.
> >> Googling ceph crush rule examples doesn't give much on rules, just static
> >> placement of buckets.
> >> This all seems to be about placing data, not about giving a client in a
> >> specific datacenter the proper read osd.
> >>
> >> Maybe something is wrong with the placement groups?
> >>
> >> I added datacenter dc1 dc2 dc3
> >> Current replicated_rule is
> >>
> >> rule replicated_rule {
> >>         id 0
> >>         type replicated
> >>         min_size 1
> >>         max_size 10
> >>         step take default
> >>         step chooseleaf firstn 0 type host
> >>         step emit
> >> }
> >>
> >> # buckets
> >> host ceph1 {
> >>         id -3           # do not change unnecessarily
> >>         id -2 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item osd.0 weight 1.000
> >> }
> >> datacenter dc1 {
> >>         id -9           # do not change unnecessarily
> >>         id -4 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item ceph1 weight 1.000
> >> }
> >> host ceph2 {
> >>         id -5           # do not change unnecessarily
> >>         id -6 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item osd.1 weight 1.000
> >> }
> >> datacenter dc2 {
> >>         id -10          # do not change unnecessarily
> >>         id -8 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item ceph2 weight 1.000
> >> }
> >> host ceph3 {
> >>         id -7           # do not change unnecessarily
> >>         id -12 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item osd.2 weight 1.000
> >> }
> >> datacenter dc3 {
> >>         id -11          # do not change unnecessarily
> >>         id -13 class ssd # do not change unnecessarily
> >>         # weight 1.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item ceph3 weight 1.000
> >> }
> >> root default {
> >>         id -1           # do not change unnecessarily
> >>         id -14 class ssd # do not change unnecessarily
> >>         # weight 3.000
> >>         alg straw2
> >>         hash 0  # rjenkins1
> >>         item dc1 weight 1.000
> >>         item dc2 weight 1.000
> >>         item dc3 weight 1.000
> >> }
> >>
> >>
> >> #ceph pg dump
> >> dumped all
> >> version 29433
> >> stamp 2018-11-09 11:23:44.510872
> >> last_osdmap_epoch 0
> >> last_pg_scan 0
> >> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES    LOG
> >> DISK_LOG STATE                      STATE_STAMP                VERSION
> >> REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP
> >> LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
> >> 1.5f          0                  0        0         0       0        0
> >> 0        0               active+clean 2018-11-09 04:35:32.320607      0'0
> >> 544:1317 [0,2,1]          0 [0,2,1]              0        0'0 2018-11-09
> >> 04:35:32.320561             0'0 2018-11-04 11:55:54.756115             0
> >> 2.5c        143                  0      143         0       0 19490267
> >> 461      461 active+undersized+degraded 2018-11-08 19:02:03.873218  508'461
> >> 544:2100   [2,1]          2   [2,1]              2    290'380 2018-11-07
> >> 18:58:43.043719          64'120 2018-11-05 14:21:49.256324             0
> >> .....
> >> sum 15239 0 2053 2659 0 2157615019 58286 58286
> >> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
> >> 2        3.7 GiB 28 GiB 32 GiB    [0,1]    200             73
> >> 1        3.7 GiB 28 GiB 32 GiB    [0,2]    200             58
> >> 0        3.7 GiB 28 GiB 32 GiB    [1,2]    173             69
> >> sum       11 GiB 85 GiB 96 GiB
> >>
> >> #ceph pg map 2.5c
> >> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
> >>
> >> #ceph pg map 1.5f
> >> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
> >>
> >>
> >> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <martin.ver...@croit.io> wrote:
> >>>
> >>> Hello Vlad,
> >>>
> >>> Ceph clients connect to the primary OSD of each PG. If you create a
> >>> crush rule for building1 and one for building2 that takes an OSD from
> >>> the same building as the first one, your reads from the pool will always
> >>> go to the same building (if the cluster is healthy) and only write
> >>> requests get replicated to the other building.
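> >>>
> >>> To check which OSD acts as primary for a given object, you can run
> >>> something like this (pool and object name are placeholders):
> >>>
> >>> ceph osd map <pool> <objectname>
> >>>
> >>> The 'p<N>' marker in the acting set it prints is the primary OSD that
> >>> serves the reads.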
> >>>
> >>> --
> >>> Martin Verges
> >>> Managing director
> >>>
> >>> Mobile: +49 174 9335695
> >>> E-Mail: martin.ver...@croit.io
> >>> Chat: https://t.me/MartinVerges
> >>>
> >>> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >>> CEO: Martin Verges - VAT-ID: DE310638492
> >>> Com. register: Amtsgericht Munich HRB 231263
> >>>
> >>> Web: https://croit.io
> >>> YouTube: https://goo.gl/PGE1Bx
> >>>
> >>>
> >>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <vladk...@gmail.com>:
> >>> > I am trying to test replicated ceph with servers in different buildings,
> >>> > and I have a read problem.
> >>> > Reads from one building go to an osd in another building and vice versa,
> >>> > making reads slower than writes! Making reads as slow as the slowest node.
> >>> >
> >>> > Is there a way to
> >>> > - disable parallel reads (so it reads only from the same osd node where
> >>> > the mon is);
> >>> > - or give each client a read restriction per osd?
> >>> > - or maybe strictly specify the read osd on mount;
> >>> > - or have a node read delay cap (for example, if a node's timeout is
> >>> > larger than 2 ms, do not use that node for reads while other replicas
> >>> > are available);
> >>> > - or the ability to place clients on the crush map, so it understands
> >>> > that an osd in the same data-center as the client has preference, and
> >>> > pulls data from it/them.
> >>> >
> >>> > Mounting with the kernel client, latest mimic.
> >>> >
> >>> > Thank you!
> >>> >
> >>> > Vlad
> >>> >

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
