Also, isn't Jewel supposed to get more 'performance', since it uses
bluestore to store metadata? Or do I need to specify during install that it
should use bluestore?
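
(Side note: as far as I know, bluestore is still experimental in Jewel and
filestore remains the default, so it does have to be enabled explicitly. A
minimal sketch, assuming Jewel's ceph-disk and the experimental-features
flag -- the exact option names should be checked against the Jewel docs:

  # ceph.conf on the OSD hosts
  [osd]
  enable experimental unrecoverable data corrupting features = bluestore rocksdb

  # prepare a new OSD with a bluestore backend
  ceph-disk prepare --bluestore /dev/sdc
)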

Thanks,


German

2016-04-07 16:55 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> Ceph is not able to use native Infiniband protocols yet and so it is
> only leveraging IPoIB at the moment. The most likely reason you are
> only getting ~10 Gb performance is that IPoIB heavily leverages
> multicast in Infiniband (if you do some research in this area you will
> understand why unicast IP still uses multicast on an Infiniband
> network). To be extremely compatible with all adapters, the subnet
> manager will set the speed of multicast to 10 Gb/s so that SDR
> adapters can be used and not drop packets. If you know that you will
> never have adapters under a certain speed, you can configure the
> subnet manager to use a higher speed. This does not change IPoIB
> networks that are already configured (I had to down all the IPoIB
> adapters at the same time and bring them back up to upgrade the speed).
> Even after that, there still wasn't similar performance to native
> Infiniband, but I got at least a 2x improvement (along with setting
> the MTU to 64K) on the FDR adapters. There is still a ton of overhead
> for doing IPoIB so it is not an ideal transport to get performance on
> Infiniband; I think of it as a compatibility feature. Hopefully, that
> will give you enough information to perform the research. If you
> search the OFED mailing list, you will see some posts from me 2-3
> years ago regarding this very topic.
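>
> [A rough sketch of the opensm side, in case it helps -- the rate/mtu codes
> below are my assumption for an all-FDR fabric and should be verified
> against your opensm version:
>
> # /etc/opensm/partitions.conf
> # rate=12 -> 56 Gb/s (FDR 4x), mtu=5 -> 4096 bytes
> Default=0x7fff, ipoib, rate=12, mtu=5 : ALL=full;
>
> # restart opensm, then take all IPoIB interfaces down and back up so the
> # multicast group is recreated at the new rate
> service opensm restart]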
>
> Good luck and keep holding out for Ceph with XIO.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Apr 7, 2016 at 1:43 PM, German Anders <gand...@despegar.com>
> wrote:
> > Hi Cephers,
> >
> > I've set up a production Ceph cluster with the Jewel release
> > (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
> > Servers and 6 OSD Servers:
> >
> > 3x MON Servers:
> > 2x Intel Xeon E5-2630v3@2.40Ghz
> > 384GB RAM
> > 2x 200G Intel DC3700 in RAID-1 for OS
> > 1x InfiniBand ConnectX-3 ADPT DP
> >
> > 6x OSD Servers:
> > 2x Intel Xeon E5-2650v2@2.60Ghz
> > 128GB RAM
> > 2x 200G Intel DC3700 in RAID-1 for OS
> > 12x 800G Intel DC3510 (osd & journal) on same device
> > 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other
> > on the CLUS network)
> >
> > ceph.conf file is:
> >
> > [global]
> > fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
> > mon_initial_members = cibm01, cibm02, cibm03
> > mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = xx.xx.16.0/20
> > cluster_network = xx.xx.32.0/20
> >
> > [mon]
> >
> > [mon.cibm01]
> > host = cibm01
> > mon_addr = xx.xx.xx.1:6789
> >
> > [mon.cibm02]
> > host = cibm02
> > mon_addr = xx.xx.xx.2:6789
> >
> > [mon.cibm03]
> > host = cibm03
> > mon_addr = xx.xx.xx.3:6789
> >
> > [osd]
> > osd_pool_default_size = 2
> > osd_pool_default_min_size = 1
> >
> > ## OSD Configuration ##
> > [osd.0]
> > host = cibn01
> > public_addr = xx.xx.17.1
> > cluster_addr = xx.xx.32.1
> >
> > [osd.1]
> > host = cibn01
> > public_addr = xx.xx.17.1
> > cluster_addr = xx.xx.32.1
> >
> > ...
> >
> >
> >
> > They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions on
> > each disk, since all the OSDs are on SSDs (Intel DC3510 800G). For
> > example:
> >
> > sdc                              8:32   0 745.2G  0 disk
> > |-sdc1                           8:33   0 740.2G  0 part /var/lib/ceph/osd/ceph-0
> > `-sdc2                           8:34   0     5G  0 part
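> >
> > (Sketch only, values assumed: a colocated data+journal layout like the
> > above is roughly what ceph-disk produces by default, with the journal
> > size taken from ceph.conf:
> >
> >   [osd]
> >   osd_journal_size = 5120        # 5 GB journal partition
> >
> >   ceph-disk prepare /dev/sdc     # creates sdc1 (data) + sdc2 (journal)
> > )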
> >
> > The purpose of this cluster is to serve as backend storage for Cinder
> > volumes (RBD) and Glance images in an OpenStack cloud; most of the
> > clusters on OpenStack will be non-relational databases like Cassandra,
> > with many instances each.
> >
> > All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
> > Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
> >
> >
> > So I assumed that performance would be really nice, right? ...but I'm
> > getting some numbers that I think should be much better.
> >
> > # rados --pool rbd bench 10 write -t 16
> >
> > Total writes made:      1964
> > Write size:             4194304
> > Object size:            4194304
> > Bandwidth (MB/sec):     755.435
> >
> > Stddev Bandwidth:       90.3288
> > Max bandwidth (MB/sec): 884
> > Min bandwidth (MB/sec): 612
> > Average IOPS:           188
> > Stddev IOPS:            22
> > Max IOPS:               221
> > Min IOPS:               153
> > Average Latency(s):     0.0836802
> > Stddev Latency(s):      0.147561
> > Max latency(s):         1.50925
> > Min latency(s):         0.0192736
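> >
> > (For comparison, a sequential-read pass can be run like this -- sketch
> > only; the write phase needs --no-cleanup so the seq bench has objects to
> > read:
> >
> >   rados --pool rbd bench 10 write -t 16 --no-cleanup
> >   rados --pool rbd bench 10 seq -t 16
> >   rados --pool rbd cleanup
> > )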
> >
> >
> > Then I connect to another server (this one is running on QDR - so I would
> > expect something between 2-3 GB/s), map an RBD on the host, then create an
> > ext4 fs and mount it, and finally run a fio test:
> >
> > # fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
> > --time_based --size=10G --loops=1 --ioengine=libaio --direct=1
> > --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
> > --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1
> >
> > fio-2.1.3
> > Starting 8 processes
> > cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
> > Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops]
> > [eta 00m:00s]
> > cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
> >   write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
> >     slat (msec): min=1, max=480, avg=46.15, stdev=63.68
> >     clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
> >      lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
> >     clat percentiles (msec):
> >      |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
> >      | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
> >      | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
> >      | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
> >      | 99.99th=[ 8979]
> >     bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
> > stdev=35226.98
> >     lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
> >     lat (msec) : 2000=45.04%, >=2000=13.69%
> >   cpu          : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
> >   IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=0/w=3821/d=0, short=r=0/w=0/d=0
> >
> > Run status group 0 (all jobs):
> >   WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s,
> > mint=23138msec, maxt=23138msec
> >
> > Disk stats (read/write):
> >   rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
> > util=99.08%
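> >
> > (The map/mkfs/mount steps mentioned above were roughly of this shape --
> > image name and size are assumptions, the mount point is taken from the
> > fio command line; on Jewel the kernel client may also need the image
> > created with --image-feature layering:
> >
> >   rbd create test1 --size 20480 --pool rbd --image-feature layering
> >   rbd map rbd/test1
> >   mkfs.ext4 /dev/rbd0
> >   mount /dev/rbd0 /mnt/host01v1
> > )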
> >
> >
> > Does this look acceptable? I mean, for an InfiniBand network I think the
> > throughput needs to be better. How much more can I expect to achieve by
> > tuning the servers? The MTU on the OSD servers is:
> >
> > MTU: 65520
> > No dropped packets found
> > txqueuelen:256
> >
> > Also, I've set the following in the openib.conf file:
> > ...
> > SET_IPOIB_CM=yes
> > IPOIB_MTU=65520
> > ...
> >
> > And in the mlnx.conf file:
> > ...
> >
> > options mlx4_core enable_sys_tune=1
> > options mlx4_core log_num_mgm_entry_size=-7
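> >
> > (To double-check that connected mode, the large MTU and the link rate
> > actually took effect, something like the following can be used -- the
> > interface name ib0 is an assumption:
> >
> >   cat /sys/class/net/ib0/mode    # should print "connected"
> >   ip link show ib0 | grep mtu    # should show mtu 65520
> >   ibstat | grep Rate             # negotiated link rate per port
> > )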
> >
> >
> > Can anyone here with experience with InfiniBand setups give me any hints
> > to 'improve' performance? I'm getting similar numbers with another
> > cluster on a 10GbE network :S
> >
> >
> > Thanks,
> >
> > German
> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
