Hi David,

Thanks a lot for the response. We actually first tried running with no scheduler at all, but then we tried the kyber iosched and noticed a slight performance improvement, which is why we kept it.
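For anyone who wants to repeat the comparison, a minimal sketch of the per-device A/B test (nvme0n1 stands in for each OSD device; with blk-mq, "none" means no scheduler at all):

    # show the available schedulers - the active one appears in brackets
    cat /sys/block/nvme0n1/queue/scheduler

    # flip between kyber and no scheduler, re-running the benchmark after each
    echo kyber > /sys/block/nvme0n1/queue/scheduler
    echo none > /sys/block/nvme0n1/queue/scheduler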
*German*

2017-11-27 13:48 GMT-03:00 David Byte <db...@suse.com>:

> From the benchmarks I have seen and done myself, I'm not sure why you
> are using an i/o scheduler at all with NVMe. While there are a few cases
> where it may provide a slight benefit, simply having mq enabled with no
> scheduler seems to provide the best performance for an all-flash,
> especially all-NVMe, environment.
>
> David Byte
> Sr. Technology Strategist
> *SCE Enterprise Linux*
> *SCE Enterprise Storage*
> Alliances and SUSE Embedded
> db...@suse.com
> 918.528.4422
>
> *From: *ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of German Anders <gand...@despegar.com>
> *Date: *Monday, November 27, 2017 at 8:44 AM
> *To: *Maged Mokhtar <mmokh...@petasan.org>
> *Cc: *ceph-users <ceph-users@lists.ceph.com>
> *Subject: *Re: [ceph-users] ceph all-nvme mysql performance tuning
>
> Hi Maged,
>
> Thanks a lot for the response. We tried different numbers of threads and
> we're getting almost the same kind of difference between the storage
> types. Going to try different rbd stripe size and object size values and
> see if we get more competitive numbers. Will get back with more tests
> and param changes to see if we get better :)
>
> Thanks,
>
> Best,
>
> *German*
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
> I have a performance question. We recently installed a brand new Ceph
> cluster with all-NVMe disks, using ceph version 12.2.0 with bluestore
> configured. The back-end of the cluster uses a bonded IPoIB link
> (active/passive), and the front-end uses an active/active bond (20GbE)
> to communicate with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
> 2x Intel Xeon E5-2630v4 @2.2GHz
> 128G RAM
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
> 2x Intel Xeon E5-2640v4 @2.4GHz
> 128G RAM
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> 1x Ethernet Controller 10G X550T
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
> Here's the tree:
>
> ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
> -7       48.00000 root root
> -5       24.00000     rack rack1
> -1       12.00000         node cpn01
>  0  nvme  1.00000             osd.0       up  1.00000 1.00000
>  1  nvme  1.00000             osd.1       up  1.00000 1.00000
>  2  nvme  1.00000             osd.2       up  1.00000 1.00000
>  3  nvme  1.00000             osd.3       up  1.00000 1.00000
>  4  nvme  1.00000             osd.4       up  1.00000 1.00000
>  5  nvme  1.00000             osd.5       up  1.00000 1.00000
>  6  nvme  1.00000             osd.6       up  1.00000 1.00000
>  7  nvme  1.00000             osd.7       up  1.00000 1.00000
>  8  nvme  1.00000             osd.8       up  1.00000 1.00000
>  9  nvme  1.00000             osd.9       up  1.00000 1.00000
> 10  nvme  1.00000             osd.10      up  1.00000 1.00000
> 11  nvme  1.00000             osd.11      up  1.00000 1.00000
> -3       12.00000         node cpn03
> 24  nvme  1.00000             osd.24      up  1.00000 1.00000
> 25  nvme  1.00000             osd.25      up  1.00000 1.00000
> 26  nvme  1.00000             osd.26      up  1.00000 1.00000
> 27  nvme  1.00000             osd.27      up  1.00000 1.00000
> 28  nvme  1.00000             osd.28      up  1.00000 1.00000
> 29  nvme  1.00000             osd.29      up  1.00000 1.00000
> 30  nvme  1.00000             osd.30      up  1.00000 1.00000
> 31  nvme  1.00000             osd.31      up  1.00000 1.00000
> 32  nvme  1.00000             osd.32      up  1.00000 1.00000
> 33  nvme  1.00000             osd.33      up  1.00000 1.00000
> 34  nvme  1.00000             osd.34      up  1.00000 1.00000
> 35  nvme  1.00000             osd.35      up  1.00000 1.00000
> -6       24.00000     rack rack2
> -2       12.00000         node cpn02
> 12  nvme  1.00000             osd.12      up  1.00000 1.00000
> 13  nvme  1.00000             osd.13      up  1.00000 1.00000
> 14  nvme  1.00000             osd.14      up  1.00000 1.00000
> 15  nvme  1.00000             osd.15      up  1.00000 1.00000
> 16  nvme  1.00000             osd.16      up  1.00000 1.00000
> 17  nvme  1.00000             osd.17      up  1.00000 1.00000
> 18  nvme  1.00000             osd.18      up  1.00000 1.00000
> 19  nvme  1.00000             osd.19      up  1.00000 1.00000
> 20  nvme  1.00000             osd.20      up  1.00000 1.00000
> 21  nvme  1.00000             osd.21      up  1.00000 1.00000
> 22  nvme  1.00000             osd.22      up  1.00000 1.00000
> 23  nvme  1.00000             osd.23      up  1.00000 1.00000
> -4       12.00000         node cpn04
> 36  nvme  1.00000             osd.36      up  1.00000 1.00000
> 37  nvme  1.00000             osd.37      up  1.00000 1.00000
> 38  nvme  1.00000             osd.38      up  1.00000 1.00000
> 39  nvme  1.00000             osd.39      up  1.00000 1.00000
> 40  nvme  1.00000             osd.40      up  1.00000 1.00000
> 41  nvme  1.00000             osd.41      up  1.00000 1.00000
> 42  nvme  1.00000             osd.42      up  1.00000 1.00000
> 43  nvme  1.00000             osd.43      up  1.00000 1.00000
> 44  nvme  1.00000             osd.44      up  1.00000 1.00000
> 45  nvme  1.00000             osd.45      up  1.00000 1.00000
> 46  nvme  1.00000             osd.46      up  1.00000 1.00000
> 47  nvme  1.00000             osd.47      up  1.00000 1.00000
>
> The disk partitioning on one of the OSD nodes:
>
> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> nvme6n1                259:1    0   1.1T  0 disk
> ├─nvme6n1p2            259:15   0   1.1T  0 part
> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
> nvme9n1                259:0    0   1.1T  0 disk
> ├─nvme9n1p2            259:8    0   1.1T  0 part
> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
> sdb                      8:16   0 139.8G  0 disk
> └─sdb1                   8:17   0 139.8G  0 part
>   └─md0                  9:0    0 139.6G  0 raid1
>     ├─md0p2            259:31   0     1K  0 md
>     ├─md0p5            259:32   0 139.1G  0 md
>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>     └─md0p1            259:30   0 486.3M  0 md    /boot
> nvme11n1               259:2    0   1.1T  0 disk
> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
> └─nvme11n1p2           259:14   0   1.1T  0 part
> nvme2n1                259:6    0   1.1T  0 disk
> ├─nvme2n1p2            259:21   0   1.1T  0 part
> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
> nvme5n1                259:3    0   1.1T  0 disk
> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
> └─nvme5n1p2            259:10   0   1.1T  0 part
> nvme8n1                259:24   0   1.1T  0 disk
> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
> └─nvme8n1p2            259:28   0   1.1T  0 part
> nvme10n1               259:11   0   1.1T  0 disk
> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
> └─nvme10n1p2           259:23   0   1.1T  0 part
> nvme1n1                259:33   0   1.1T  0 disk
> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
> └─nvme1n1p2            259:35   0   1.1T  0 part
> nvme4n1                259:5    0   1.1T  0 disk
> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
> └─nvme4n1p2            259:19   0   1.1T  0 part
> nvme7n1                259:25   0   1.1T  0 disk
> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
> └─nvme7n1p2            259:29   0   1.1T  0 part
> sda                      8:0    0 139.8G  0 disk
> └─sda1                   8:1    0 139.8G  0 part
>   └─md0                  9:0    0 139.6G  0 raid1
>     ├─md0p2            259:31   0     1K  0 md
>     ├─md0p5            259:32   0 139.1G  0 md
>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>     └─md0p1            259:30   0 486.3M  0 md    /boot
> nvme0n1                259:36   0   1.1T  0 disk
> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
> └─nvme0n1p2            259:38   0   1.1T  0 part
> nvme3n1                259:4    0   1.1T  0 disk
> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
> └─nvme3n1p2            259:17   0   1.1T  0 part
>
> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried
> different values (0, 128 and 2048); rq_affinity is set to 2, and the
> rotational parameter is set to 0.
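For concreteness, those per-device settings map onto the following sysfs knobs (a sketch with nvme0n1 as the example device; the same values are applied to all twelve NVMe drives):

    echo 128 > /sys/block/nvme0n1/queue/read_ahead_kb  # also tried 0 and 2048
    echo 2 > /sys/block/nvme0n1/queue/rq_affinity      # complete I/O on the submitting CPU
    echo 0 > /sys/block/nvme0n1/queue/rotational       # report the device as non-rotational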
> We've also set the CPU governor to performance on all cores, and tuned
> some sysctl parameters as well:
>
> # for Ceph
> net.ipv4.ip_forward=0
> net.ipv4.conf.default.rp_filter=1
> kernel.sysrq=0
> kernel.core_uses_pid=1
> net.ipv4.tcp_syncookies=0
> #net.netfilter.nf_conntrack_max=2621440
> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
> # disable netfilter on bridges
> #net.bridge.bridge-nf-call-ip6tables = 0
> #net.bridge.bridge-nf-call-iptables = 0
> #net.bridge.bridge-nf-call-arptables = 0
> vm.min_free_kbytes=1000000
>
> # Controls the default maximum size of a message queue, in bytes
> kernel.msgmnb = 65536
>
> # Controls the maximum size of a message, in bytes
> kernel.msgmax = 65536
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 68719476736
>
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
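Applied roughly like this (a sketch; the sysctl.d file name is just an example of where the settings above could live):

    # performance governor on every core
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > $g
    done

    # load the sysctl settings from a file
    sysctl -p /etc/sysctl.d/90-ceph.conf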
> The ceph.conf file is:
>
> ...
> osd_pool_default_size = 3
> osd_pool_default_min_size = 2
> osd_pool_default_pg_num = 1600
> osd_pool_default_pgp_num = 1600
>
> debug_crush = 1/1
> debug_buffer = 0/1
> debug_timer = 0/0
> debug_filer = 0/1
> debug_objecter = 0/1
> debug_rados = 0/5
> debug_rbd = 0/5
> debug_ms = 0/5
> debug_throttle = 1/1
>
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_journal = 0/0
> debug_filestore = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
>
> osd_crush_chooseleaf_type = 0
> filestore_xattr_use_omap = true
>
> rbd_cache = true
> mon_compact_on_trim = false
>
> [osd]
> osd_crush_update_on_start = false
>
> [client]
> rbd_cache = true
> rbd_cache_writethrough_until_flush = true
> rbd_default_features = 1
> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log_file = /var/log/ceph/
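To verify what a running daemon actually picked up from that file, the admin socket can be queried (a sketch; with the admin_socket template above, the real socket name also contains the pid and cctid, hence the glob):

    # dump the live configuration of osd.0 and check one of the debug settings
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.*.asok config show | grep debug_osd

    # or adjust a value at runtime without restarting the daemon
    ceph tell osd.0 injectargs '--debug_osd 0/0'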
> The cluster has two production pools: one for openstack (volumes) with a
> replication factor of 3, and another pool for the database (db) with a
> replication factor of 2. The DBA team has performed several tests with a
> volume mounted on the DB server (with RBD). The DB server has the
> following configuration:
>
> OS: CentOS 6.9 | kernel 4.14.1
> DB: MySQL
> ProLiant BL685c G7
> 4x AMD Opteron Processor 6376 (total of 64 cores)
> 128G RAM
> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
> (active/active) with 3 vlans
>
> We also did some tests with *sysbench* on different storage types:
>
> disk          tps     qps      latency (ms) 95th percentile
> Local SSD     261.28  5225.61   5.18
> Ceph NVMe      95.18  1903.53  12.30
> Pure Storage  196.49  3929.71   6.32
> NetApp FAS    189.83  3796.59   6.67
> EMC VMAX      196.14  3922.82   6.32
>
> Is there any specific tuning that I can apply to the ceph cluster in
> order to improve those numbers? Or are those numbers ok for the type and
> size of the cluster that we have? Any advice would be really appreciated.
>
> Thanks,
>
> *German*
>
> Hi,
>
> What is the value of --num-threads (default is 1)? Ceph will do better
> with more threads: 32 or 64.
> What is the value of --file-block-size (default 16k) and file-test-mode?
> If you are using sequential seqwr/seqrd you will be hitting the same
> OSD, so maybe try random (rndrd/rndwr), or better, use an rbd stripe
> size of 16kb (the default rbd stripe is 4M). rbd striping is ideal for
> the small-block sequential I/O pattern typical in databases.
>
> /Maged
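A minimal sketch of what that suggestion looks like on the command line (old-style sysbench option names, matching the --num-threads/--file-block-size flags above; pool and image names are placeholders):

    # fileio benchmark: random reads, 16k blocks, 32 threads
    sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrd \
        --file-block-size=16384 --num-threads=32 prepare
    sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrd \
        --file-block-size=16384 --num-threads=32 run

    # rbd image with a 16k stripe unit spread over 16 objects instead of the
    # default 4M stripe (needs the striping feature enabled on the image)
    rbd create db/mysql-bench --size 100G --stripe-unit 16384 --stripe-count 16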
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com