Also, what tuned profile are you using? There is something to be gained by matching the tuned profile to your workload.
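For example, checking and switching profiles with the standard tuned-adm commands (which profile actually wins is workload-dependent, so treat network-latency below as a starting point to benchmark, not a verdict):

    tuned-adm active                     # show the currently applied profile
    tuned-adm list                       # list profiles available on this host
    tuned-adm profile network-latency    # latency-oriented profile, often a
                                         # reasonable candidate for all-flash
                                         # Ceph OSD nodes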
On Mon, Nov 27, 2017 at 11:16 AM, Donny Davis <do...@fortnebula.com> wrote:

> Why not ask Red Hat? All the rest of the storage vendors you are looking
> at are not free.
>
> Full disclosure, I am an employee at Red Hat.
>
> On Mon, Nov 27, 2017 at 10:16 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German Anders
>> Sent: 27 November 2017 14:44
>> To: Maged Mokhtar <mmokh...@petasan.org>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>>
>> Hi Maged,
>>
>> Thanks a lot for the response. We tried different numbers of threads and
>> we're getting almost the same kind of difference between the storage types.
>> We're going to try different rbd stripe sizes and object size values and see
>> if we get more competitive numbers. We will get back with more tests and
>> parameter changes to see if we do better. :)
>>
>> Just to echo a couple of comments: Ceph will always struggle to match the
>> performance of a traditional array, for mainly two reasons.
>>
>> 1. You are replacing some sort of dual-ported SAS or internally RDMA-connected
>>    device with a network for Ceph replication traffic. This instantly has a
>>    large impact on write latency.
>> 2. Ceph locks at the PG level, and a PG will most likely cover at least one
>>    4MB object, so lots of small accesses to the same blocks (on a block
>>    device) will wait on each other and effectively proceed at a
>>    single-threaded rate.
>>
>> The best things you can do to mitigate these are to run the fastest
>> journal/WAL devices you can, use the fastest network connections
>> (e.g. 25Gb/s), and keep your CPUs at their highest-performance C- and
>> P-states.
>>
>> You stated that you are running the performance profile on the CPUs. Could
>> you also double-check that the C-states are being held at C1(e)? There are
>> a few utilities that can show this in real time.
>>
>> Other than that, although there could be some minor tweaks, you are probably
>> nearing the limit of what you can hope to achieve.
>>
>> Nick
>>
>> Thanks,
>>
>> Best,
>>
>> German
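On Nick's C-state question: two standard ways to watch C-state residency in real time are sketched below. Both tools ship with most distributions' kernel/cpupower tooling, and the boot parameters in the comment are the usual intel_idle knobs, mentioned as an option rather than a recommendation:

    # Per-core idle-state residency, printed continuously
    turbostat

    # Alternative view of idle-state usage per CPU
    cpupower monitor

    # If cores are dropping below C1(e), one option is to cap the deepest
    # C-state at boot with kernel parameters such as:
    #   intel_idle.max_cstate=1 processor.max_cstate=1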
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:
>>
>> On 2017-11-27 15:02, German Anders wrote:
>>
>> Hi All,
>>
>> I have a performance question. We recently installed a brand-new Ceph
>> cluster with all-NVMe disks, using ceph version 12.2.0 with bluestore
>> configured. The back-end of the cluster uses a bonded IPoIB link
>> (active/passive), and the front-end uses an active/active bond (20GbE)
>> to communicate with the clients.
>>
>> The cluster configuration is the following:
>>
>> MON nodes:
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 3x 1U servers:
>>   2x Intel Xeon E5-2630v4 @2.2GHz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>> OSD nodes:
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 4x 2U servers:
>>   2x Intel Xeon E5-2640v4 @2.4GHz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   1x Ethernet Controller 10G X550T
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>> Here's the tree:
>>
>> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
>> -7       48.00000 root root
>> -5       24.00000     rack rack1
>> -1       12.00000         node cpn01
>>  0  nvme  1.00000             osd.0      up  1.00000 1.00000
>>  1  nvme  1.00000             osd.1      up  1.00000 1.00000
>>  2  nvme  1.00000             osd.2      up  1.00000 1.00000
>>  3  nvme  1.00000             osd.3      up  1.00000 1.00000
>>  4  nvme  1.00000             osd.4      up  1.00000 1.00000
>>  5  nvme  1.00000             osd.5      up  1.00000 1.00000
>>  6  nvme  1.00000             osd.6      up  1.00000 1.00000
>>  7  nvme  1.00000             osd.7      up  1.00000 1.00000
>>  8  nvme  1.00000             osd.8      up  1.00000 1.00000
>>  9  nvme  1.00000             osd.9      up  1.00000 1.00000
>> 10  nvme  1.00000             osd.10     up  1.00000 1.00000
>> 11  nvme  1.00000             osd.11     up  1.00000 1.00000
>> -3       12.00000         node cpn03
>> 24  nvme  1.00000             osd.24     up  1.00000 1.00000
>> 25  nvme  1.00000             osd.25     up  1.00000 1.00000
>> 26  nvme  1.00000             osd.26     up  1.00000 1.00000
>> 27  nvme  1.00000             osd.27     up  1.00000 1.00000
>> 28  nvme  1.00000             osd.28     up  1.00000 1.00000
>> 29  nvme  1.00000             osd.29     up  1.00000 1.00000
>> 30  nvme  1.00000             osd.30     up  1.00000 1.00000
>> 31  nvme  1.00000             osd.31     up  1.00000 1.00000
>> 32  nvme  1.00000             osd.32     up  1.00000 1.00000
>> 33  nvme  1.00000             osd.33     up  1.00000 1.00000
>> 34  nvme  1.00000             osd.34     up  1.00000 1.00000
>> 35  nvme  1.00000             osd.35     up  1.00000 1.00000
>> -6       24.00000     rack rack2
>> -2       12.00000         node cpn02
>> 12  nvme  1.00000             osd.12     up  1.00000 1.00000
>> 13  nvme  1.00000             osd.13     up  1.00000 1.00000
>> 14  nvme  1.00000             osd.14     up  1.00000 1.00000
>> 15  nvme  1.00000             osd.15     up  1.00000 1.00000
>> 16  nvme  1.00000             osd.16     up  1.00000 1.00000
>> 17  nvme  1.00000             osd.17     up  1.00000 1.00000
>> 18  nvme  1.00000             osd.18     up  1.00000 1.00000
>> 19  nvme  1.00000             osd.19     up  1.00000 1.00000
>> 20  nvme  1.00000             osd.20     up  1.00000 1.00000
>> 21  nvme  1.00000             osd.21     up  1.00000 1.00000
>> 22  nvme  1.00000             osd.22     up  1.00000 1.00000
>> 23  nvme  1.00000             osd.23     up  1.00000 1.00000
>> -4       12.00000         node cpn04
>> 36  nvme  1.00000             osd.36     up  1.00000 1.00000
>> 37  nvme  1.00000             osd.37     up  1.00000 1.00000
>> 38  nvme  1.00000             osd.38     up  1.00000 1.00000
>> 39  nvme  1.00000             osd.39     up  1.00000 1.00000
>> 40  nvme  1.00000             osd.40     up  1.00000 1.00000
>> 41  nvme  1.00000             osd.41     up  1.00000 1.00000
>> 42  nvme  1.00000             osd.42     up  1.00000 1.00000
>> 43  nvme  1.00000             osd.43     up  1.00000 1.00000
>> 44  nvme  1.00000             osd.44     up  1.00000 1.00000
>> 45  nvme  1.00000             osd.45     up  1.00000 1.00000
>> 46  nvme  1.00000             osd.46     up  1.00000 1.00000
>> 47  nvme  1.00000             osd.47     up  1.00000 1.00000
>> The disk partition layout on one of the OSD nodes:
>>
>> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
>> nvme6n1                259:1    0   1.1T  0 disk
>> ├─nvme6n1p2            259:15   0   1.1T  0 part
>> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
>> nvme9n1                259:0    0   1.1T  0 disk
>> ├─nvme9n1p2            259:8    0   1.1T  0 part
>> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
>> sdb                      8:16   0 139.8G  0 disk
>> └─sdb1                   8:17   0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme11n1               259:2    0   1.1T  0 disk
>> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
>> └─nvme11n1p2           259:14   0   1.1T  0 part
>> nvme2n1                259:6    0   1.1T  0 disk
>> ├─nvme2n1p2            259:21   0   1.1T  0 part
>> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
>> nvme5n1                259:3    0   1.1T  0 disk
>> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
>> └─nvme5n1p2            259:10   0   1.1T  0 part
>> nvme8n1                259:24   0   1.1T  0 disk
>> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
>> └─nvme8n1p2            259:28   0   1.1T  0 part
>> nvme10n1               259:11   0   1.1T  0 disk
>> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
>> └─nvme10n1p2           259:23   0   1.1T  0 part
>> nvme1n1                259:33   0   1.1T  0 disk
>> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
>> └─nvme1n1p2            259:35   0   1.1T  0 part
>> nvme4n1                259:5    0   1.1T  0 disk
>> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
>> └─nvme4n1p2            259:19   0   1.1T  0 part
>> nvme7n1                259:25   0   1.1T  0 disk
>> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
>> └─nvme7n1p2            259:29   0   1.1T  0 part
>> sda                      8:0    0 139.8G  0 disk
>> └─sda1                   8:1    0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme0n1                259:36   0   1.1T  0 disk
>> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
>> └─nvme0n1p2            259:38   0   1.1T  0 part
>> nvme3n1                259:4    0   1.1T  0 disk
>> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
>> └─nvme3n1p2            259:17   0   1.1T  0 part
>>
>> For the disk scheduler we're using kyber; for read_ahead_kb we tried
>> different values (0, 128 and 2048); rq_affinity is set to 2; and the
>> rotational parameter is set to 0.
>> We've also set the CPU governor to performance on all cores, and tuned
>> some sysctl parameters as well:
>>
>> # for Ceph
>> net.ipv4.ip_forward=0
>> net.ipv4.conf.default.rp_filter=1
>> kernel.sysrq=0
>> kernel.core_uses_pid=1
>> net.ipv4.tcp_syncookies=0
>> #net.netfilter.nf_conntrack_max=2621440
>> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
>> # disable netfilter on bridges
>> #net.bridge.bridge-nf-call-ip6tables = 0
>> #net.bridge.bridge-nf-call-iptables = 0
>> #net.bridge.bridge-nf-call-arptables = 0
>> vm.min_free_kbytes=1000000
>>
>> # Controls the default maximum size of a message queue, in bytes
>> kernel.msgmnb = 65536
>>
>> # Controls the maximum size of a single message, in bytes
>> kernel.msgmax = 65536
>>
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>>
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
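One note on the block-layer settings above: they do not persist across reboots, so a loop like the sketch below (assuming the nvme*n1 device naming shown in the listing, with the same values being tested) is a common way to reapply them to all OSD devices, e.g. from a boot script or udev rule:

    # Apply the queue settings discussed above to every NVMe OSD device
    for dev in /sys/block/nvme*n1; do
        echo kyber > "$dev/queue/scheduler"      # kyber needs a blk-mq kernel; 4.12 has it
        echo 128   > "$dev/queue/read_ahead_kb"  # one of the values tried: 0, 128, 2048
        echo 2     > "$dev/queue/rq_affinity"    # complete requests on the submitting CPU
        echo 0     > "$dev/queue/rotational"     # NVMe normally reports 0 already
    done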
>> The ceph.conf file is:
>>
>> ...
>> osd_pool_default_size = 3
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 1600
>> osd_pool_default_pgp_num = 1600
>>
>> debug_crush = 1/1
>> debug_buffer = 0/1
>> debug_timer = 0/0
>> debug_filer = 0/1
>> debug_objecter = 0/1
>> debug_rados = 0/5
>> debug_rbd = 0/5
>> debug_ms = 0/5
>> debug_throttle = 1/1
>>
>> debug_journaler = 0/0
>> debug_objectcatcher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_journal = 0/0
>> debug_filestore = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>>
>> osd_crush_chooseleaf_type = 0
>> filestore_xattr_use_omap = true
>>
>> rbd_cache = true
>> mon_compact_on_trim = false
>>
>> [osd]
>> osd_crush_update_on_start = false
>>
>> [client]
>> rbd_cache = true
>> rbd_cache_writethrough_until_flush = true
>> rbd_default_features = 1
>> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>> log_file = /var/log/ceph/
>>
>> The cluster has two production pools: one for openstack (volumes) with a
>> replication factor of 3, and another pool for databases (db) with a
>> replication factor of 2. The DBA team has performed several tests with an
>> RBD volume mounted on the DB server. The DB server has the following
>> configuration:
>>
>> OS: CentOS 6.9 | kernel 4.14.1
>> DB: MySQL
>> ProLiant BL685c G7
>> 4x AMD Opteron Processor 6376 (64 cores total)
>> 128G RAM
>> 1x OneConnect 10Gb NIC (quad-port), in an active/active bond with 3 VLANs
>>
>> We also ran some tests with sysbench on different storage types:
>>
>>   disk           tps      qps       latency (ms, 95th percentile)
>>   Local SSD      261.28   5225.61    5.18
>>   Ceph NVMe       95.18   1903.53   12.30
>>   Pure Storage   196.49   3929.71    6.32
>>   NetApp FAS     189.83   3796.59    6.67
>>   EMC VMAX       196.14   3922.82    6.32
>>
>> Is there any specific tuning that I can apply to the ceph cluster in order
>> to improve those numbers? Or are those numbers OK for the type and size of
>> cluster that we have? Any advice would be really appreciated.
>>
>> Thanks,
>>
>> German
>>
>> Hi,
>>
>> What is the value of --num-threads (the default is 1)? Ceph will do better
>> with more threads: 32 or 64.
>> What is the value of --file-block-size (default 16k), and which
>> file-test-mode are you using? With sequential modes (seqwr/seqrd) you will
>> be hitting the same OSD, so maybe try random modes (rndrd/rndwr), or
>> better, use an rbd stripe size of 16kb (the default rbd stripe is 4M). rbd
>> striping is ideal for the small-block sequential I/O pattern typical of
>> databases.
>>
>> /Maged
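To make Maged's striping suggestion concrete, a rough sketch (the pool and image names are made up, the sizes are illustrative, and the sysbench flags follow the older 0.5-style fileio syntax matching the --num-threads option above; note also that fancy striping is a librbd feature, so kernel-mapped RBD on older kernels cannot use such images):

    # Image whose 16k stripe unit fans small sequential writes out across
    # 16 objects (and therefore across OSDs) instead of serializing them
    # on a single 4M object
    rbd create db/mysql-bench --size 102400 \
        --stripe-unit 16384 --stripe-count 16

    # With a filesystem on the image, drive it with random small-block I/O
    # and enough threads to keep the cluster busy
    sysbench --test=fileio --file-total-size=8G prepare
    sysbench --test=fileio --file-total-size=8G --file-test-mode=rndwr \
        --file-block-size=16384 --num-threads=32 --max-time=60 run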
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com