On Wed, Jul 29, 2015 at 11:23 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 07/29/2015 10:13 AM, Jake Young wrote:
>>
>> On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic
>> <frederic.sch...@cea.fr> wrote:
>> >
>> > Hi again,
>> >
>> > So I have tried:
>> > - changing the CPU frequency: either 1.6GHz or 2.4GHz on all cores
>> > - changing the memory configuration from "advanced ecc mode" to
>> >   "performance mode", boosting the memory bandwidth from 35GB/s to 40GB/s
>> > - plugging in a second 10Gb/s link and setting up a Ceph internal network
>> > - various "tuned-adm profile" settings, such as "throughput-performance"
>> >
>> > None of this changed anything.
>> >
>> > If
>> > - the CPUs are not maxed out, and lowering the frequency doesn't change a thing
>> > - the network is not maxed out
>> > - the memory doesn't seem to have an impact
>> > - network interrupts are spread across all 8 CPU cores and receive queues are OK
>> > - the disks are not used to their full potential (iostat shows my dd
>> >   commands produce far more tps than the 4MB Ceph transfers...)
>> >
>> > Where can I possibly find a bottleneck?
>> >
>> > I'm (almost) out of ideas... :'(
>> >
>> > Regards
>>
>> Frederic,
>>
>> I was trying to optimize my Ceph cluster as well, and I looked at all of
>> the same things you described, which didn't help my performance noticeably.
>>
>> The following network kernel tuning settings did help me significantly.
>>
>> This is my /etc/sysctl.conf file on all of my hosts: Ceph mons, Ceph
>> osds and any client that connects to my Ceph cluster.
>>
>> # Increase Linux autotuning TCP buffer limits
>> # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
>> # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
>> #net.core.rmem_max = 56623104
>> #net.core.wmem_max = 56623104
>> # Use 128M buffers
>> net.core.rmem_max = 134217728
>> net.core.wmem_max = 134217728
>> net.core.rmem_default = 67108864
>> net.core.wmem_default = 67108864
>> net.core.optmem_max = 134217728
>> net.ipv4.tcp_rmem = 4096 87380 67108864
>> net.ipv4.tcp_wmem = 4096 65536 67108864
>>
>> # Make room for more TIME_WAIT sockets due to more clients,
>> # and allow them to be reused if we run out of sockets
>> # Also increase the max packet backlog
>> net.core.somaxconn = 1024
>> # Increase the length of the processor input queue
>> net.core.netdev_max_backlog = 250000
>> net.ipv4.tcp_max_syn_backlog = 30000
>> net.ipv4.tcp_max_tw_buckets = 2000000
>> net.ipv4.tcp_tw_reuse = 1
>> net.ipv4.tcp_tw_recycle = 1
>> net.ipv4.tcp_fin_timeout = 10
>>
>> # Disable TCP slow start on idle connections
>> net.ipv4.tcp_slow_start_after_idle = 0
>>
>> # If your servers talk UDP, also up these limits
>> net.ipv4.udp_rmem_min = 8192
>> net.ipv4.udp_wmem_min = 8192
>>
>> # Disable source routing and redirects
>> net.ipv4.conf.all.send_redirects = 0
>> net.ipv4.conf.all.accept_redirects = 0
>> net.ipv4.conf.all.accept_source_route = 0
>>
>> # Recommended when jumbo frames are enabled
>> net.ipv4.tcp_mtu_probing = 1
>>
>> I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.
>>
>> Let me know if that helps.
>
> Hi Jake,
>
> Could you talk a little bit about the scenarios where you've seen this
> tuning help? I noticed an improvement in RGW performance in some cases
> with similar TCP tunings, but it would be good to understand what other
> folks are seeing and in what situations.
>> Jake

Hey Mark,

I'm only using RBD. My clients are all VMware, so I have a few iSCSI proxy
VMs (using rbd-enabled tgt). My workload is typically light random
read/write, except for the periodic eager zeroing of multi-terabyte volumes.
Since there is no VAAI support in tgt, that turns into heavy sequential
writing.

I found that the network tuning above helped to "open up" the connection
from a single iSCSI proxy VM to the cluster.

Note that my osd nodes have both a public network interface and a dedicated
private (cluster) network interface, both 40G. I believe the network tuning
also improves the performance of the cluster network (where the replication
data is sent), because I initially applied the kernel tuning only to the osd
nodes and saw a performance improvement before I implemented it on the iSCSI
proxy VMs.

I should mention that I did all of my testing back on Firefly (about a year
ago), and I haven't tried removing these parameters from my cluster to see
whether there is a performance degradation now that I'm running Hammer.

I guess there is a similar dataflow with RGW as with RBD behind an iSCSI
proxy server: both have a few RADOS clients that funnel the requests of many
HTTP/iSCSI clients.

Jake
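[For anyone who wants to try the settings quoted above: they can be added to
/etc/sysctl.conf (or, on most distros, a file under /etc/sysctl.d/) and loaded
without a reboot. This is a generic sysctl sketch, not taken from Jake's
message:

    # Apply the tunables from /etc/sysctl.conf without rebooting
    sysctl -p /etc/sysctl.conf

    # Spot-check that a couple of the values actually took effect
    sysctl net.core.rmem_max net.ipv4.tcp_rmem
]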
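[On the "rbd-enabled tgt" side, the proxy setup Jake describes is roughly a
tgt target whose backing store is an RBD image. A minimal targets.conf sketch,
assuming a tgt build with the rbd backing-store module compiled in; the IQN,
pool and image names are placeholders for illustration:

    # /etc/tgt/conf.d/rbd.conf -- hypothetical example, names are made up
    <target iqn.2015-07.com.example:vmware-datastore1>
        driver iscsi
        bs-type rbd
        # backing-store is <pool>/<image>
        backing-store rbd-pool/datastore1
    </target>

As Jake notes, tgt has no VAAI-style offloads, so operations like eager
zeroing from ESXi arrive at the cluster as plain sequential writes.]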