On Wed, Jul 29, 2015 at 11:23 AM, Mark Nelson <mnel...@redhat.com> wrote:

> On 07/29/2015 10:13 AM, Jake Young wrote:
>
>> On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic
>> <frederic.sch...@cea.fr <mailto:frederic.sch...@cea.fr>> wrote:
>>  >
>>  > Hi again,
>>  >
>>  > So I have tried:
>>  > - changing the CPU frequency: either 1.6 GHz or 2.4 GHz on all cores
>>  > - changing the memory configuration from "advanced ECC mode" to
>> "performance mode", boosting the memory bandwidth from 35 GB/s to 40 GB/s
>>  > - plugging in a second 10 Gb/s link and setting up a ceph internal network
>>  > - trying various "tuned-adm" profiles such as "throughput-performance"
>>  >
>>  > None of this changed anything.
>>  >
>>  > If
>>  > - the CPUs are not maxed out, and lowering the frequency doesn't
>> change a thing
>>  > - the network is not maxed out
>>  > - the memory doesn't seem to have an impact
>>  > - network interrupts are spread across all 8 CPU cores and the receive
>> queues are OK
>>  > - the disks are not used to their full potential (iostat shows my dd
>> commands produce far more tps than the 4MB ceph transfers...)
>>  >
>>  > where can I possibly find a bottleneck?
>>  >
>>  > I'm /(almost) out of ideas/ ... :'(
>>  >
>>  > Regards
>>  >
>>  >
>> Frederic,
>>
>> I was trying to optimize my ceph cluster as well and looked at all of
>> the same things you described; none of them helped my performance
>> noticeably.
>>
>> The following network kernel tuning settings did help me significantly.
>>
>> This is my /etc/sysctl.conf file on all of my hosts: ceph mons, ceph
>> osds, and any client that connects to my ceph cluster.
>>
>>          # Increase Linux autotuning TCP buffer limits
>>          # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104)
>> for 10GE
>>          # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
>>          #net.core.rmem_max = 56623104
>>          #net.core.wmem_max = 56623104
>>          # Use 128M buffers
>>          net.core.rmem_max = 134217728
>>          net.core.wmem_max = 134217728
>>          net.core.rmem_default = 67108864
>>          net.core.wmem_default = 67108864
>>          net.core.optmem_max = 134217728
>>          net.ipv4.tcp_rmem = 4096 87380 67108864
>>          net.ipv4.tcp_wmem = 4096 65536 67108864
>>
>>          # Make room for more TIME_WAIT sockets due to more clients,
>>          # and allow them to be reused if we run out of sockets
>>          # Also increase the max packet backlog
>>          net.core.somaxconn = 1024
>>          # Increase the length of the processor input queue
>>          net.core.netdev_max_backlog = 250000
>>          net.ipv4.tcp_max_syn_backlog = 30000
>>          net.ipv4.tcp_max_tw_buckets = 2000000
>>          net.ipv4.tcp_tw_reuse = 1
>>          net.ipv4.tcp_tw_recycle = 1
>>          net.ipv4.tcp_fin_timeout = 10
>>
>>          # Disable TCP slow start on idle connections
>>          net.ipv4.tcp_slow_start_after_idle = 0
>>
>>          # If your servers talk UDP, also up these limits
>>          net.ipv4.udp_rmem_min = 8192
>>          net.ipv4.udp_wmem_min = 8192
>>
>>          # Disable source routing and redirects
>>          net.ipv4.conf.all.send_redirects = 0
>>          net.ipv4.conf.all.accept_redirects = 0
>>          net.ipv4.conf.all.accept_source_route = 0
>>
>>          # Recommended when jumbo frames are enabled
>>          net.ipv4.tcp_mtu_probing = 1
>>
>> I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything
>> else.
>>
>> Let me know if that helps.
>>
>
> Hi Jake,
>
> Could you talk a little bit about what scenarios you've seen tuning this
> help?  I noticed improvement in RGW performance in some cases with similar
> TCP tunings, but it would be good to understand what other folks are seeing
> and in what situations.
>
>
>> Jake
>>
>>

Hey Mark,

I'm only using RBD.  My clients are all VMware, so I have a few iSCSI proxy
VMs (using rbd-enabled tgt).  My workload is typically light random
read/write, except for periodic eager zeroing of multi-terabyte volumes.
Since tgt has no VAAI support, that turns into heavy sequential writing.
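
For reference, each LUN export in tgt looks roughly like the sketch below
(the IQN and the pool/image names are placeholders, not my real config),
and it needs a tgt build with the rbd backing store compiled in:

        # /etc/tgt/targets.conf -- sketch only; IQN, pool and image are made up
        <target iqn.2015-07.com.example:vmware-lun1>
            driver iscsi
            # Ceph backing store (tgt must be built with rbd support)
            bs-type rbd
            backing-store rbd/vmware-lun1
        </target>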

I found the network tuning above helped to "open up" the connection from a
single iSCSI proxy VM to the cluster.
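
In case it helps, applying and verifying the settings is just the normal
sysctl workflow, nothing ceph-specific:

        # reload /etc/sysctl.conf without rebooting
        sysctl -p
        # spot-check that the values actually took effect
        sysctl net.core.rmem_max net.ipv4.tcp_rmem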

Note that my osd nodes have both a public network interface and a dedicated
cluster (private) network interface, both 40G.  I believe the network tuning
also improves the performance of the cluster network (which carries the
replication traffic): I initially applied the kernel tuning only to the osd
nodes and saw a performance improvement before I rolled it out to the iSCSI
proxy VMs.
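
For completeness, the public/cluster split is just the usual ceph.conf
settings; the subnets below are made-up examples, not my real ones:

        [global]
            # example subnets only -- both live on 40G interfaces here
            public network  = 192.168.10.0/24
            cluster network = 192.168.20.0/24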

I should mention that I did all of my testing back on Firefly (about a year
ago), and I haven't tried removing these parameters from my cluster to see
whether performance degrades now that I'm running Hammer.
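
If I ever re-test on Hammer, it would probably just be a before/after rados
bench run with and without the sysctl settings, something like:

        # 60s sequential write test; keep the objects for the read pass
        rados bench -p rbd 60 write --no-cleanup
        # sequential read pass against the objects written above
        rados bench -p rbd 60 seq
        # remove the benchmark objects when done
        rados -p rbd cleanup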

I'd guess RGW has a similar dataflow to RBD behind an iSCSI proxy server:
both have a few RADOS clients that funnel the requests of many HTTP/iSCSI
clients.

Jake
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
