Hello! On Fri, Oct 30, 2015 at 09:30:40PM +0000, moloney wrote:
> Hi,
>
> I recently got my first Ceph cluster up and running and have been doing
> some stress tests. I quickly found that during sequential write
> benchmarks the throughput would often drop to zero. Initially I saw this
> inside QEMU virtual machines, but I can also reproduce the issue with
> "rados bench" within 5-10 minutes of sustained writes. If left alone the
> writes will eventually start going again, but it takes quite a while (at
> least a couple of minutes). If I stop and restart the benchmark the
> write throughput will immediately be where it is supposed to be.
>
> I have convinced myself it is not a network hardware issue. I can load
> up the network with a bunch of parallel iperf benchmarks and it keeps
> chugging along happily. When the issue occurs with Ceph I don't see any
> indications of network issues (e.g. dropped packets). Adding additional
> network load during the rados bench (using iperf) doesn't seem to
> trigger the issue any faster or more often.
>
> I have also convinced myself it isn't an issue with a journal getting
> full or an OSD being too busy. The amount of data being written before
> the problem occurs is much larger than the total journal capacity.
> Watching the load on the OSD servers with top/iostat I don't see
> anything being overloaded; rather, I see the load everywhere drop to
> essentially zero when the writes stall. Before the writes stall the load
> is well distributed with no visible hot spots. The OSDs and hosts that
> report slow requests are random, so I don't think it is a failing disk
> or server. I don't see anything interesting going on in the logs so far
> (I am just about to do some tests with Ceph's debug logging cranked up).
>
> The cluster specs are:
>
> OS: Ubuntu 14.04 with 3.16 kernel
> Ceph: 9.1.0
> OSD filesystem: XFS
> Replication: 3x
> Two racks with IPoIB network
> 10Gbps Ethernet between racks
> 8 OSD servers with:
>   * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
>   * 128GB RAM
>   * 12 6TB Seagate drives (connected to an LSI 2208 chip in JBOD mode)
>   * Two 400GB Intel P3600 NVMe drives (OS on a RAID1 partition, plus 6
>     OSD journal partitions on each)
>   * Mellanox ConnectX-3 NIC (for both InfiniBand and 10Gbps Ethernet)
> 3 mons co-located on OSD servers
>
> Any advice is greatly appreciated. I am planning to try this with
> Hammer too.

I had the same trouble with Hammer, Ubuntu 14.04 and the 3.19 kernel on a
Supermicro X9DRL-3F/iF with Intel 82599ES NICs, bonded into one link
across 2 different Cisco Nexus 5020 switches. It was finally fixed by
dropping the MTU from jumbo frames (above 1500) back down to 1500. It was
working with an MTU of 9000 and the following sysctls, but after several
weeks the trouble repeated and I had to drop the MTU down again (rough
command sketches follow after my signature):

net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 150000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
vm.swappiness = 1
net.ipv4.tcp_moderate_rcvbuf = 0

> Thanks,
> Brendan

--
WBR, Max A. Krasilnikov
ColoCall Data Center
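
P.S. For anyone trying to reproduce this, a minimal sustained-write run
along the lines Brendan describes might look like the sketch below (the
pool name "bench" and the PG count are placeholders, not taken from his
setup):

    # Create a throwaway pool and write to it for 10+ minutes,
    # since the stalls reportedly show up within 5-10 minutes:
    ceph osd pool create bench 2048 2048
    rados bench -p bench 600 write -b 4194304 -t 16 --no-cleanup
    # -b 4194304: 4 MiB objects (the default), -t 16: 16 writes in flight.
    # Remove the pool afterwards:
    ceph osd pool delete bench bench --yes-i-really-really-mean-it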
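
While the benchmark runs, the interface and kernel TCP counters should
show whether anything is actually being dropped or retransmitted during a
stall (replace eth0 with your cluster-facing interface, e.g. ib0 for
IPoIB):

    # Per-interface RX/TX error and drop counters:
    ip -s link show dev eth0
    # Kernel-wide TCP retransmission statistics:
    netstat -s | grep -i retrans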
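
The MTU change itself can be tried on the fly first and made persistent
only once it proves out; the interface name and addresses are again
placeholders:

    # Takes effect immediately, lost on reboot:
    ip link set dev eth0 mtu 1500
    # Persistent on Ubuntu 14.04: add an "mtu" line to the interface
    # stanza in /etc/network/interfaces, e.g.:
    #   iface eth0 inet static
    #       address 10.0.0.10
    #       netmask 255.255.255.0
    #       mtu 1500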
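
And the sysctls above can go into a drop-in file so they survive reboots;
the filename below is arbitrary:

    # /etc/sysctl.d/90-ceph-net.conf -- one "key = value" per line, e.g.:
    #   net.ipv4.tcp_congestion_control = htcp
    #   net.ipv4.tcp_mtu_probing = 1
    # Load it without rebooting:
    sysctl -p /etc/sysctl.d/90-ceph-net.conf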