On 10/9/2013 5:51 AM, Muhammad Yousuf Khan wrote:
> [cut]...........
>
>> What workload do you have that requires 400 MB/s of parallel stream TCP
>> throughput at the server? NFS, FTP, iSCSI? If this is a business
>> requirement and you actually need this much bandwidth to/from one
>> server, you will achieve far better results putting a 10GbE card in the
>> server and a 10GbE uplink module in your switch. Yes, this costs more
>> money, but the benefit is that all client hosts get full GbE bandwidth
>> to/from the server, all the time, in both directions. You'll never
>> achieve that with the Linux bonding driver.
>
> I appreciate your detailed email. It clears up a lot of the confusion in
> my mind.
>
> The reason for increasing bandwidth is to test clustering/VM hosting on
> NFS and VM backups. My company is about to host its product inside our
> foreign office premises and I will be maintaining those servers
> remotely, therefore I need to consider high availability of our service,
> and that's why I'm trying to test different technologies that can
> fulfill our requirement.
>
> Specifically, I am testing Ceph clustering for hosting purposes and for
> backing up my VMs. As you know, VMs are huge and moving them around on a
> 1Gb crossover point-to-point link takes time, so I thought I could add
> some bandwidth and use link aggregation to avoid a single point of
> failure.
>
> I agree with you on buying 10GbE NICs, but unfortunately, as I am
> testing this stuff very far away from the US, these cards are not easily
> available in my country and are thus unnecessarily expensive.
Are dual and quad port Intel NICs available in your country?

> If you still have any advice for a scenario such as mine, I will be glad
> to have it.

Before a person makes a first attempt at using the Linux bonding driver,
s/he typically thinks that it will magically turn 2/4 links of Ethernet
into one link that is 2/4x as fast. This is simply not the case, and is
physically impossible. The 802.3xx specifications neither enable nor
allow this, and TCP is not designed for it. All of the bonding modes are
designed first for fault tolerance and second for increasing aggregate
throughput, and there only from one host with bonded interfaces to many
hosts with single interfaces.

There is only one Linux bonding driver mode that can reliably yield
greater than 1 link of send/receive throughput between two hosts, and
that is balance-rr. Getting it to work without a lot of headaches
requires a specific switch topology, and its throughput will not scale
with the number of links. The reason is that you're splitting a single
TCP session into 2 or 4 streams of Ethernet frames, each carrying part of
that one TCP stream. This can break many of the TCP stack optimizations
such as window scaling, etc. You may also get out-of-order packets,
depending on the NICs used and how much buffering they do before
generating an interrupt. Reordering of packets at the receiver decreases
throughput, so each link will carry less than it would running
standalone. Most of the above is covered in the kernel bonding
documentation.

The primary driving force you mentioned behind needing more bandwidth is
backing up VM images. If that is the case, increase the bandwidth only
where it is needed. Put a 4 port Intel NIC in the NFS server and a 4 port
Intel NIC in the backup server, and use 4 crossover cables. Configure
balance-rr and tweak bonding and TCP stack settings as necessary. Use a
different IP subnet for this bonded link and modify the routing table as
required. If you use the same subnet as regular traffic you must
configure source based routing on these two hosts, and that is a big
PITA. (A rough example configuration is sketched a bit further down.)
Once you get this all set up correctly, it should yield somewhere between
1 and 3.5 Gb/s of throughput for a single TCP stream and/or multiple TCP
streams between the NFS and backup servers. No virtual machine host
should require more than 1 Gb/s of throughput to the NFS server, so this
is the most cost effective way to increase backup throughput and decrease
backup time.

WRT Ceph, AIUI, this object-based storage engine does provide a POSIX
filesystem interface. How complete the POSIX implementation is I do not
know; I get the impression it's not entirely complete. That said, Ceph is
supposed to "dynamically distribute data" across the storage nodes, which
is extremely vague. If it actually spreads the blocks of a file across
many nodes, or stores a complete copy of each file on every node, then
theoretically it should provide more than 1 link of throughput to a
client possessing properly bonded interfaces, as the file read is served
over many distinct TCP streams from multiple host interfaces. So if you
store your VM images on a Ceph filesystem you will need a bonded
interface on the backup server using mode balance-alb. With balance-alb
properly configured and working on the backup server, you will need at
minimum 4 Ceph storage nodes in order to approach 400 MB/s of file
throughput to the backup application.
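To make the balance-rr crossover setup above concrete, here is roughly
what the ifupdown configuration would look like on the NFS server,
assuming the ifenslave package is installed. The interface names and the
192.168.100.0/24 subnet are placeholders; substitute whatever spare ports
and unused subnet you actually have.

  # /etc/network/interfaces fragment on the NFS server (placeholders)
  auto bond0
  iface bond0 inet static
      address 192.168.100.1
      netmask 255.255.255.0
      bond-slaves eth2 eth3 eth4 eth5
      bond-mode balance-rr
      bond-miimon 100

The backup server gets the mirror image of this stanza with address
192.168.100.2. Because the bond lives on its own subnet, the kernel
installs the connected route for you, so traffic between the two bond
addresses uses the 4-cable bundle while everything else stays on the
normal GbE interface, and no source based routing is needed. For the
Ceph/balance-alb case you would change bond-mode to balance-alb and plug
the four ports into the switch instead of using crossover cables.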
Personally I do not like non-deterministic throughput in a storage
application, and all distributed filesystems exhibit non-deterministic
throughput, especially so with balance-alb bonding on the backup server.
Thus, you may want to consider another approach: build an NFS
active/stand-by heartbeat cluster using two identical server boxes and
disks, active/active DRBD mirroring, and GFS2 as the cluster filesystem
atop the DRBD device. In this architecture you would install a quad port
Intel NIC in each NFS server and one in the backup server, and connect
all 12 ports to a dedicated switch. You configure balance-rr bonding on
each of the 3 machines, again using a separate IP network from the "user"
network, and again configuring the routing table accordingly. (A
bare-bones DRBD resource sketch is in the P.S. at the end of this
message.)

In this scenario, assuming you do not intend to use NFS v4 clustering,
you'd use one server to export NFS shares to the VM cluster nodes. This
is your 'active' NFS server. The stand-by NFS server would, during normal
operation, export the shares only to the backup server. Since both NFS
servers have identical disk data, thanks to DRBD and GFS2, the backup
server can pull the files from the stand-by NFS server at close to
400 MB/s without impacting production NFS traffic to the VM hosts. If the
active server goes down, the stand-by server will execute scripts to take
over the role of the active/primary server, so you have full redundancy.
These scripts exist and are not something you must create from scratch.
This clustered NFS configuration with DRBD and GFS2 is a standard RHEL
configuration.

With Ceph, or Gluster, or any distributed storage, backup will always
impact production throughput. Not from a network standpoint, since you
could add a dedicated network segment to the Ceph storage nodes to
mitigate that; the problem is disk IOPS. With Ceph, your production VMs
will be hitting the same disks the backup server is hitting.

So after all of that, the takeaway here is that bonding is not a general
purpose solution but a very application-specific one. It has a very
limited, narrow use case. You must precisely match the number of ports
and bonding mode to the target application/architecture. Linux bonding
will NOT allow one to arbitrarily increase application bandwidth on all
hosts in a subnet simply by slapping in extra ports and turning on a
bonding mode. This should be clear to anyone who opens the kernel bonding
driver how-to document I linked. It's 42 pages long. If bonding were
general purpose, easy to configure, and provided anywhere close to the
linear speedup lay people assume, then that doc would be 2-3 pages,
not 42.

-- 
Stan
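P.S. If you go the DRBD/GFS2 route, the DRBD half of it is only a short
resource file. A bare-bones sketch, with the hostnames, backing disks,
and addresses as placeholders you would change to match your boxes:

  # /etc/drbd.d/nfsdata.res, identical on both NFS servers (placeholders)
  resource nfsdata {
      protocol C;                # synchronous replication
      device    /dev/drbd0;
      disk      /dev/sdb1;       # backing partition on each box
      meta-disk internal;
      net {
          allow-two-primaries;   # needed for active/active with GFS2
      }
      on nfs1 {
          address 192.168.200.1:7789;
      }
      on nfs2 {
          address 192.168.200.2:7789;
      }
  }

You still need the usual drbdadm create-md / up / primary steps, mkfs.gfs2
on /dev/drbd0, and the heartbeat/pacemaker glue on top of this, but all of
that is stock Debian packaging.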