i apologise for not being very clear. (it's been a long day.) we have an 8-node cluster. each node is a modest dell 910 or somesuch: 128GB of memory and 32 cores. each node also has eight 1Gbps NICs; most are rarely used, but two are used heavily. technically, those two occupy four interfaces, since they are two bonded pairs (active-passive). each bonded pair sits on one of two VPNs: one VPN goes out to the general intranet, the other is a VPN local to the cluster.
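(an aside on the bonding, in case it's relevant: this is roughly how i verify which NIC is actually carrying the traffic on each bonded pair -- a small python sketch that reads /proc/net/bonding; the exact field labels come from the bonding driver and may differ a bit between kernel versions, so treat it as a sketch:)

    #!/usr/bin/env python
    # show the bonding mode and currently active slave for each bonded
    # pair by reading /proc/net/bonding/*.  the field labels are whatever
    # the bonding driver exposes and may vary slightly between kernels.
    import glob

    for path in sorted(glob.glob('/proc/net/bonding/*')):
        mode = active = 'unknown'
        with open(path) as f:
            for line in f:
                if line.startswith('Bonding Mode:'):
                    mode = line.split(':', 1)[1].strip()
                elif line.startswith('Currently Active Slave:'):
                    active = line.split(':', 1)[1].strip()
        print('%s: mode = %s, active slave = %s'
              % (path.split('/')[-1], mode, active))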
the local VPN is pounded on hard; i estimate 800 Mbps during peak hours. i have noticed no performance issues with the traffic on this VPN. it's all zeromq message traffic, and i monitor it carefully for latency. the messages i send are typically 100+ bytes, and zeromq normally bundles several together for transmission.

on the external VPN, we have 80 inbound feeds (10 per node), typically around 23 Mbps each. what we notice is that these socket connections occasionally go dry, that is, data stops coming. using tcpdump and sniffers, we determined that this is because the server starts advertising a window size of 0 back to the source systems. in fact, a sniffer showed the window size starting around 26K and then quite quickly dropping all the way down to 10, 8, 1 and then zero. at this point, the processes receiving data over those sockets exit and get restarted a couple of minutes later. by then the condition has cleared, the window size goes back up to 26K, and all is well for 6-10 minutes, until some other group of sockets fails. strangely, not every socket on a node fails; sometimes most, sometimes just a few, rarely all.

i take a window size of zero as definitive evidence of tcp/ip stack congestion, but i freely admit to not knowing (nor wanting to know) much about networking, which is why i ask for advice on this group. (i've put a sketch of what i plan to poll in a p.s. below my sig.)

thanks
	andrew

On Oct 17, 2012, at 8:16 PM, Nathan Hruby wrote:

> On Wed, Oct 17, 2012 at 5:22 PM, Andrew Hume <and...@research.att.com> wrote:
>> screwed by linux again. sigh.
>>
>> so apparently i am overloading my pathetic linux system with too much tcp/ip
>> traffic.
>> is there any way to detect this while (or before or after) it is happening?
>> of course no error messages are emitted.
>> but might there be some other thing buried away somewhere, like /proc?
>
> If there are no messages emitted, how do you know it's overloaded with
> [network] traffic?
>
> -n
>
> --
> -------------------------------------------
> nathan hruby <nhr...@gmail.com>
> metaphysically wrinkle-free
> -------------------------------------------

-----------------------
Andrew Hume
623-551-2845 (VO and best)
973-236-2014 (NJ)
and...@research.att.com
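p.s. here is roughly what i plan to poll instead of running a sniffer all day. as far as i understand it, a zero window just means that socket's receive buffer has filled because the reading process isn't draining it fast enough, and the kernel does keep score of that sort of thing in /proc/net/netstat (netstat -s prints the same counters in words, and ss -tmi shows per-socket Recv-Q and buffer sizes). on the wire, filtering on tcp[14:2] = 0 in tcpdump picks out the zero-window advertisements directly (rst packets also carry a zero window, so ignore those). below is a minimal python sketch that watches the prune/backlog counters; the counter names vary a little by kernel version, so treat the list as a guess and drop any that aren't present:

    #!/usr/bin/env python
    # poll /proc/net/netstat and report when counters that indicate
    # receive-buffer pressure are climbing.  the WATCH list is a guess;
    # counters not present on this kernel are silently skipped.
    import time

    WATCH = ['PruneCalled', 'RcvPruned', 'OfoPruned', 'TCPRcvCollapsed',
             'TCPBacklogDrop', 'ListenOverflows', 'ListenDrops']

    def snapshot():
        vals = {}
        with open('/proc/net/netstat') as f:
            lines = f.readlines()
        # lines come in pairs: a header line of counter names followed
        # by a line of values, for each of TcpExt and IpExt.
        for i in range(0, len(lines) - 1, 2):
            names = lines[i].split()[1:]
            nums = lines[i + 1].split()[1:]
            for name, num in zip(names, nums):
                vals[name] = int(num)
        return vals

    prev = snapshot()
    while True:
        time.sleep(10)
        cur = snapshot()
        for name in WATCH:
            if name in cur and cur[name] > prev.get(name, 0):
                print('%s %s +%d (total %d)'
                      % (time.strftime('%H:%M:%S'), name,
                         cur[name] - prev.get(name, 0), cur[name]))
        prev = cur

if those counters climb in step with the zero-window episodes, that points at the per-socket receive buffers (SO_RCVBUF in the feed readers, net.ipv4.tcp_rmem) rather than the NICs, but that's a separate mail.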