Hi,

I'm setting up a Debian GNU/Linux based cluster, currently with 4 nodes, each a PPro 200 :( but there may be more/other stuff coming :). Considering the costs, we settled on Netgear 311 ethernet cards, which are supported in 2.4.x kernels. There are patches for 2.2.x, but since 2.4 is here... By the way, I'm running Debian unstable on these.

Initially we have put 2 ethernet cards in each node, and today was spent getting bonding to work. Bonding is supported in late 2.2.x kernels and, of course, in 2.4.x, but it was a bit tricky to find the correct ifenslave.c to compile and use. Once I had found it (http://pdsf.nersc.gov/linux/), everything seemed to work as planned after doing

  ifconfig bond0 192.168.1.x netmask 255.255.255.0 up
  ./ifenslave bond0 eth0     (bond0 gets the MAC address from eth0)
  ./ifenslave bond0 eth1

But when I tested the setup by ftping a large file between two nodes, each configured as above (x=101 and x=103 respectively), messages of the following type were output repeatedly on the console:

  ethX ... Something wicked happened! 0YYY

where X was 0 or 1 and YYY was one of 500, 700, 740, 749, as far as I can tell.

The same thing happened when running NPtcp once the packet size went above a few kbytes, at speeds of approx. 50 Mbit/s.

I also tested the network cards eth0 to eth0 and eth1 to eth1 in normal mode (no bonding) with NPtcp, and both links asymptotically went up to some 89.7 Mbit/s. By the way, where did the last 10 Mbit/s go?

Anyone got ideas as to the nature/solution of this problem? I did locate the error string in drivers/net/natsemi.c, in the function netdev_error, but I don't know what to make of it.

Does anyone have experience of this with, for instance, the 3c905, which in my opinion is very stable? It is also about three times more expensive, which isn't that much for one or two cards, although I could imagine substantial savings for a large cluster. But if my hours are included ...

Regards,
Anders

PS Some detailed info:

From syslog, identifying the network cards (eth2 is for access from outside the dedicated networks):

Mar 1 21:30:53 beo101 kernel: http://www.scyld.com/network/natsemi.html
Mar 1 21:30:53 beo101 kernel: (unofficial 2.4.x kernel port, version 1.0.3, January 21, 2001 Jeff Garzik, Tjeerd Mulder)
Mar 1 21:30:53 beo101 kernel: eth0: NatSemi DP83815 at 0xc4800000, 00:02:e3:03:da:87, IRQ 12.
Mar 1 21:30:53 beo101 kernel: eth0: Transceiver status 0x7869 advertising 05e1.
Mar 1 21:30:53 beo101 kernel: eth1: NatSemi DP83815 at 0xc4802000, 00:02:e3:03:de:43, IRQ 10.
Mar 1 21:30:53 beo101 kernel: eth1: Transceiver status 0x7869 advertising 05e1.
Mar 1 21:30:53 beo101 kernel: eth2: NatSemi DP83815 at 0xc4804000, 00:02:e3:03:dc:2c, IRQ 11.
Mar 1 21:30:53 beo101 kernel: eth2: Transceiver status 0x7869 advertising 05e1.

Some lines of the wicked message (above those are the two lines where eth0 and eth1 are reported when ifenslave is run):

Mar 1 21:30:56 beo101 /usr/sbin/cron[189]: (CRON) STARTUP (fork ok)
Mar 1 21:35:26 beo101 kernel: eth0: Setting full-duplex based on negotiated link capability.
Mar 1 21:35:32 beo101 ntpd[182]: time reset -0.474569 s
Mar 1 21:35:32 beo101 ntpd[182]: kernel pll status change 41
Mar 1 21:35:32 beo101 ntpd[182]: synchronisation lost
Mar 1 21:35:37 beo101 kernel: eth1: Setting full-duplex based on negotiated link capability.
Mar 1 21:38:01 beo101 /USR/SBIN/CRON[211]: (mail) CMD ( if [ -x /usr/sbin/exim -a -f /etc/exim.conf ]; then /usr/sbin/exim -q >/dev/null 2>&1; fi)
Mar 1 21:39:49 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:04 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:08 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:08 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:12 beo101 last message repeated 2 times
Mar 1 21:40:12 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:13 beo101 last message repeated 2 times
Mar 1 21:40:15 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:16 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:18 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:19 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:19 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:20 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:20 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:21 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 last message repeated 3 times
Mar 1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0500.
Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar 1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0740.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0500.
Mar 1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0500.
Mar 1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.

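To get an overview of which status values occur on which card, the messages can be tallied per interface. The following is only a minimal sketch: it assumes the syslog format shown in the excerpt above, and it assumes the four digits are a hexadecimal status word (not confirmed here against the format string in natsemi.c). The names tally, set_bits and sample are made up for this illustration; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the driver message as it appears in the syslog excerpt.
WICKED = re.compile(
    r"kernel: (eth\d+): Something Wicked happened! ([0-9a-fA-F]{4})\."
)
# syslogd collapses identical consecutive lines into this form.
REPEAT = re.compile(r"last message repeated (\d+) times")


def tally(lines):
    """Count messages per (interface, status), honouring repeat lines."""
    counts = Counter()
    last = None  # key of the preceding driver message, if any
    for line in lines:
        match = WICKED.search(line)
        if match:
            last = (match.group(1), match.group(2))
            counts[last] += 1
            continue
        repeat = REPEAT.search(line)
        if repeat and last is not None:
            counts[last] += int(repeat.group(1))
        else:
            last = None  # an unrelated line breaks the repeat chain
    return counts


def set_bits(status_hex):
    """Positions of the bits set in the status word (assumed to be hex)."""
    value = int(status_hex, 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


sample = [
    "Mar 1 21:40:08 beo101 kernel: eth0: Something Wicked happened! 0700.",
    "Mar 1 21:40:12 beo101 last message repeated 2 times",
    "Mar 1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0500.",
    "Mar 1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0740.",
]

if __name__ == "__main__":
    for (iface, status), count in sorted(tally(sample).items()):
        print(iface, status, count, "bits set:", set_bits(status))
```

The bit positions are plain arithmetic on the logged value; what each bit means has to be looked up in the driver source or the chip documentation.
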
The result of ifconfig:

bond0     Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1834429 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:986886789 (941.1 Mb)

eth0      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:907798 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915439 errors:1776 dropped:0 overruns:1776 carrier:1776
          collisions:0 txqueuelen:100
          RX bytes:435552233 (415.3 Mb)  TX bytes:491795214 (469.0 Mb)
          Interrupt:12

eth1      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:907768 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915466 errors:1748 dropped:0 overruns:1748 carrier:1748
          collisions:0 txqueuelen:100
          RX bytes:434992308 (414.8 Mb)  TX bytes:489766183 (467.0 Mb)
          Interrupt:10 Base address:0x2000

eth2      Link encap:Ethernet  HWaddr 00:02:E3:03:DC:2C
          inet addr:150.227.64.210  Bcast:150.227.64.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:13122 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1182 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:1032660 (1008.4 Kb)  TX bytes:943713 (921.5 Kb)
          Interrupt:11 Base address:0x4000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3904  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:552 (552.0 b)  TX bytes:552 (552.0 b)
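
A quick sanity check on the counters above, as a sketch rather than a diagnosis: it computes the TX error ratio for each slave and compares the bond0 TX packet count with the sum of the slaves' TX packets and TX errors. The numbers are copied from the ifconfig output; the dictionary layout and the helper name error_percent are made up for this illustration, and reading the match as "bond0 counts every frame it hands to a slave, whether or not the slave sends it" is only an assumption.

```python
# TX counters copied from the ifconfig output above.
slaves = {
    "eth0": {"tx_packets": 915439, "tx_errors": 1776},
    "eth1": {"tx_packets": 915466, "tx_errors": 1748},
}
bond0_tx_packets = 1834429


def error_percent(counters):
    """TX errors as a percentage of successfully transmitted packets."""
    return 100.0 * counters["tx_errors"] / counters["tx_packets"]


# Frames accounted for on the slaves: transmitted plus failed.
slave_total = sum(
    c["tx_packets"] + c["tx_errors"] for c in slaves.values()
)

if __name__ == "__main__":
    for name, counters in sorted(slaves.items()):
        print(f"{name}: {error_percent(counters):.3f}% TX errors")
    print("slave packets + errors:", slave_total)
    print("bond0 TX packets:      ", bond0_tx_packets)
    print("difference:            ", bond0_tx_packets - slave_total)
```

On these figures the two totals agree exactly, and on both slaves the errors, overruns and carrier counters carry the same value.
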