Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-27 Thread Jeff Squyres
On Oct 24, 2008, at 12:10 PM, V. Ram wrote:
> Resuscitating this thread... Well, we spent some time testing the various options, and Leonardo's suggestion seems to work! We disabled TCP segmentation offload (TSO) on the e1000 NICs using "ethtool -K eth tso off" and this type of crash no longer happens.

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-24 Thread V. Ram
Resuscitating this thread... Well, we spent some time testing the various options, and Leonardo's suggestion seems to work! We disabled TCP segmentation offload (TSO) on the e1000 NICs using "ethtool -K eth tso off" and this type of crash no longer happens. I hope this message can help anyone else exper
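For readers who want to try the same workaround, here is a minimal sketch of the two ethtool invocations involved: `-k` (lowercase) lists the current offload settings, and `-K` (uppercase) changes them. The helper name `tso_commands` and the interface name `eth0` are assumptions for illustration; the message above only shows "eth". The helper prints the commands rather than running them, since changing the setting needs root and is normally not preserved across a reboot.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the ethtool commands for one interface.
# The interface name passed in is an assumption; substitute your own.
tso_commands() {
    iface="$1"
    # List current offload settings; look for the tcp-segmentation-offload line.
    echo "ethtool -k $iface"
    # Turn TCP segmentation offload off on that interface (requires root).
    echo "ethtool -K $iface tso off"
}

tso_commands eth0
```

Run the printed commands by hand on each node, then re-run the first one to confirm that tcp-segmentation-offload reports "off".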

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread George Bosilca
On Oct 10, 2008, at 12:42 PM, V. Ram wrote:
> Can anyone else suggest why the code might be crashing when running over ethernet and not over shared memory? Any suggestions on how to debug this or interpret the error message issued from btl_tcp_frag.c?
Unfortunately this is a standard error

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
Leonardo, These nodes are all using Intel e1000 chips. As the nodes are AMD K7-based, these are the older chips, not the new ones with all the EEPROM issues with the newer kernel. The kernel in use is from the 2.6.22 family, and the e1000 driver is the one shipped with the kernel. I am running

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
Sorry for replying to this so late, but I have been away. Reply below...
On Wed, 1 Oct 2008 11:58:30 -0400, "Aurélien Bouteiller" said:
> If you have several network cards in your system, it can sometimes get
> the endpoints confused. Especially if you don't have the same number
> of cards or

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-01 Thread Leonardo Fialho
Ram, What is the name and version of the kernel module for your NIC? I have experienced something similar with my tg3 module. The error that appeared for me was different: [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113) I solved it changi

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-01 Thread Aurélien Bouteiller
If you have several network cards in your system, it can sometimes get the endpoints confused. Especially if you don't have the same number of cards or don't use the same subnet for all "eth0, eth1". You should try to restrict Open MPI to use only one of the available networks by using the -
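The message is cut off before it names the option, so the following is an assumption and not a reconstruction of what was written. One MCA parameter that Open MPI's TCP BTL does provide for this purpose is `btl_tcp_if_include`, which limits TCP traffic to the listed interfaces. The helper name `mpirun_one_iface`, the interface `eth0`, the process count, and `./my_app` are all hypothetical; the helper only prints the command line.

```shell
#!/bin/sh
# Assumption: btl_tcp_if_include is used for illustration; the truncated
# message does not show which option was actually recommended.
# The interface name, process count, and application path are hypothetical.
mpirun_one_iface() {
    iface="$1"
    nprocs="$2"
    app="$3"
    # Print (not run) an mpirun line that limits the TCP BTL to one interface.
    echo "mpirun --mca btl_tcp_if_include $iface -np $nprocs $app"
}

mpirun_one_iface eth0 4 ./my_app
```

Pick an interface that exists under the same name and on the same subnet on every node, otherwise the restriction itself can prevent the processes from connecting.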