Sorry for replying to this so late, but I have been away.  Reply below...

On Wed, 1 Oct 2008 11:58:30 -0400, "Aurélien Bouteiller"
<boute...@eecs.utk.edu> said:

> If you have several network cards in your system, it can sometimes get
> the endpoints confused, especially if you don't have the same number of
> cards or don't use the same subnet for all "eth0, eth1".  You should
> try to restrict Open MPI to use only one of the available networks by
> using the --mca btl_tcp_if_include ethx parameter to mpirun, where x is
> the network interface that is always connected to the same logical and
> physical network on your machines.
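For reference, that suggestion amounts to an mpirun invocation along these
lines (eth0 and the executable name are only placeholders; substitute
whichever interface is consistently wired on the nodes):

    mpirun --bynode -np 2 --mca btl_tcp_if_include eth0 ./third_party_app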
I was pretty sure this wasn't the problem, since basically all the nodes
have only one interface configured, but I had the user try the
--mca btl_tcp_if_include parameter anyway.  The same result / crash
occurred.  (A couple of general debugging notes on the errno values
appear at the end of this message.)

> Aurelien
>
> On 1 Oct 08, at 11:47, V. Ram wrote:
>
> > I wrote earlier about one of my users running a third-party Fortran
> > code on 32-bit x86 machines, using OMPI 1.2.7, that is having some
> > odd crash behavior.
> >
> > Our cluster's nodes all have 2 single-core processors.  If this code
> > is run on 2 processors on 1 node, it runs seemingly fine.  However,
> > if the job runs on 1 processor on each of 2 nodes (e.g., mpirun
> > --bynode), then it crashes and gives messages like:
> >
> > [node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > [node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed with errno=110
> > mca_btl_tcp_frag_recv: readv failed with errno=104
> >
> > Essentially, if any network communication is involved, the job
> > crashes in this form.
> >
> > I do have another user that runs his own MPI code on 10+ of these
> > processors for days at a time without issue, so I don't think it's
> > hardware.
> >
> > The original code also runs fine across many networked nodes if the
> > architecture is x86-64 (also running OMPI 1.2.7).
> >
> > We have also tried different Fortran compilers (both PathScale and
> > gfortran) and keep getting these crashes.
> >
> > Are there any suggestions on how to figure out if it's a problem with
> > the code or the OMPI installation/software on the system?  We have
> > tried "--debug-daemons" with no new/interesting information being
> > revealed.  Is there a way to trap segfault messages or more detailed
> > MPI transaction information or anything else that could help diagnose
> > this?
> >
> > Thanks.
> >
> > --
> > V. Ram
> > v_r_...@fastmail.fm

--
V. Ram
v_r_...@fastmail.fm

--
http://www.fastmail.fm - A no graphics, no pop-ups email service
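A note for anyone reading this thread later: on Linux, errno=110 is
ETIMEDOUT and errno=104 is ECONNRESET, i.e. the readv errors above mean
the TCP connection between the ranks timed out or was reset underneath
the receive.  One way to gather more detail is to allow core dumps and
turn up the BTL framework's verbosity; a rough sketch (the interface and
application names are placeholders, and the exact verbosity output varies
between Open MPI versions):

    # allow core files so a segfaulting rank leaves something to inspect
    # (set on the launching node; remote nodes may need this in their
    # shell startup files)
    ulimit -c unlimited

    # ask the BTL framework for verbose connection/progress output
    mpirun --bynode -np 2 \
           --mca btl_base_verbose 30 \
           --mca btl_tcp_if_include eth0 \
           ./third_party_app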