Sorry for replying to this so late, but I have been away.  Reply
below...

On Wed, 1 Oct 2008 11:58:30 -0400, "Aurélien Bouteiller"
<boute...@eecs.utk.edu> said:
> If you have several network cards in your system, it can sometimes get
> the endpoints confused, especially if you don't have the same number
> of cards or don't use the same subnet for all of them (eth0, eth1, ...).
> You should try to restrict Open MPI to only one of the available
> networks by passing the --mca btl_tcp_if_include ethx parameter to
> mpirun, where ethx is the interface that is always connected to the
> same logical and physical network on every machine.

I was pretty sure this wasn't the problem, since basically all of the
nodes have only one interface configured, but I had the user try the
--mca btl_tcp_if_include parameter anyway.  The same crash occurred.
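
For reference, the failing run was launched along these lines (eth0 is
the interface on our nodes; the process count, host list, and
executable name here are placeholders):

    mpirun --bynode -np 2 --host node3,node4 \
        --mca btl_tcp_if_include eth0 ./program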

> 
> Aurelien
> 
> On 1 Oct 2008, at 11:47, V. Ram wrote:
> 
> > I wrote earlier about one of my users running a third-party Fortran
> > code on 32-bit x86 machines, using OMPI 1.2.7, that is having some
> > odd crash behavior.
> >
> > Our cluster's nodes all have 2 single-core processors.  If this code is
> > run on 2 processors on 1 node, it runs seemingly fine.  However, if the
> > job runs on 1 processor on each of 2 nodes (e.g., mpirun --bynode), then
> > it crashes and gives messages like:
> >
> > [node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > [node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed with errno=110
> > mca_btl_tcp_frag_recv: readv failed with errno=104
> >
> > Essentially, if any network communication is involved, the job crashes
> > in this form.
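
A note on the log above, since the raw errno numbers are cryptic: on
Linux/x86, errno 110 is ETIMEDOUT and errno 104 is ECONNRESET.  A
minimal C check confirms the mapping:

    /* print the messages behind the errno values in the log */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        printf("110: %s\n", strerror(110)); /* Connection timed out */
        printf("104: %s\n", strerror(104)); /* Connection reset by peer */
        return 0;
    }

In other words, the TCP connections between the ranks are timing out
or being reset by the peer.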
> >
> > I do have another user who runs his own MPI code on 10+ of these
> > processors for days at a time without issue, so I don't think it's
> > hardware.
> >
> > The original code also runs fine across many networked nodes if the
> > architecture is x86-64 (also running OMPI 1.2.7).
> >
> > We have also tried different Fortran compilers (both PathScale and
> > gfortran) and keep getting these crashes.
> >
> > Are there any suggestions on how to figure out whether it's a problem
> > with the code or with the OMPI installation/software on the system?
> > We have tried "--debug-daemons" with no new/interesting information
> > being revealed.  Is there a way to trap segfault messages, get more
> > detailed MPI transaction information, or anything else that could help
> > diagnose this?
> >
> > Thanks.
> > -- 
> >  V. Ram
> >  v_r_...@fastmail.fm
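
To make the "trap segfault messages" part of the question above
concrete, here is the sort of thing I had in mind (bash syntax; the
btl_base_verbose level is a guess at something chatty enough, and the
process count, host list, and executable name are again placeholders):

    # allow core files so a segfault leaves something gdb can inspect
    ulimit -c unlimited

    # rerun with the TCP BTL logging more of what it does
    mpirun --bynode -np 2 --host node3,node4 \
        --mca btl_tcp_if_include eth0 \
        --mca btl_base_verbose 30 ./program
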
-- 
  V. Ram
  v_r_...@fastmail.fm


