VASP can be temperamental. For example, I have a largish system (384 atoms) for which VASP hangs if I request more than 120 MD steps at a time. I am not convinced that this is Open MPI's problem. However, your case looks much more diagnosable than my silent spinning hang.
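For what it's worth, here is roughly how I would try to narrow a segfault like yours down. This is only a sketch: the process count and the VASP path are placeholders for whatever your Torque job actually uses, and the verbosity level is just an example.

    # Run these inside the Torque job script so they take effect on the compute nodes.
    # Let a crashing rank leave a core file behind for a backtrace:
    ulimit -c unlimited
    # Intel-compiled VASP often needs an unlimited stack once systems get larger:
    ulimit -s unlimited
    # Turn up TCP BTL logging to see what happens just before the connection resets:
    mpirun -np 8 --mca btl_base_verbose 30 /path/to/vasp

If one rank really is dying first, the "Connection reset by peer" messages from the other ranks are usually just the fallout of that process going away, not the root cause.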
On Tue, 2010-02-23 at 16:00 -0500, Thomas Sadowski wrote:
> Hello all,
>
> I am currently attempting to use OpenMPI as the MPI for my VASP
> calculations. VASP is an ab initio DFT code. Anyhow, I was able to
> compile and build OpenMPI v. 1.4.1 (I thought) correctly using the
> following command:
>
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
>
> Note that I am compiling OpenMPI for use with Torque/PBS, which was
> compiled using the Intel v. 10 Fortran compilers and gcc for C/C++.
> After building OpenMPI, I successfully used it to compile VASP against
> Intel MKL v. 10.2. I am running OpenMPI in a heterogeneous cluster
> configuration, and I used an NFS mount so that all the compute nodes
> can access the executable. Our hardware configuration is as follows:
>
> 7 nodes: 2 single-core AMD Opteron processors, 2 GB of RAM (henceforth
> called old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2 GB of RAM (henceforth
> called new nodes)
>
> We are currently running SUSE v. 8.x. Now we have problems when we
> attempt to run VASP on multiple nodes. A small system (~10 atoms) runs
> perfectly well with Torque and OpenMPI in all cases: on a single old
> node, on a single new node, or across multiple old and new nodes.
> Larger systems (>24 atoms) are able to run to completion if they are
> kept within a single old or new node. However, if I try to run a job
> across multiple old or new nodes I receive a segfault. In particular,
> the error is as follows:
>
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
>
> It seems to me that this is a memory issue; however, I may be mistaken.
> I have searched the archive and have not yet seen an adequate treatment
> of the problem. I have also tried other versions of OpenMPI. Does
> anyone have any insight into our issues?
>
> -Tom
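One more thought on the setup quoted above: since the larger jobs only fail once they span nodes, it may be worth confirming that every node really picks up the same NFS-mounted 1.4.1 install and that the TCP BTL is using the interface you expect. A rough sketch follows; the install prefix is taken from your configure line, and the interface name is only a guess for your cluster.

    # Check that the Torque (tm) components actually made it into this build:
    /home/tes98002/bin/ompi_info | grep tm
    # On each node, check which mpirun is found first in PATH (it should be the 1.4.1 install):
    which mpirun
    # If the nodes have more than one network interface, pin the TCP BTL to the right one:
    mpirun --mca btl_tcp_if_include eth0 -np 8 /path/to/vasp

None of this explains the segfault itself, but it rules out the usual mixed-install and wrong-interface suspects before digging into VASP.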