VASP can be temperamental. For example, I have a largish system (384 atoms) for which VASP hangs if I request more than 120 MD steps at a time. I am not convinced that this is Open MPI's problem. However, your case looks much more diagnosable than my silent spinning hang.
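For what it's worth, here is roughly how I would try to narrow a segfault like yours down. This is only a sketch: the process count and the VASP path are placeholders for whatever your Torque job actually uses, and the verbosity level is just an example.

    # Run these inside the Torque job script so they take effect on the compute nodes.
    # Let a crashing rank leave a core file behind for a backtrace:
    ulimit -c unlimited
    # Intel-compiled VASP often needs an unlimited stack once systems get larger:
    ulimit -s unlimited
    # Turn up TCP BTL logging to see what happens just before the connection resets:
    mpirun -np 8 --mca btl_base_verbose 30 /path/to/vasp

If one rank really is dying first, the "Connection reset by peer" messages from the other ranks are usually just the fallout of that process going away, not the root cause.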
On Tue, 2010-02-23 at 16:00 -0500, Thomas Sadowski wrote:
> Hello all,
>
> I am currently attempting to use OpenMPI as the MPI for my VASP
> calculations. VASP is an ab initio DFT code. Anyhow, I was able to
> compile and build OpenMPI v. 1.4.1 (I thought) correctly using the
> following command:
>
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
>
> Note that I am compiling OpenMPI for use with Torque/PBS, which was
> compiled using the Intel v. 10 Fortran compilers and gcc for C/C++.
> After building OpenMPI, I successfully used it to compile VASP against
> Intel MKL v. 10.2. I am running OpenMPI in a heterogeneous cluster
> configuration, and I used an NFS mount so that all the compute nodes
> can access the executable. Our hardware configuration is as follows:
>
> 7 nodes: 2 single-core AMD Opteron processors, 2 GB of RAM (henceforth
> called old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2 GB of RAM (henceforth
> called new nodes)
>
> We are currently running SUSE v. 8.x. Now we have problems when we
> attempt to run VASP on multiple nodes. A small system (~10 atoms) runs
> perfectly well with Torque and OpenMPI in all cases: on a single old
> node, on a single new node, or across multiple old and new nodes.
> Larger systems (>24 atoms) are able to run to completion if they are
> kept within a single old or new node. However, if I try to run a job
> across multiple old or new nodes I receive a segfault. In particular,
> the error is as follows:
>
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
>
> It seems to me that this is a memory issue; however, I may be mistaken.
> I have searched the archive and have not yet seen an adequate treatment
> of the problem. I have also tried other versions of OpenMPI. Does
> anyone have any insight into our issues?
>
> -Tom
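One more thought on the setup quoted above: since the larger jobs only fail once they span nodes, it may be worth confirming that every node really picks up the same NFS-mounted 1.4.1 install and that the TCP BTL is using the interface you expect. A rough sketch follows; the install prefix is taken from your configure line, and the interface name is only a guess for your cluster.

    # Check that the Torque (tm) components actually made it into this build:
    /home/tes98002/bin/ompi_info | grep tm
    # On each node, check which mpirun is found first in PATH (it should be the 1.4.1 install):
    which mpirun
    # If the nodes have more than one network interface, pin the TCP BTL to the right one:
    mpirun --mca btl_tcp_if_include eth0 -np 8 /path/to/vasp

None of this explains the segfault itself, but it rules out the usual mixed-install and wrong-interface suspects before digging into VASP.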