Hello all,
I am currently attempting to use OpenMPI as the MPI implementation for my VASP calculations. VASP is an ab initio DFT code. Anyhow, I was able to compile and build OpenMPI v. 1.4.1 correctly (I thought) using the following command:

    ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local

Note that I am building OpenMPI for use with Torque/PBS; OpenMPI itself was compiled with the Intel v. 10 Fortran compilers and gcc for C/C++. After building OpenMPI, I successfully used it to compile VASP against Intel MKL v. 10.2. I am running OpenMPI on a heterogeneous cluster, and the executable sits on an NFS mount so that all the compute nodes can access it.

Our hardware configuration is as follows:

    7 nodes: 2 single-core AMD Opteron processors, 2 GB of RAM (henceforth called old nodes)
    4 nodes: 2 dual-core AMD Opteron processors, 2 GB of RAM (henceforth called new nodes)

We are currently running SUSE v. 8.x.

Now we have problems when we attempt to run VASP on multiple nodes. A small system (~10 atoms) runs perfectly well with Torque and OpenMPI in all cases: on a single old node, on a single new node, or across multiple old and new nodes. Larger systems (>24 atoms) run to completion as long as they stay within a single old or new node. However, if I try to run a job across multiple old or new nodes, I receive a segfault. The error output is as follows:

    [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
    --------------------------------------------------------------------------
    mpirun noticed that process rank 6 with PID 11985 on node node11 exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    forrtl: error (78): process killed (SIGTERM)
    forrtl: error (78): process killed (SIGTERM)
    forrtl: error (78): process killed (SIGTERM)
    forrtl: error (78): process killed (SIGTERM)

It seems to me that this is a memory issue, though I may be mistaken. I have searched the archives and have not yet seen an adequate treatment of the problem. I have also tried other versions of OpenMPI. Does anyone have any insight into our issue?

-Tom
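P.S. In case it helps with diagnosis, the build and a typical job submission look roughly like the sketch below. The configure line is the one quoted above, followed by the usual make and make install; the job name, node/process counts, and the VASP executable path are placeholders rather than our exact values.

    # Build OpenMPI 1.4.1 with Torque (tm) support and the Intel Fortran compilers
    ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
    make
    make install

    # Example Torque/PBS submission script (illustrative values)
    #!/bin/bash
    #PBS -N vasp_test
    #PBS -l nodes=2:ppn=2
    #PBS -j oe
    cd $PBS_O_WORKDIR
    # With --with-tm, mpirun takes the node list from Torque, so no hostfile is given
    /home/tes98002/bin/mpirun -np 4 /path/to/vasp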