Also, I see mention in your FAQ about config.log. My Open MPI build does not appear to be generating it, at least not anywhere in the install tree. How can I enable the creation of the log file?
Thanks,
Dennis

-----Original Message-----
From: Dennis McRitchie
Sent: Friday, February 02, 2007 6:08 PM
To: 'Open MPI Users'
Subject: Can't run simple job with openmpi using the Intel compiler

When I submit a simple job (described below) using PBS, I always get one of the following two errors:

1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770)

The program does a uname and prints the results to standard out. The only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize. (A reconstructed sketch of such a program appears at the end of this message.) I have tried it with both Open MPI v1.1.2 and v1.1.4, built with the Intel C compiler 9.1.045, and get the same results. But if I build the same versions of Open MPI using gcc, the test program always works fine. The app itself is built with mpicc.

It runs successfully if run from the command line with "mpiexec -n X <test-program-name>", where X is 1 to 8, but if I wrap it in the following qsub command file:

---------------------------------------------------
#PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
#PBS -m abe
#
#PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
#
#PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr

cd /home/dmcr/my_mpi/openmpi
echo "About to call mpiexec"
module list
mpiexec -n 1 uname_test.intel
echo "After call to mpiexec"
----------------------------------------------------

it fails on any number of processors from 1 to 8, and the application segfaults. The complete standard error of an 8-processor job follows (note that mpiexec ran on adroit-31, but usually there is no info about adroit-31 in standard error):

-------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045        4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040  5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x5
[0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
*** End of error message ***
[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770)
--------------------------

The complete standard error of a 1-processor job follows:

--------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045        4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040  5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2
[0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
*** End of error message ***
[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=8840)
---------------------------

Any thoughts as to why this might be failing?

Thanks,
Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University
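
For reference, here is a minimal sketch of the kind of test program described above. It is reconstructed from the description in the message (uname output plus the four MPI calls listed), not the actual source; the variable names and output format are assumptions:

---------------------------------------------------
/* Reconstructed sketch, not the original source: prints uname(2)
 * fields along with the MPI rank and size, using only MPI_Init,
 * MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize. */
#include <mpi.h>
#include <stdio.h>
#include <sys/utsname.h>

int main(int argc, char *argv[])
{
    struct utsname uts;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* uname(2) fills in system identification strings on success. */
    if (uname(&uts) == 0)
        printf("rank %d of %d: %s %s %s %s %s\n", rank, size,
               uts.sysname, uts.nodename, uts.release,
               uts.version, uts.machine);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------

A program like this would be built with mpicc (e.g., "mpicc -o uname_test.intel uname_test.c") and run with "mpiexec -n X uname_test.intel", as in the message above.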