Sorry for the problem - the issue is a bug in the handling of the --pernode option in 1.4.2. This has been fixed and is awaiting release in 1.4.3.
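Until 1.4.3 is out, one possible interim workaround (untested here, and assuming the goal is simply one process on each of the two hosts) is to skip --pernode and request the process count explicitly, e.g.:

  mpirun --host idgc3grid01,compute-0-11 -np 2 hello_mpi

With each host listed once, the default mapping should place one process per node; if that still hits the same code path, please let us know.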
On Jun 21, 2010, at 5:27 PM, Riccardo Murri wrote:

> Hello,
>
> I'm using OpenMPI 1.4.2 on a Rocks 5.2 cluster. I compiled it on my
> own to have a thread-enabled MPI (the OMPI coming with Rocks 5.2
> apparently only supports MPI_THREAD_SINGLE), and installed into ~/sw.
>
> To test the newly installed library I compiled a simple "hello world"
> that comes with Rocks::
>
>   [murri@idgc3grid01 hello_mpi.d]$ cat hello_mpi.c
>   #include <stdio.h>
>   #include <sys/utsname.h>
>
>   #include <mpi.h>
>
>   int main(int argc, char **argv) {
>     int myrank;
>     struct utsname unam;
>
>     MPI_Init(&argc, &argv);
>
>     uname(&unam);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>     printf("Hello from rank %d on host %s\n", myrank, unam.nodename);
>
>     MPI_Finalize();
>   }
>
> The program runs fine as long as it only uses ranks on localhost::
>
>   [murri@idgc3grid01 hello_mpi.d]$ mpirun --host localhost -np 2 hello_mpi
>   Hello from rank 1 on host idgc3grid01.uzh.ch
>   Hello from rank 0 on host idgc3grid01.uzh.ch
>
> However, as soon as I try to run on more than one host, I get a
> segfault::
>
>   [murri@idgc3grid01 hello_mpi.d]$ mpirun --host idgc3grid01,compute-0-11 --pernode hello_mpi
>   [idgc3grid01:13006] *** Process received signal ***
>   [idgc3grid01:13006] Signal: Segmentation fault (11)
>   [idgc3grid01:13006] Signal code: Address not mapped (1)
>   [idgc3grid01:13006] Failing at address: 0x50
>   [idgc3grid01:13006] [ 0] /lib64/libpthread.so.0 [0x359420e4c0]
>   [idgc3grid01:13006] [ 1] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b352d00265b]
>   [idgc3grid01:13006] [ 2] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x676) [0x2b352d00e0e6]
>   [idgc3grid01:13006] [ 3] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xb8) [0x2b352d015358]
>   [idgc3grid01:13006] [ 4] /home/oci/murri/sw/lib/openmpi/mca_plm_rsh.so [0x2b352dcb9a80]
>   [idgc3grid01:13006] [ 5] mpirun [0x40345a]
>   [idgc3grid01:13006] [ 6] mpirun [0x402af3]
>   [idgc3grid01:13006] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x359361d974]
>   [idgc3grid01:13006] [ 8] mpirun [0x402a29]
>   [idgc3grid01:13006] *** End of error message ***
>   Segmentation fault
>
> I've already tried the suggestions posted to similar messages on the
> list: "ldd" reports that the executable is linked with the libraries
> in my home, not the system-wide OMPI::
>
>   [murri@idgc3grid01 hello_mpi.d]$ ldd hello_mpi
>   libmpi.so.0 => /home/oci/murri/sw/lib/libmpi.so.0 (0x00002ad2bd6f2000)
>   libopen-rte.so.0 => /home/oci/murri/sw/lib/libopen-rte.so.0 (0x00002ad2bd997000)
>   libopen-pal.so.0 => /home/oci/murri/sw/lib/libopen-pal.so.0 (0x00002ad2bdbe3000)
>   libdl.so.2 => /lib64/libdl.so.2 (0x0000003593e00000)
>   libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003596a00000)
>   libutil.so.1 => /lib64/libutil.so.1 (0x00000035a1000000)
>   libm.so.6 => /lib64/libm.so.6 (0x0000003593a00000)
>   libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003594200000)
>   libc.so.6 => /lib64/libc.so.6 (0x0000003593600000)
>   /lib64/ld-linux-x86-64.so.2 (0x0000003593200000)
>
> I've also checked with "strace" that the "mpi.h" file used during
> compile is the one in ~/sw/include and that all ".so" files being
> loaded from OMPI are the ones in ~/sw/lib. I can ssh without password
> to the target compute node. The "mpirun" and "mpicc" are the correct
> ones:
>
>   [murri@idgc3grid01 hello_mpi.d]$ which mpirun
>   ~/sw/bin/mpirun
>
>   [murri@idgc3grid01 hello_mpi.d]$ which mpicc
>   ~/sw/bin/mpicc
>
> I'm pretty stuck now; can anybody give me a hint?
>
> Thanks a lot for any help!
>
> Best regards,
> Riccardo
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users