Re: [OMPI users] Segfault on any MPI communication on head node
Are you running the same OS version and Open MPI version between the head node and the regular nodes?

On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:

> Hey all,
> I've been racking my brains over this for several days and was hoping anyone
> could enlighten me. I'll describe only the relevant parts of the
> network/computer systems. There is one head node and a multitude of regular
> nodes. The regular nodes are all identical to each other. If I run an MPI
> program from one of the regular nodes to any other regular nodes, everything
> works. If I include the head node in the hosts file, I get segfaults, which
> I'll paste below along with sample code. The machines are all networked via
> InfiniBand and Ethernet. The issue only arises when MPI communication occurs.
> By this I mean, MPI_Init might succeed, but the segfault always occurs on
> MPI_Barrier or MPI_Send/Recv. I found a workaround by disabling the openib
> btl and enforcing that communications go over InfiniBand (if I don't force
> InfiniBand, it'll go over Ethernet). This command works when the head node is
> included in the hosts file:
>
> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>
> Sample Code:
>
> #include "mpi.h"
> #include
> int main(int argc, char *argv[])
> {
>    int rank, nprocs;
>    char* name[20];
>    int maxlen = 20;
>    MPI_Init(&argc,&argv);
>    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>    MPI_Barrier(MPI_COMM_WORLD);
>    gethostname(name,maxlen);
>    printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs,name);
>    fflush(stdout);
>    MPI_Finalize();
>    return 0;
> }
>
> Segfault:
> [pastec:19917] *** Process received signal ***
> [pastec:19917] Signal: Segmentation fault (11)
> [pastec:19917] Signal code: Address not mapped (1)
> [pastec:19917] Failing at address: 0x8
> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
> [pastec:19917] [13] ./b.out() [0x400919]
> [pastec:19917] *** End of error message ***
> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> --
> mpirun noticed that process rank 1 with PID 19917 on node
> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
> --
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
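For reference, a self-contained, cleaned-up version of the sample program from the report above. This is a reconstruction, not the poster's exact file: the header names after the second #include were stripped by the archive (stdio.h and unistd.h are assumed here), and char* name[20] is replaced by a plain char buffer, which is what gethostname() expects. Note that the backtrace shows the crash inside the openib BTL and libmthca, not in the application, so these cleanups do not change the underlying question about matching OS / Open MPI versions on the head node.

#include "mpi.h"
#include <stdio.h>      /* printf, fflush -- assumed header */
#include <unistd.h>     /* gethostname -- assumed header */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    char name[20];      /* plain char buffer instead of char* name[20] */
    int maxlen = 20;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);        /* the reported segfault occurs here */
    gethostname(name, maxlen);
    printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}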
Re: [OMPI users] PATH settings
On Sep 22, 2011, at 11:06 PM, Martin Siegert wrote: > I am trying to figure out how openmpi (1.4.3) sets its PATH > for executables. From the man page: > > Locating Files >If no relative or absolute path is specified for a file, Open MPI will >first look for files by searching the directories specified by the >--path option. If there is no --path option set or if the file is not >found at the --path location, then Open MPI will search the user’s PATH >environment variable as defined on the source node(s). Oops -- it's not the source node, it's the running node. That being said, sometimes they're the same thing, and sometimes the PATH is copied (by the underlying run-time environment) to the target node. > This does not appear to be entirely correct - as far as I can tell > openmpi always prepends its own bin directory to the PATH before > searching for the executable. Can that be switched off? It should not be doing that unless you are specifying the full path name to mpirun, or using the --prefix option. > Furthermore, openmpi appears to use > a) the current value of PATH on the node where mpiexec is running; > b) whatever PATH is used by ssh on the remote nodes. mpirun uses the $PATH local to where it is. We don't ship the PATH to the remote node unless you tell mpirun to via the -x PATH option (as you noted below). We've found that default shipping the PATH to remote nodes can cause unexpected problems. That being said, some run-time systems (e.g., SLURM, Torque) automatically ship the front-end PATH to the back-end machine(s) for you. Open MPI just "inherits" this PATH on the remote node, so to speak. ssh doesn't do this by default. Here's an example with 1.4.3 running SLURM on my test cluster at Cisco. This is in an SLURM allocation; I am running on the head node. Note that I'm a tcsh user, so I use "echo $path", not "echo $PATH": - [4:23] svbu-mpi:~ % hostname svbu-mpi.cisco.com # Note my original path [4:23] svbu-mpi:~ % echo $path /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /usr/kerberos/bin /usr/local/bin /bin /usr/bin /usr/X11R6/bin /opt/slurm/2.1.0/bin /data/home/ted/bin /data/home/ted/bin # Since I'm in a SLURM allocation, mpirun sends jobs to a remote node [4:23] svbu-mpi:~ % mpirun -np 1 hostname svbu-mpi020 # Here's my test script [4:23] svbu-mpi:~ % cat foo.csh #!/bin/tcsh -f echo $path # When I run this script through mpirun, the $path is the same # as was displayed above [4:23] svbu-mpi:~ % mpirun -np 1 foo.csh /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /usr/kerberos/bin /usr/local/bin /bin /usr/bin /usr/X11R6/bin /opt/slurm/2.1.0/bin /data/home/ted/bin /data/home/ted/bin # Now if I use the full path name to mpirun, I get an extra bonus # directory in the front of my $path -- the location of where # mpirun is located. 
[4:23] svbu-mpi:~ % /home/jsquyres/bogus/bin/mpirun -np 1 foo.csh /home/jsquyres/bogus/bin /home/jsquyres/bogus/bin /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /users/jsquyres/local/rhel5/bin /home/jsquyres/bogus/bin /users/jsquyres/local/bin /usr/local/bin /usr/kerberos/bin /usr/local/bin /bin /usr/bin /usr/X11R6/bin /opt/slurm/2.1.0/bin /data/home/ted/bin /data/home/ted/bin [4:23] svbu-mpi:~ % - > Thus, > > export PATH=/path/to/special/bin:$PATH > mpiexec -n 2 -H n1,n2 special > > (n1 being the local node) > will usually fail even if the directory structure is identical on > the two nodes. For this to work The PATH you set will be available on n1, but it depends on the underlying run-time launcher if it is available on n2. ssh will not copy your PATH to n2 by default, but others will (e.g., SLURM). > mpiexec -n 2 -H n1,n2 -x PATH special That will work for ssh in this case, yes. > What I would like to see is a configure option that allows me to configure > openmpi such that the current PATH on the node where mpiexec is running > is used as the PATH on all nodes (by default). Or is there a reason why > that is a really bad idea? If your nodes are not exactly the same, this can lead to all kinds of badness. That's why we didn't do it by default. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
How does the target application compile / link itself? Try running "file" on the Open MPI libraries and/or your target application .o files to see what their bitness is, etc. On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote: > Hi Jeff, > > You're right because I also tried 1.4.3, and it's the same issue > there. But what could be wrong? I'm using the simplest form - > ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed > compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu > 11.10. There is no any mpi installed from packages, and no -m32 > options around. What else could be the source? > > Thanks, > - D. > > 2011/9/22 Jeff Squyres : >> This usually means that you're mixing compiler/linker flags somehow (e.g., >> built something with 32 bit, built something else with 64 bit, try to link >> them together). >> >> Can you verify that everything was built with all the same 32/64? >> >> >> On Sep 22, 2011, at 1:21 PM, Dmitry N. Mikushin wrote: >> >>> Hi, >>> >>> OpenMPI 1.5.4 compiled with gcc 4.6.1 and linked with target app gives >>> a load of linker messages like this one: >>> >>> /usr/bin/ld: ../../lib/libutil.a(parallel_utilities.o)(.debug_info+0x529d): >>> unresolvable R_X86_64_64 relocation against symbol >>> `mpi_fortran_argv_null_ >>> >>> There are a lot of similar messages about other mpi_fortran_ symbols. >>> Is it a known issue? >>> >>> Thanks, >>> - D. >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Hi Jeff, Today I've verified this application on the Feroda 15 x86_64, where I'm usually building OpenMPI from source using the same method. Result: no link errors there! So, the issue is likely ubuntu-specific. Target application is compiled linked with mpif90 pointing to /opt/openmpi_gcc-1.5.4/bin/mpif90 I built. Regarding architectures, everything in target folders and OpenMPI installation is ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped - D. 2011/9/24 Jeff Squyres : > How does the target application compile / link itself? > > Try running "file" on the Open MPI libraries and/or your target application > .o files to see what their bitness is, etc. > > > On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote: > >> Hi Jeff, >> >> You're right because I also tried 1.4.3, and it's the same issue >> there. But what could be wrong? I'm using the simplest form - >> ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed >> compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu >> 11.10. There is no any mpi installed from packages, and no -m32 >> options around. What else could be the source? >> >> Thanks, >> - D. >> >> 2011/9/22 Jeff Squyres : >>> This usually means that you're mixing compiler/linker flags somehow (e.g., >>> built something with 32 bit, built something else with 64 bit, try to link >>> them together). >>> >>> Can you verify that everything was built with all the same 32/64? >>> >>> >>> On Sep 22, 2011, at 1:21 PM, Dmitry N. Mikushin wrote: >>> Hi, OpenMPI 1.5.4 compiled with gcc 4.6.1 and linked with target app gives a load of linker messages like this one: /usr/bin/ld: ../../lib/libutil.a(parallel_utilities.o)(.debug_info+0x529d): unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_argv_null_ There are a lot of similar messages about other mpi_fortran_ symbols. Is it a known issue? Thanks, - D. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Can you compile / link simple OMPI applications without this problem? On Sep 24, 2011, at 7:54 AM, Dmitry N. Mikushin wrote: > Hi Jeff, > > Today I've verified this application on the Feroda 15 x86_64, where > I'm usually building OpenMPI from source using the same method. > Result: no link errors there! So, the issue is likely ubuntu-specific. > > Target application is compiled linked with mpif90 pointing to > /opt/openmpi_gcc-1.5.4/bin/mpif90 I built. > > Regarding architectures, everything in target folders and OpenMPI > installation is > ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically > linked, not stripped > > - D. > > 2011/9/24 Jeff Squyres : >> How does the target application compile / link itself? >> >> Try running "file" on the Open MPI libraries and/or your target application >> .o files to see what their bitness is, etc. >> >> >> On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote: >> >>> Hi Jeff, >>> >>> You're right because I also tried 1.4.3, and it's the same issue >>> there. But what could be wrong? I'm using the simplest form - >>> ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed >>> compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu >>> 11.10. There is no any mpi installed from packages, and no -m32 >>> options around. What else could be the source? >>> >>> Thanks, >>> - D. >>> >>> 2011/9/22 Jeff Squyres : This usually means that you're mixing compiler/linker flags somehow (e.g., built something with 32 bit, built something else with 64 bit, try to link them together). Can you verify that everything was built with all the same 32/64? On Sep 22, 2011, at 1:21 PM, Dmitry N. Mikushin wrote: > Hi, > > OpenMPI 1.5.4 compiled with gcc 4.6.1 and linked with target app gives > a load of linker messages like this one: > > /usr/bin/ld: > ../../lib/libutil.a(parallel_utilities.o)(.debug_info+0x529d): > unresolvable R_X86_64_64 relocation against symbol > `mpi_fortran_argv_null_ > > There are a lot of similar messages about other mpi_fortran_ symbols. > Is it a known issue? > > Thanks, > - D. > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Compile and link - yes, but it turns out there was some unnoticed compilation error because ./hellompi: error while loading shared libraries: libmpi_f77.so.1: cannot open shared object file: No such file or directory and this library does not exist. Hm. 2011/9/24 Jeff Squyres : > Can you compile / link simple OMPI applications without this problem? > > On Sep 24, 2011, at 7:54 AM, Dmitry N. Mikushin wrote: > >> Hi Jeff, >> >> Today I've verified this application on the Feroda 15 x86_64, where >> I'm usually building OpenMPI from source using the same method. >> Result: no link errors there! So, the issue is likely ubuntu-specific. >> >> Target application is compiled linked with mpif90 pointing to >> /opt/openmpi_gcc-1.5.4/bin/mpif90 I built. >> >> Regarding architectures, everything in target folders and OpenMPI >> installation is >> ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically >> linked, not stripped >> >> - D. >> >> 2011/9/24 Jeff Squyres : >>> How does the target application compile / link itself? >>> >>> Try running "file" on the Open MPI libraries and/or your target application >>> .o files to see what their bitness is, etc. >>> >>> >>> On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote: >>> Hi Jeff, You're right because I also tried 1.4.3, and it's the same issue there. But what could be wrong? I'm using the simplest form - ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu 11.10. There is no any mpi installed from packages, and no -m32 options around. What else could be the source? Thanks, - D. 2011/9/22 Jeff Squyres : > This usually means that you're mixing compiler/linker flags somehow > (e.g., built something with 32 bit, built something else with 64 bit, try > to link them together). > > Can you verify that everything was built with all the same 32/64? > > > On Sep 22, 2011, at 1:21 PM, Dmitry N. Mikushin wrote: > >> Hi, >> >> OpenMPI 1.5.4 compiled with gcc 4.6.1 and linked with target app gives >> a load of linker messages like this one: >> >> /usr/bin/ld: >> ../../lib/libutil.a(parallel_utilities.o)(.debug_info+0x529d): >> unresolvable R_X86_64_64 relocation against symbol >> `mpi_fortran_argv_null_ >> >> There are a lot of similar messages about other mpi_fortran_ symbols. >> Is it a known issue? >> >> Thanks, >> - D. >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Check the output from when you ran Open MPI's configure and "make all" -- did it decide to build the F77 interface? Also check that gcc and gfortran output .o files of the same bitness / type. On Sep 24, 2011, at 8:07 AM, Dmitry N. Mikushin wrote: > Compile and link - yes, but it turns out there was some unnoticed > compilation error because > > ./hellompi: error while loading shared libraries: libmpi_f77.so.1: > cannot open shared object file: No such file or directory > > and this library does not exist. > > Hm. > > 2011/9/24 Jeff Squyres : >> Can you compile / link simple OMPI applications without this problem? >> >> On Sep 24, 2011, at 7:54 AM, Dmitry N. Mikushin wrote: >> >>> Hi Jeff, >>> >>> Today I've verified this application on the Feroda 15 x86_64, where >>> I'm usually building OpenMPI from source using the same method. >>> Result: no link errors there! So, the issue is likely ubuntu-specific. >>> >>> Target application is compiled linked with mpif90 pointing to >>> /opt/openmpi_gcc-1.5.4/bin/mpif90 I built. >>> >>> Regarding architectures, everything in target folders and OpenMPI >>> installation is >>> ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically >>> linked, not stripped >>> >>> - D. >>> >>> 2011/9/24 Jeff Squyres : How does the target application compile / link itself? Try running "file" on the Open MPI libraries and/or your target application .o files to see what their bitness is, etc. On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote: > Hi Jeff, > > You're right because I also tried 1.4.3, and it's the same issue > there. But what could be wrong? I'm using the simplest form - > ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed > compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu > 11.10. There is no any mpi installed from packages, and no -m32 > options around. What else could be the source? > > Thanks, > - D. > > 2011/9/22 Jeff Squyres : >> This usually means that you're mixing compiler/linker flags somehow >> (e.g., built something with 32 bit, built something else with 64 bit, >> try to link them together). >> >> Can you verify that everything was built with all the same 32/64? >> >> >> On Sep 22, 2011, at 1:21 PM, Dmitry N. Mikushin wrote: >> >>> Hi, >>> >>> OpenMPI 1.5.4 compiled with gcc 4.6.1 and linked with target app gives >>> a load of linker messages like this one: >>> >>> /usr/bin/ld: >>> ../../lib/libutil.a(parallel_utilities.o)(.debug_info+0x529d): >>> unresolvable R_X86_64_64 relocation against symbol >>> `mpi_fortran_argv_null_ >>> >>> There are a lot of similar messages about other mpi_fortran_ symbols. >>> Is it a known issue? >>> >>> Thanks, >>> - D. 
>>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] on cluster job slowdown near end
You might want to run some profiling / timing to see what parts of the application start running slower over time. Also check for memory leaks. On Sep 22, 2011, at 5:44 PM, Tom Hilinski wrote: > Hi, A job I am running slows down as it approaches the end. I'd > appreciate any ideas you may have on possible cause or what else I can > look at for diagnostic info. > > Environment: > * Linux cluster, very recent version of Fedora. > * openmpi 1.5 > > Characteristics of job: > * Tasks are all the same size and duration. > * 56K tasks, but multiple tasks given to each process. > * Typically run 120 processes. > * Slowdown starts at ~52K completed, then rate of completion of each > task declines geometrically from ~1k/minute to 4/minute at 54K. > > Here are some queries done when the slowdown occurs: > > * "ps" on master node - most processes in suspend state: > F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD > 0 S 3348 27933 15675 0 80 0 - 13608 poll_s pts/000:00:00 mpiexec > 0 S 3348 28009 27933 14 80 0 - 227632 epoll_ pts/0 00:08:13 C5MPI > 0 S 3348 28011 27933 14 80 0 - 227672 epoll_ pts/0 00:08:17 C5MPI > 0 S 3348 28013 27933 13 80 0 - 227713 epoll_ pts/0 00:08:06 C5MPI > 0 S 3348 28015 27933 13 80 0 - 227844 epoll_ pts/0 00:08:02 C5MPI > 0 S 3348 28017 27933 14 80 0 - 227849 epoll_ pts/0 00:08:13 C5MPI > 0 S 3348 28019 27933 13 80 0 - 227892 epoll_ pts/0 00:08:07 C5MPI > > * file handles (allocated handle count is ~constant): > $ cat /proc/sys/fs/file-nr > 39680 801014 > > * Processes in a suspend or run state (varies): > $ orte-top -pid 27933 | grep ' S |' | wc -l > 124 > $ orte-top -pid 27933 | grep ' R |' > Rank | Nodename | Command | Pid | State | Time | Pri | #threads | > Vsize |RSS | Peak Vsize | Shr Size | > 0 | rubel-001 | C5MPI | 14700 | R | 2.2H | 20 |1 | > 246208 | 12660 | 246208 |17664 | > 1 | rubel-001 | C5MPI | 14702 | R | 2.2H | 20 |1 | > 245360 | 44860 | 245360 |17664 | > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
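A minimal sketch of the per-task timing suggested above: wrap each unit of work in MPI_Wtime() calls and log the elapsed time per completion, so the point where the rate drops from ~1k/minute to 4/minute can be localized. The compute_task() routine and the task loop below are placeholders standing in for the real application, not code from this thread.

#include "mpi.h"
#include <stdio.h>

/* stand-in for one unit of application work; replace with the real task */
static void compute_task(int task_id)
{
    volatile double x = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        x += (double) i * task_id;
}

int main(int argc, char *argv[])
{
    int rank, task;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (task = 0; task < 100; task++) {
        t0 = MPI_Wtime();
        compute_task(task);
        t1 = MPI_Wtime();
        /* one line per task: rank, task id, elapsed seconds; collect and
           plot these to see which tasks (and which ranks) slow down */
        fprintf(stderr, "rank %d task %d %.3f s\n", rank, task, t1 - t0);
    }

    MPI_Finalize();
    return 0;
}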
Re: [OMPI users] problems with Intel 12.x compilers and OpenMPI (1.4.3)
As a pure guess, it might actually be this one: - Fix to detect and avoid overlapping memcpy(). Thanks to Francis Pellegrini for identifying the issue. We're actually very close to releasing 1.4.4 -- using the latest RC should be pretty safe. On Sep 23, 2011, at 5:51 AM, Paul Kapinos wrote: > Hi Open MPI volks, > > we see some quite strange effects with our installations of Open MPI 1.4.3 > with Intel 12.x compilers, which makes us puzzling: Different programs > reproducibly deadlock or die with errors alike the below-listed ones. > > Some of the errors looks alike programming issue at first look (well, a > deadlock *is* usually a programming error) but we do not believe it is so: > the errors arise in many well-tested codes including HPL (*) and only with a > special compiler + Open MPI version (Intel 12.x compiler + open MPI 1.4.3) > and only on special number of processes (usually high ones). For example, HPL > reproducible deadlocks with 72 procs and dies with error message #2 with 384 > processes. > > All this errors seem to be somehow related to MPI communicators; and 1.4.4rc3 > and in 1.5.3 and 1.5.4 seem not to have this problem. Also 1.4.3 if using > together with Intel 11.x compielr series seem to be unproblematic. So > probably this: > > (1.4.4 release notes:) > - Fixed a segv in MPI_Comm_create when called with GROUP_EMPTY. > Thanks to Dominik Goeddeke for finding this. > > is also fix for our issues? Or maybe not, because 1.5.3 is _older_ than this > fix? > > As far as we workarounded the problem by switching our production to 1.5.3 > this issue is not a "burning" one; but I decieded still to post this because > any issue on such fundamental things may be interesting for developers. > > Best wishes, > Paul Kapinos > > > (*) http://www.netlib.org/benchmark/hpl/ > > > Fatal error in MPI_Comm_size: Invalid communicator, error stack: > MPI_Comm_size(111): MPI_Comm_size(comm=0x0, size=0x6f4a90) failed > MPI_Comm_size(69).: Invalid communicator > > > [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** An error occurred in MPI_Comm_split > [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** on communicator MPI COMMUNICATOR 3 > SPLIT FROM 0 > [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERR_IN_STATUS: error code in > status > [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERRORS_ARE_FATAL (your MPI job > will now abort) > > > forrtl: severe (71): integer divide by zero > Image PC Routine Line Source > libmpi.so.0 2D9EDF52 Unknown Unknown Unknown > libmpi.so.0 2D9EE45D Unknown Unknown Unknown > libmpi.so.0 2D9C3375 Unknown Unknown Unknown > libmpi_f77.so.0 2D75C37A Unknown Unknown Unknown > vasp_mpi_gamma 0057E010 Unknown Unknown Unknown > vasp_mpi_gamma 0059F636 Unknown Unknown Unknown > vasp_mpi_gamma 00416C5A Unknown Unknown Unknown > vasp_mpi_gamma 00A62BEE Unknown Unknown Unknown > libc.so.6 003EEB61EC5D Unknown Unknown Unknown > vasp_mpi_gamma 00416A29 Unknown Unknown Unknown > > > -- > Dipl.-Inform. Paul Kapinos - High Performance Computing, > RWTH Aachen University, Center for Computing and Communication > Seffenter Weg 23, D 52074 Aachen (Germany) > Tel: +49 241/80-24915 > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Trouble compiling 1.4.3 with PGI 10.9 compilers
Just out of curiosity, does Open MPI 1.5.4 build properly? We've seen problems with the PGI compiler suite before -- it *does* look like a problem with libtool-building issues; e.g., a switch is too old or is missing or something. Meaning: it looks like PGI thinks it's trying to build an application, not a library. This is usually bit rot in libtool (i.e., PGI may have changed their options, but we're using an older Libtool in the 1.4.x series that doesn't know about this option). I do note that we fixed some libtool issues in the 1.4.4 tarball; could you try the 1.4.4rc and see if that fixes the issue? If not, we might have missed some patches to bring over to the v1.4 branch. http://www.open-mpi.org/software/ompi/v1.4/ On Sep 20, 2011, at 1:16 PM, Blosch, Edwin L wrote: > I'm having trouble building 1.4.3 using PGI 10.9. I searched the list > archives briefly but I didn't stumble across anything that looked like the > same problem, so I thought I'd ask if an expert might recognize the nature of > the problem here. > > The configure command: > > ./configure --prefix=/release/openmpi-pgi --without-tm --without-sge > --enable-mpirun-prefix-by-default --enable-contrib-no-build=vt > --enable-mca-no-build=maffinity --disable-per-user-config-files > --disable-io-romio --with-mpi-f90-size=small --enable-static --disable-shared > --with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend > CXX=/appserv/pgi/linux86-64/10.9/bin/pgCC > CC=/appserv/pgi/linux86-64/10.9/bin/pgcc 'CFLAGS= -O2 -Mcache_align -Minfo > -Msignextend -Msignextend' 'CXXFLAGS= -O2 -Mcache_align -Minfo -Msignextend > -Msignextend' F77=/appserv/pgi/linux86-64/10.9/bin/pgf95 > 'FFLAGS=-D_GNU_SOURCE -O2 -Mcache_align -Minfo -Munixlogical' > FC=/appserv/pgi/linux86-64/10.9/bin/pgf95 'FCFLAGS=-D_GNU_SOURCE -O2 > -Mcache_align -Minfo -Munixlogical' 'LDFLAGS= -Bstatic_pgi' > > The place where the build eventually dies: > > /bin/sh ../../../libtool --tag=CXX --mode=link > /appserv/pgi/linux86-64/10.9/bin/pgCC -DNDEBUG -O2 -Mcache_align -Minfo > -Msignextend -Msignextend -version-info 0:1:0 -export-dynamic -Bstatic_pgi > -o libmpi_cxx.la -rpath /release/cfd/openmpi-pgi/lib mpicxx.lo intercepts.lo > comm.lo datatype.lo win.lo file.lo ../../../ompi/libmpi.la -lnsl -lutil > -lpthread > libtool: link: tpldir=Template.dir > libtool: link: rm -rf Template.dir > libtool: link: /appserv/pgi/linux86-64/10.9/bin/pgCC --prelink_objects > --instantiation_dir Template.dir mpicxx.o intercepts.o comm.o datatype.o > win.o file.o > pgCC-Warning-prelink_objects switch is deprecated > pgCC-Warning-instantiation_dir switch is deprecated > /usr/lib64/crt1.o: In function `_start': > /usr/src/packages/BUILD/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:109: > undefined reference to `main' > mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec': > (.text+0x49): undefined reference to `ompi_mpi_errors_are_fatal' > mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec': > (.text+0x62): undefined reference to `ompi_mpi_errors_return' > mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec': > (.text+0x7b): undefined reference to `ompi_mpi_errors_throw_exceptions' > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] custom sparse collective non-reproducible deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv question
Some random points: 1. Are your counts ever 0? In principle, method 1 should be fine, I think. But with blocking, I *think* you should be fine, but I haven't thought hard about this -- I have a nagging feeling that there might be a possibility of deadlock in there, but I could be wrong. 2. It's been a long, long time since I've used the STL. Is &foo[x][y] guaranteed to give the address of a contiguous buffer that MPI can use? More specifically, is &foo[x][y] guaranteed to be equal to (&foo[x][y + N] - N * sizeof(T))? I have a dim recollection of needing to use .cptr() or something like that... but this is a very old memory from many years ago. 3. Why not use MPI_Alltoallw? On Sep 17, 2011, at 10:06 PM, Evghenii Gaburov wrote: > Hi All, > > My MPI program's basic task consists of regularly establishing point-to-point > communication with other procs via MPI_Alltoall, and then to communicate > data. I tested it on two HPC clusters with 32-256 MPI tasks. One of the > systems (HPC1) this custom collective runs flawlessly, while on another one > (HPC2) the collective causes non-reproducible deadlocks (after a day of > running, or after of few hours or so). So, I want to figure out whether it is > a system (HPC2) bug that I can communicate to HPC admins, or a subtle bug in > my code that needs to be fixed. One possibly important point, I communicate > huge amount of data between tasks (up to ~2GB of data) in several all2all > calls. > > I would like to have expert eyes to look at the code to confirm or disprove > that the code is deadlock-safe. I have implemented several methods (METHOD1 - > METHOD4), that, if I am not mistaken, should in principle be deadlock safe. > However, as a beginner MPI user, I can easily miss something subtle, as such > I seek you help with this! I mostly used METHOD4 which have caused periodic > deadlock, after having deadlocks with METHOD1 and METHOD2. On HPC1 none these > methods deadlock in my runs. METHOD3 I am currently testing, so cannot > comment on it as yet but will later; however, I will be happy to hear your > comments. > > Both system use openmpi-1.4.3. > > Your answers will be of great help! Thanks! 
> > Cheers, > Evghenii > > Here is the code snippet: > > template >void all2all(std::vector sbuf[], std::vector rbuf[], >const int myid, >const int nproc) >{ > static int nsend[NMAXPROC], nrecv[NMAXPROC]; >for (int p = 0; p < nproc; p++) > nsend[p] = sbuf[p].size(); >MPI_Alltoall(nsend, 1, MPI_INT, nrecv, 1, MPI_INT, MPI_COMM_WORLD); > // let the other tasks know how much data they will receive from this one > > #ifdef _METHOD1_ > > static MPI_Status stat[NMAXPROC ]; > static MPI_Request req[NMAXPROC*2]; > int nreq = 0; > for (int p = 0; p < nproc; p++) > if (p != myid) > { > const int scount = nsend[p]; > const int rcount = nrecv[p]; > rbuf[p].resize(rcount); > if (scount > 0) MPI_Isend(&sbuf[p][0], nscount, datatype(), p, > 1, MPI_COMM_WORLD, &req[nreq++]); > if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount, datatype(), p, > 1, MPI_COMM_WORLD, &req[nreq++]); > } > rbuf[myid] = sbuf[myid]; > MPI_Waitall(nreq, req, stat); > > #elif defined _METHOD2_ > > static MPI_Status stat; > for (int p = 0; p < nproc; p++) > if (p != myid) > { >const int scount = nsend[p]*scale; >const int rcount = nrecv[p]*scale; >rbuf[p].resize(rcount); >if (scount + rcount > 0) > MPI_Sendrecv(&sbuf[p][0], scount, datatype(), p, 1, >&rbuf[p][0], rcount, datatype(), p, 1, > MPI_COMM_WORLD, &stat); > } >rbuf[myid] = sbuf[myid]; > > #elif defined _METHOD3_ > > static MPI_Status stat[NMAXPROC ]; > static MPI_Request req[NMAXPROC*2]; > for (int dist = 1; dist < nproc; dist++) > { > const int src = (nproc + myid - dist) % nproc; > const int dst = (nproc + myid + dist) % nproc; > const int scount = nsend[dst]*scale; > const int rcount = nrecv[src]*scale; > rbuf[src].resize(rcount); > int nreq = 0; > if (scount > 0) MPI_Isend(&sbuf[dst][0], scount, datatype(), > dst, 1, MPI_COMM_WORLD, &req[nreq++]); > if (rcount > 0) MPI_Irecv(&rbuf[src][0], rcount, datatype(), > src, 1, MPI_COMM_WORLD, &req[nreq++]); > MPI_Waitall(nreq, req, stat); > } > rbuf[myid] = sbuf[myid]; > > #elif defined _METHOD4_ > > static MPI_Status stat; > for (int dist = 1; dist < nproc; dist++) > { > const int src = (nproc + myid - dist) % nproc; > const int dst = (nproc + myid + dist) % nproc; > const int scount = nsend[dst]*scale; > const in
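On point 3, a rough sketch of what the exchange could look like as a single collective, using MPI_Alltoallv for the case where every block has the same datatype (MPI_Alltoallw additionally allows per-peer datatypes). The buffer layout and names here are illustrative assumptions, not the poster's code, and the int counts/displacements mean that multi-GB per-peer transfers would still need to be split or expressed with derived datatypes.

#include "mpi.h"
#include <stdlib.h>

/* Exchange variable-sized blocks of doubles in one collective call.
 * sendbuf holds all outgoing data packed contiguously by destination rank;
 * sendcounts[p] is the number of doubles going to rank p. */
void exchange(double *sendbuf, int *sendcounts,
              double **recvbuf_out, int *recvcounts, int nproc)
{
    int *sdispls = (int *) malloc(nproc * sizeof(int));
    int *rdispls = (int *) malloc(nproc * sizeof(int));
    double *recvbuf;
    int p, stotal = 0, rtotal = 0;

    /* every rank learns how much it will receive from every other rank */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT,
                 MPI_COMM_WORLD);

    for (p = 0; p < nproc; p++) {
        sdispls[p] = stotal;  stotal += sendcounts[p];
        rdispls[p] = rtotal;  rtotal += recvcounts[p];
    }

    recvbuf = (double *) malloc((size_t) rtotal * sizeof(double));
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);

    *recvbuf_out = recvbuf;
    free(sdispls);
    free(rdispls);
}

Letting the library's collective do the pairing and ordering of sends and receives takes the deadlock question out of the application code entirely, at the cost of packing the send data contiguously first.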
Re: [OMPI users] freezing in mpi_allreduce operation
Holy crimminey, I'm totally lost in your Fortran syntax. :-) What you describe might be a bug in our MPI_IN_PLACE handling for MPI_ALLREDUCE. Could you possible make a small test case that a) we can run, and b) uses straightforward Fortran? (avoid using terms like "assumed shape" and "assumed size" and ...any other Fortran stuff that confuses simple C programmers like us :-) ) What version of Open MPI is this? On Sep 8, 2011, at 5:59 PM, Greg Fischer wrote: > Note also that coding the mpi_allreduce as: > >call > mpi_allreduce(MPI_IN_PLACE,phim(0,1,1,1,grp),phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr) > > results in the same freezing behavior in the 60th iteration. (I don't recall > why the arrays were being passed, possibly just a mistake.) > > > On Thu, Sep 8, 2011 at 4:17 PM, Greg Fischer wrote: > I am seeing mpi_allreduce operations freeze execution of my code on some > moderately-sized problems. The freeze does not manifest itself in every > problem. In addition, it is in a portion of the code that is repeated many > times. In the problem discussed below, the problem appears in the 60th > iteration. > > The current test case that I'm looking at is a 64-processor job. This > particular mpi_allreduce call applies to all 64 processors, with each > communicator in the call containing a total of 4 processors. When I add > print statements before and after the offending line, I see that all 64 > processors successfully make it to the mpi_allreduce call, but only 32 > successfully exit. Stack traces on the other 32 yield something along the > lines of the trace listed at the bottom of this message. The call, itself, > looks like: > > call mpi_allreduce(MPI_IN_PLACE, > phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), & > > phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr) > > These messages are sized to remain under the 32-bit integer size limitation > for the "count" parameter. The intent is to perform the allreduce operation > on a contiguous block of the array. Previously, I had been passing an > assumed-shape array (i.e. phim(:,:,:,:,grp), but found some documentation > indicating that was potentially dangerous. Making the change from assumed- > to explicit-shaped arrays doesn't solve the problem. However, if I declare > an additional array and use separate send and receive buffers: > > call > mpi_allreduce(phim_local,phim_global,phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr) > phim(:,:,:,:,grp) = phim_global > > Then the problem goes away, and every thing works normally. Does anyone have > any insight as to what may be happening here? I'm using "include 'mpif.h'" > rather than the f90 module, does that potentially explain this? > > Thanks, > Greg > > Stack trace(s) for thread: 1 > - > [0] (1 processes) > - > main() at ?:? > solver() at solver.f90:31 > solver_q_down() at solver_q_down.f90:52 > iter() at iter.f90:56 > mcalc() at mcalc.f90:38 > pmpi_allreduce__() at ?:? > PMPI_Allreduce() at ?:? > ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:? > ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:? > ompi_coll_tuned_sendrecv_actual() at ?:? > ompi_request_default_wait_all() at ?:? > opal_progress() at ?:? > Stack trace(s) for thread: 2 > - > [0] (1 processes) > - > start_thread() at ?:? > btl_openib_async_thread() at ?:? > poll() at ?:? > Stack trace(s) for thread: 3 > - > [0] (1 processes) > - > start_thread() at ?:? > service_thread_start() at ?:? > select() at ?:? 
> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
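For what it's worth, the MPI_IN_PLACE pattern being exercised looks like this in C; a small Fortran reproducer of the kind asked for above should boil down to the same call shape (this sketch is illustrative only and is not from the thread -- the actual report involves Fortran assumed- vs. explicit-shape arrays, which the reproducer would need to mirror).

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i;
    float phim[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 8; i++)
        phim[i] = (float)(rank + i);

    /* in-place sum across all ranks: every rank passes MPI_IN_PLACE as the
       send buffer and its own array as the receive buffer */
    MPI_Allreduce(MPI_IN_PLACE, phim, 8, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("phim[0] = %f\n", phim[0]);

    MPI_Finalize();
    return 0;
}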
Re: [OMPI users] PATH settings
Thanks, Jeff, for the details! On Sat, Sep 24, 2011 at 07:26:49AM -0400, Jeff Squyres wrote: > On Sep 22, 2011, at 11:06 PM, Martin Siegert wrote: > > > I am trying to figure out how openmpi (1.4.3) sets its PATH > > for executables. From the man page: > > > > Locating Files > >If no relative or absolute path is specified for a file, Open MPI will > >first look for files by searching the directories specified by the > >--path option. If there is no --path option set or if the file is not > >found at the --path location, then Open MPI will search the user’s PATH > >environment variable as defined on the source node(s). > > Oops -- it's not the source node, it's the running node. That being said, > sometimes they're the same thing, and sometimes the PATH is copied (by the > underlying run-time environment) to the target node. > > > This does not appear to be entirely correct - as far as I can tell > > openmpi always prepends its own bin directory to the PATH before > > searching for the executable. Can that be switched off? > > It should not be doing that unless you are specifying the full path name to > mpirun, or using the --prefix option. By now I recognize that my tests where flawed in in several aspects: 1) the path settings depend on whether you specify the full path to mpiexec (as you mention), i.e., "/usr/local/openmpi/bin/mpiexec" does things differently than "mpiexec" even though the executable is the same. 2) it makes a difference whether mpiexec runs from a torque batch job or interactively (as you say below as well). Nevertheless, I cannot avoid mpiexec prepending its own directory to the PATH. This is what I tried: dev:~> echo $PATH /usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin # this is the default PATH on every node dev:~> cat /home/siegert/scratch/test/path-0.0.1/bin/path.sh #!/bin/bash echo "`hostname`, $0:" echo $PATH dev:~> cat path.pbs #!/bin/bash #PBS -N path #PBS -l walltime=1:00 #PBS -l nodes=2:ppn=1 export PATH=/home/siegert/scratch/test/path-0.0.1/bin:$PATH echo $PATH mpiexec path.sh dev:~> qsub path.pbs 14.dev dev:~> cat path.o14 /home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin b414, /home/siegert/scratch/test/path-0.0.1/bin/path.sh: /usr/local/openmpi/bin:/usr/local/openmpi/bin:/home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin b413, /home/siegert/scratch/test/path-0.0.1/bin/path.sh: /usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/openmpi/bin:/home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin Thus, on the local node (where mpiexec is run) /usr/local/openmpi/bin is prepended twice, on the remote node /usr/local/openmpi/bin is prepended three times. 
But, this is the first point where I tricked myself: our "mpiexec" is a wrapper script (in /usr/local/bin) that calls /usr/local/openmpi/bin/mpiexec: dev:~> which mpiexec /usr/local/bin/mpiexec dev:~> which orterun /usr/local/openmpi/bin/orterun But, when I replace "mpiexec" in path.pbs with "orterun" the following happens: dev:~> cat path.pbs #!/bin/bash #PBS -N path #PBS -l walltime=1:00 #PBS -l nodes=2:ppn=1 export PATH=/home/siegert/scratch/test/path-0.0.1/bin:$PATH echo $PATH orterun path.sh dev:~> qsub path.pbs 15.dev dev:~> cat path.o15 /home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin b414, /home/siegert/scratch/test/path-0.0.1/bin/path.sh: /usr/local/openmpi-1.4.3/bin:/usr/local/openmpi-1.4.3/bin:/home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin b413, /home/siegert/scratch/test/path-0.0.1/bin/path.sh: /usr/local/openmpi-1.4.3/bin:/usr/local/openmpi-1.4.3/bin:/usr/local/openmpi-1.4.3/bin:/home/siegert/scratch/test/path-0.0.1/bin:/usr/local/bin:/usr/local/openmpi/bin:/usr/local/moab/bin:/usr/local/torque/bin:/bin:/usr/bin:/home/siegert/bin:/home/siegert/bin It appears that now "orterun" does something like "readlink -f $0": /usr/local/openmpi is actually a softlink to /usr/local/openmpi-1.4.3. Anyway, again the directory where the orterun executable is located gets prepended twice on the local and three times on the remote node. Only adding the --noprefix option to orterun avoids the prepending of the directory (when calling "/usr/local/openmpi/bin/mpiexec --noprefix" the --noprefix flag has no effect). I guess, I could achieve what I want by using "orterun --noprefix" from the wrapper script. > > Furthermore, o