[OMPI users] How to override default hostfile to specify host
Hi, If I use "orterun -H " and does not belong in the default hostfile ("etc/openmpi-default-hostfile"), openmpi gives an error. Is there an easy way to get the aforementioned command to work without specifying a different hostfile with in it? Thank you.
[OMPI users] OpenMPI w valgrind: need to recompile?
Hi, I am building libraries against OpenMPI, and then applications using those libraries. It was unclear from the FAQ at http://www.open-mpi.org/faq/?category=debugging#memchecker_how whether the libraries need to be recompiled and the application relinked using a valgrind-enabled mpicc etc. in order to get valgrind to work. In other words, can I run a valgrind-disabled OpenMPI app with a valgrind-enabled orterun, or do I have to recompile/relink the whole thing? Is the answer different for shared vs. static OpenMPI libraries? The FAQ also states that OpenMPI from v1.5 provides a valgrind suppression file. Is this a mistake in the FAQ, or is the suppression file just not available with the latest stable release (1.4)? If it is not available in 1.4, can the 1.5 file be used with 1.4? Thanks, saurabh
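For concreteness, this is roughly the kind of invocation I mean; the install prefix and the suppression-file location are assumptions based on the FAQ, so adjust for your installation:

    # run each rank under valgrind, using the suppression file the FAQ says ships with v1.5
    orterun -np 2 valgrind \
        --suppressions=/opt/openmpi/share/openmpi/openmpi-valgrind.supp \
        --leak-check=full ./my_app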
[OMPI users] Problems running 1.8.8 and compiling 1.10.1 on Redhat EL7
Hi, On Redhat Enterprise Linux 7, I am facing the following problems.

1. With OpenMPI 1.8.8, everything builds, but the following error appears on running:

> orterun -np 2 hello_cxx
hello_cxx: route/tc.c:973: rtnl_tc_register: Assertion `0' failed.
hello_cxx: route/tc.c:973: rtnl_tc_register: Assertion `0' failed.
--
orterun noticed that process rank 0 with PID 18229 on node sim18 exited on signal 6 (Aborted).
--

2. With OpenMPI 1.10.1, there is a failure to compile oshmem_info:

/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o): undefined reference to symbol '_end'
/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 so try adding it to the linker command line
/lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make[2]: *** [oshmem_info] Error 1

Is rhel7 not supported with OpenMPI?

saurabh
Re: [OMPI users] Problems running 1.8.8 and compiling 1.10.1 on Redhat EL7
> From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
> Date: 2015-11-06 18:02:42
>
> Both of these seem to be issues with libnl, which is a dependent library
> that Open MPI uses.

Based on your email, I found this message and thread: https://www.open-mpi.org/community/lists/devel/2015/08/17812.php which says the problem is a conflict between libnl and libnl3, and gives a workaround, namely to use --without-verbs during configure. Both my cases work with this option. Thank you. Sorry for the copy-paste; I did not enable mail delivery.

saurabh
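For anyone finding this later, a sketch of the workaround configure line, with an example prefix; note that --without-verbs disables the InfiniBand/verbs transport, so it is only suitable where verbs support is not needed:

    ./configure --prefix=/opt/openmpi-1.10.1 --without-verbs
    make && make install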
[OMPI users] Propagate current shell's environment
Hi, Is there any way with OpenMPI to propagate the current shell's environment to the parallel program? I am looking for something equivalent to how MPICH handles environment variables (https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_How_do_I_pass_environment_variables_to_the_processes_of_my_parallel_program_when_using_the_mpd.2C_hydra_or_gforker_process_manager.3F):

> By default, all the environment variables in the shell where mpiexec is run are passed to all processes of the application program.

OpenMPI has the parallel processes read bashrc, so the environment can be different for different processes, which is exactly what I want to avoid. I could not find any way of doing this in orterun --help or on the forums. Thank you. saurabh
Re: [OMPI users] Propagate current shell's environment
I meant different from the current shell, not different for different processes; sorry. Also, I am aware of -x, but it is not the right solution in this case because (a) it is manual, and (b) it appears that anything set in bashrc but unset in the current shell would still be set for the program, which I do not want.
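For context, -x usage looks roughly like this (the variable names are only examples), which illustrates point (a): every variable has to be listed by hand:

    export MY_SETTING=42
    orterun -x MY_SETTING -x LD_LIBRARY_PATH -np 4 ./my_app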
Re: [OMPI users] Propagate current shell's environment
I'd appreciate a response, even a simple no if this is not possible. Thank you. saurabh
[OMPI users] OpenMPI 1.10.1 crashes with file size limit <= 131072
Here's what I find:

> cd examples
> make hello_cxx
> ulimit -f 131073
> orterun -np 3 hello_cxx
Hello, world! [Etc]
> ulimit -f 131072
> orterun -np 3 hello_cxx
[OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
Hi, Sorry my previous email was garbled, sending it again.

> cd examples
> make hello_cxx
> ulimit -f 131073
> orterun -np 3 hello_cxx
Hello, world (etc)
> ulimit -f 131072
> orterun -np 3 hello_cxx
--
orterun noticed that process rank 0 with PID 4473 on node sim16 exited on signal 25 (File size limit exceeded).
--

Any thoughts?
Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
An "strace" showed something related to shared memory use was causing the signal. Sticking btl = ^sm into the openmpi-mca-params.conf file fixed this issue. saurabh From: saur...@hotmail.com To: us...@open-mpi.org Subject: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072 List-Post: users@lists.open-mpi.org Date: Thu, 19 Nov 2015 15:24:08 -0500 Hi, Sorry my previous email was garbled, sending it again. > cd examples > make hello_cxx > ulimit -f 131073 > orterun -np 3 hello_cxx Hello, world (etc) > ulimit -f 131072 > orterun -np 3 hello_cxx -- orterun noticed that process rank 0 with PID 4473 on node sim16 exited on signal 25 (File size limit exceeded). -- Any thoughts?
Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> Could you please provide a little more info regarding the environment you
> are running under (which resource mgr or not, etc), how many nodes you had
> in the allocation, etc?
> There is no reason why something should behave that way. So it would help
> if we could understand the setup.
> Ralph

To answer Ralph's question above, from the other thread: all nodes are on the same machine orterun was run on. It is a Red Hat 7 64-bit gcc 4.8 install of OpenMPI 1.10.1. The only atypical thing is that btl_tcp_if_exclude = virbr0 has been added to openmpi-mca-params.conf based on some failures I was seeing before. (And now, of course, I've added btl = ^sm as well to fix this issue; see my other response.)

Relevant output from strace (without the btl = ^sm) is below. Text in square brackets is my minor edits and snips.

open("/tmp/openmpi-sessions-[user]@[host]_0/40072/1/1/vader_segment.[host].1", O_RDWR|O_CREAT, 0600) = 12
ftruncate(12, 4194312) = 0
mmap(NULL, 4194312, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fe506c8a000
close(12) = 0
write(9, "\1\0\0\0\0\0\0\0", 8) = 8
[...]
poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = -1 EFBIG (File too large)
--- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=12329, si_uid=1005} ---
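For reference, a trace like the one above can be reproduced along these lines (the exact flags are my choice; -f follows the processes that orterun forks):

    ulimit -f 131072
    strace -f -o trace.out orterun -np 3 hello_cxx
    grep -E 'ftruncate|EFBIG|SIGXFSZ' trace.out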
Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
I apologize, I had the wrong lines from strace for the initial file there (of course). The file with fd = 11 which causes the problem is called shared_mem_pool.[host], and ftruncate(11, 134217736) is called on it. (The ulimit is in 1 KB blocks, so 131072 corresponds to 134217728 bytes; the requested size is just over that, which is why the limit of 131072 trips while 131073 does not.)
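Checking the numbers with shell arithmetic (the purpose of the 8 extra bytes is my guess, presumably a small header on the shared-memory segment):

    > echo $((131072 * 1024))
    134217728
    > echo $((134217736 - 131072 * 1024))
    8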
Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> For what it's worth, that's Open MPI creating a chunk of shared memory for use with on-server
> communication. It shows up as a "file", but it's really shared memory.
> You can disable sm and/or vader, but your on-server message passing performance will be
> significantly lower.
> Is there a reason you have a file size limit?

The file size limit is so our testing does not write runaway large files. I'm not satisfied that the solution would be to just throw a better error; this looks to me like a bug in OpenMPI. If it is actually shared memory, it shouldn't be constrained by the file size limit.

saurabh
[OMPI users] Possible to exclude a hwloc_base_binding_policy?
Hi, Switching to OpenMPI 3, I was getting error messages of the form "No objects of the specified type were found on at least one node: Type: NUMANode ... ORTE has lost communication with a remote daemon. ..." After some research, I found that hwloc_base_binding_policy (for np > 2) switched to numa in OpenMPI v3, from socket in v2. This is seen from "ompi_info --param all all --level 9". I've verified that the switch to numa is causing the failures; if I set it to socket, it works. My question is: how can I set the variable in openmpi-mca-params.conf to exclude numa, i.e., use whatever its rules are, except numa? I tried "hwloc_base_binding_policy = ^numa" (similar to, say, "btl = ^sm"), but this didn't work. Is what I want possible, or should I live with the socket policy for all cases? Thank you. saurabh
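The workaround I have in place for now, for reference; this pins the policy to socket for every run rather than excluding only numa, and the file location assumes a standard install under $prefix/etc:

    # $prefix/etc/openmpi-mca-params.conf
    hwloc_base_binding_policy = socket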
[OMPI users] Memory leak with pmix_finalize not being called
This is with valgrind 3.0.1 on a CentOS 6 system. It appears pmix_finalize isn't called, and valgrind reports leaks despite the provided suppression file being used. A cursory check reveals that MPI_Finalize calls pmix_rte_finalize, which decrements pmix_initialized to 0 before calling pmix_cleanup; pmix_cleanup then sees the variable is 0 and does not call pmix_finalize. See https://github.com/pmix/pmix/blob/master/src/runtime/pmix_finalize.c. Is this a bug or something I am doing wrong? Thank you.
[OMPI users] Re: Avoiding localhost as rank 0 with openmpi-default-hostfile
I asked this before but did not receive a reply. Now with OpenMPI 5, I tried doing this with prte-default-hostfile and rmaps_default_mapping_policy = node:OVERSUBSCRIBE, but I still get the same behavior: OpenMPI always wants rank 0 to be localhost. Is there a way to override this and assign ranks by the machine order in the hostfile? (My original question follows below.) Thanks.
[OMPI users] Avoiding localhost as rank 0 with openmpi-default-hostfile
My openmpi-default-hostfile has

host1 slots=4
host2 slots=4
host0 slots=4

and my openmpi-mca-params.conf has

rmaps_base_mapping_policy = node
rmaps_base_oversubscribe = 1

If I invoke orterun -np 3 on host0, it puts rank 0 on host0, rank 1 on host1, and rank 2 on host2. I want it to put rank 0 on host1, rank 1 on host2, and rank 2 on host0 (i.e., in the order specified in the hostfile). I cannot use nolocal because I do want it to run on host0. How can localhost being rank 0 be avoided without using -H?

Thanks, saurabh
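One idea that may be worth trying, though it is not something confirmed in this thread: Open MPI's sequential mapper, which assigns ranks in hostfile order, one process per hostfile line (so the hostfile for this mode would need one line per rank):

    # seq-hostfile: one node per rank, in the desired rank order
    host1
    host2
    host0

    # launch using the sequential mapping policy
    orterun --map-by seq --hostfile seq-hostfile -np 3 ./my_app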