[OMPI users] How to override default hostfile to specify host

2011-10-28 Thread Saurabh T

Hi,

If I use "orterun -H <host>" and <host> does not belong in the default 
hostfile ("etc/openmpi-default-hostfile"), Open MPI gives an error. Is there an 
easy way to get the aforementioned command to work without specifying a 
different hostfile with <host> in it? Thank you.
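For context, a sketch of the situation being described (host names and install prefix are hypothetical, and the error behavior is as reported above, not re-verified here):

```shell
# node5 is not listed in the default hostfile, so -H alone errors out:
cat /opt/openmpi/etc/openmpi-default-hostfile
#   node1 slots=4
#   node2 slots=4
orterun -H node5 -np 2 ./a.out

# The workaround the poster wants to avoid: a one-off hostfile listing the host.
echo "node5 slots=2" > myhosts
orterun --hostfile myhosts -np 2 ./a.out
```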



[OMPI users] OpenMPI w valgrind: need to recompile?

2010-01-06 Thread Saurabh T

Hi,

I am building libraries against OpenMPI, and then applications using those 
libraries. 

It was unclear from the FAQ at 
http://www.open-mpi.org/faq/?category=debugging#memchecker_how whether the 
libraries need to be recompiled and the application relinked using 
valgrind-enabled mpicc etc, in order to get valgrind to work. In other words, 
can I run a valgrind-disabled openmpi app with a valgrind-enabled orterun, or 
do I have to recompile/relink the whole thing? Is the answer different for 
shared vs static openmpi libraries?

The FAQ also states that Open MPI from v1.5 provides a valgrind suppression 
file. Is this a mistake in the FAQ, or is the suppression file simply not available 
with the latest stable release (1.4)? If the latter, can the 1.5 file be used with 1.4?
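For reference, in the releases that ship it, the suppression file lives at share/openmpi/openmpi-valgrind.supp under the install prefix. A hedged sketch of how it would be passed to valgrind (prefix and application name assumed):

```shell
# Run each rank under valgrind with the shipped suppression file.
orterun -np 2 \
  valgrind --suppressions=/opt/openmpi/share/openmpi/openmpi-valgrind.supp \
  ./app
```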

Thanks,
saurabh
  

[OMPI users] Problems running 1.8.8 and compiling 1.10.1 on Redhat EL7

2015-11-06 Thread Saurabh T
Hi,

On Redhat Enterprise Linux 7, I am facing the following problems.

1. With OpenMPI 1.8.8, everything builds, but the following error appears on 
running:
orterun -np 2 hello_cxx
hello_cxx: route/tc.c:973: rtnl_tc_register: Assertion `0' failed.
hello_cxx: route/tc.c:973: rtnl_tc_register: Assertion `0' failed.
--
orterun noticed that process rank 0 with PID 18229 on node sim18 exited on 
signal 6 (Aborted).
--

2. With OpenMPI 1.10.1, there is a failure to compile oshmem_info:
/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o): undefined 
reference to symbol '_end'
/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 so try 
adding it to the linker command line
/lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make[2]: *** [oshmem_info] Error 1

Is RHEL 7 not supported by Open MPI?

saurabh
  

Re: [OMPI users] Problems running 1.8.8 and compiling 1.10.1 on Redhat EL7

2015-11-06 Thread Saurabh T
> From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
> Date: 2015-11-06 18:02:42
>
> Both of these seem to be issues with libnl, which is a dependent library
> that Open MPI uses.

Based on your email, I found this message and thread:
https://www.open-mpi.org/community/lists/devel/2015/08/17812.php
which says the problem is with a conflict between libnl and libnl3, and gives a 
workaround, i.e. to use
--without-verbs
during configure. Both my cases work with this option. Thank you.
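The workaround from the linked thread, as a build fragment (source tree and prefix assumed):

```shell
# Disable the verbs (InfiniBand) support that pulls in the conflicting libnl.
./configure --prefix=/opt/openmpi --without-verbs
make -j4 && make install
```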

Sorry for the copy paste, I did not enable mail delivery.

saurabh




[OMPI users] Propagate current shell's environment

2015-11-09 Thread Saurabh T
Hi,

Is there any way with OpenMPI to propagate the current shell's environment to 
the parallel program? I am looking for an equivalent way to how MPICH handles 
environment variables 
(https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_How_do_I_pass_environment_variables_to_the_processes_of_my_parallel_program_when_using_the_mpd.2C_hydra_or_gforker_process_manager.3F):

> By default, all the environment variables in the shell where mpiexec is 
run are passed to all processes of the application program.

OpenMPI has the parallel processes read bashrc so the environment can be 
different for different processes, which is exactly what I want to avoid. I 
could not find any way of doing this in orterun --help or on the forums.

Thank you.

saurabh
  

Re: [OMPI users] Propagate current shell's environment

2015-11-09 Thread Saurabh T
I meant different from the current shell, not different for different 
processes, sorry.
Also, I am aware of -x, but it's not the right solution in this case because (a) 
it's manual, and (b) it appears that anything set in bashrc that was unset in the 
shell would still be set for the program, which I do not want.
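Objection (a) could in principle be scripted. A hedged sketch of a wrapper idea (not an Open MPI feature): build an orterun argument list that forwards every variable from the current environment via repeated -x flags, relying on orterun's documented behavior that `-x NAME` exports the variable's current value.

```python
import os

def forward_env_cmd(np, prog, env=None):
    """Build an orterun argument list that forwards each variable in
    `env` (default: the current environment) with its own -x flag."""
    env = dict(os.environ) if env is None else env
    args = ["orterun", "-np", str(np)]
    for name in sorted(env):
        args += ["-x", name]   # orterun exports the named variable's value
    args.append(prog)
    return args

# Example with a tiny fake environment:
cmd = forward_env_cmd(3, "./hello", env={"PATH": "/bin", "FOO": "1"})
print(" ".join(cmd))   # orterun -np 3 -x FOO -x PATH ./hello
```

Note this does nothing about objection (b), and a real shell may hold enough variables that orterun warns about the number of -x flags, so treat it only as a sketch.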


  

Re: [OMPI users] Propagate current shell's environment

2015-11-13 Thread Saurabh T
I'd appreciate a response, even a simple no if this is not possible. Thank you.
saurabh








  

[OMPI users] OpenMPI 1.10.1 crashes with file size limit <= 131072

2015-11-19 Thread Saurabh T
Here's what I find:

> cd examples
> make hello_cxx
> ulimit -f 131073

> orterun -np 3 hello_cxx
Hello, world!
[Etc]

> ulimit -f 131072

> orterun -np 3 hello_cxx

  

[OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072

2015-11-19 Thread Saurabh T
Hi,

Sorry my previous email was garbled, sending it again.

> cd examples
> make hello_cxx

> ulimit -f 131073
> orterun -np 3 hello_cxx
Hello, world
(etc)

> ulimit -f 131072
> orterun -np 3 hello_cxx
--
orterun noticed that process rank 0 with PID 4473 on node sim16 exited on 
signal 25 (File size limit exceeded).
--

Any thoughts? 


  

Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072

2015-11-19 Thread Saurabh T
An "strace" showed something related to shared memory use was causing the 
signal. Sticking

btl = ^sm

into the openmpi-mca-params.conf file fixed this issue.
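For anyone testing the same workaround without editing the config file, the equivalent one-off form on the command line would be:

```shell
# Disable the sm (shared-memory) BTL for a single run instead of
# via openmpi-mca-params.conf.
orterun --mca btl ^sm -np 3 hello_cxx
```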

saurabh

  

Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072

2015-11-19 Thread Saurabh T
> Could you please provide a little more info regarding the environment you
> are running under (which resource mgr or not, etc), how many nodes you had
> in the allocation, etc?
>
> There is no reason why something should behave that way. So it would help
> if we could understand the setup.
>
> Ralph


To answer Ralph's above question on the other thread: all the processes run on the 
same machine orterun was run on. It's a Red Hat 7, 64-bit, gcc 4.8 install of 
Open MPI 1.10.1. The only atypical thing is that
btl_tcp_if_exclude = virbr0
has been added to openmpi-mca-params.conf, based on some failures I was seeing 
before.
(And now of course I've added btl = ^sm as well to fix this issue; see my other 
response.)

Relevant output from strace (without the btl = ^sm) is below. Text in square 
brackets marks my minor edits and snips.

open("/tmp/openmpi-sessions-[user]@[host]_0/40072/1/1/vader_segment.[host].1", 
O_RDWR|O_CREAT, 0600) = 12
ftruncate(12, 4194312)  = 0
mmap(NULL, 4194312, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fe506c8a000
close(12)   = 0
write(9, "\1\0\0\0\0\0\0\0", 8) = 8
[...]
poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = -1 EFBIG (File too large)
--- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=12329, si_uid=1005} ---
--

  

Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072

2015-11-19 Thread Saurabh T
I apologize, I had the wrong lines from strace for the initial file there (of 
course). The file with fd = 11 which causes the problem is called 
shared_mem_pool.[host], and ftruncate(11, 134217736) is called on it. (This is 
8 bytes over 1024 times the ulimit of 131072, i.e. just past the limit, which 
makes sense as the ulimit is in 1K blocks.)
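A quick arithmetic check on the sizes above (assuming, as bash documents, that ulimit -f counts in 1024-byte blocks):

```python
ULIMIT_BLOCKS = 131072          # ulimit -f value, in 1K blocks
FTRUNCATE_BYTES = 134217736     # size passed to ftruncate(11, ...)

limit_bytes = ULIMIT_BLOCKS * 1024
print(limit_bytes)                      # 134217728
print(FTRUNCATE_BYTES - limit_bytes)    # 8 bytes past the limit -> SIGXFSZ
```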


  

Re: [OMPI users] Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072

2015-11-20 Thread Saurabh T

> For what it's worth, that's Open MPI creating a chunk of shared memory
> for use with on-server communication. It shows up as a "file", but it's
> really shared memory.
>
> You can disable sm and/or vader, but your on-server message passing
> performance will be significantly lower.
>
> Is there a reason you have a file size limit?

The file size limit is there so our testing does not write runaway large files. I'm 
not satisfied that the solution would be to just throw a better error; this looks 
to me like a bug in Open MPI. If it is actually shared memory, it shouldn't be 
constrained by the file size limit.

saurabh

  

[OMPI users] Possible to exclude a hwloc_base_binding_policy?

2018-04-20 Thread Saurabh T
Hi,
Switching to OpenMPI 3, I was getting error messages of the form 
"No objects of the specified type were found on at least one node:
Type: NUMANode
...
ORTE has lost communication with a remote daemon.
..."

After some research, I found that the default hwloc_base_binding_policy (for np > 2) 
switched from socket in Open MPI v2 to numa in v3. This can be seen from 
"ompi_info --param all all --level 9". I've verified that the switch to numa is 
causing the failures; if I set the policy to socket, it works.

My question is: how can I set the variable in openmpi-mca-params.conf to 
exclude numa, i.e. use whatever its rules are, except numa? I tried 
"hwloc_base_binding_policy = ^numa" (similar to, say, "btl = ^sm"), but this 
didn't work. Is what I want possible, or should I live with the socket policy for 
all cases?
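The fallback the poster settled on, as a config fragment (the variable name is from ompi_info; as noted above, the ^numa exclusion syntax is not accepted for this variable):

```conf
# $prefix/etc/openmpi-mca-params.conf
# Pin the binding policy back to the v2 default instead of numa.
hwloc_base_binding_policy = socket
```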

Thank you.
saurabh
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] Memory leak with pmix_finalize not being called

2018-05-04 Thread Saurabh T
This is with valgrind 3.0.1 on a CentOS 6 system. It appears pmix_finalize isn't 
called, and valgrind reports leaks despite the provided suppression 
file being used. A cursory check reveals that MPI_Finalize calls pmix_rte_finalize, 
which decrements pmix_initialized to 0 before calling pmix_cleanup; 
pmix_cleanup sees the variable is 0 and therefore does not call pmix_finalize. See 
https://github.com/pmix/pmix/blob/master/src/runtime/pmix_finalize.c. Is this 
a bug or something I am doing wrong? Thank you.



[OMPI users] Re: Avoiding localhost as rank 0 with openmpi-default-hostfile

2025-02-27 Thread Saurabh T
I asked this before but did not receive a reply. Now with Open MPI 5, I tried 
doing this with prte-default-hostfile and rmaps_default_mapping_policy = 
node:OVERSUBSCRIBE, but I still get the same behavior: Open MPI always wants rank 
0 to be on localhost. Is there a way to override this and assign ranks by machine 
order in the hostfile? Thanks.
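One avenue worth checking (an assumption on my part, not something confirmed in this thread): a rankfile pins specific ranks to specific hosts regardless of the mapping policy. A sketch for the three-host layout described below, with the syntax from the mpirun(1) man page of the 1.x/4.x series; Open MPI 5 routes mapping through PRRTE, so the exact option spelling should be checked against that release's documentation:

```conf
# myrankfile: force rank 0 onto host1 even when launching from host0
rank 0=host1 slot=0
rank 1=host2 slot=0
rank 2=host0 slot=0
```

It would be launched with something like `orterun -np 3 --rankfile myrankfile ./app`.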


To unsubscribe from this group and stop receiving emails from it, send an email 
to users+unsubscr...@lists.open-mpi.org.


[OMPI users] Avoiding localhost as rank 0 with openmpi-default-hostfile

2023-11-06 Thread Saurabh T via users
My openmpi-default-hostfile has
host1 slots=4
host2 slots=4
host0 slots=4

and my openmpi-mca-params.conf has
rmaps_base_mapping_policy = node
rmaps_base_oversubscribe = 1

If I invoke orterun -np 3 on host0, it puts rank0 on host0, rank1 on host1, 
rank2 on host2. I want it to put rank0 on host1, rank1 on host2, rank2 on host0 
(as specified in the host file). I cannot use nolocal because I do want it to 
run on host0.

How can localhost being rank0 be avoided without using -H?

Thanks,
saurabh