Hi!
Thank you, Jeff. I was able to run an OpenFOAM (OF) case by setting the absolute 
path name for mpiexec. But when I wanted to run a coupled case, in which OF is 
coupled with dummyCSM through EMPIRE, using these three command lines:

mpiexec -np 1 Emperor emperorInput.xml
mpiexec -np 1 dummyCSM dummyCSMInput
mpiexec -np 1 pimpleDyMFoam -case OF

I still found that OF was not able to connect. The user guide of EMPIRE says that 
Emperor (the server) has to recognize the clients, which are dummyCSM and 
OpenFOAM. For some reason Emperor recognizes dummyCSM but is not able to 
recognize OpenFOAM.
What can cause a server not to recognize a client?
Regards,
Islem

    On Wednesday, June 1, 2016 at 5:00 PM, "users-requ...@open-mpi.org" 
<users-requ...@open-mpi.org> wrote:
 

 Send users mailing list submissions to
    us...@open-mpi.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://www.open-mpi.org/mailman/listinfo.cgi/users
or, via email, send a message with subject or body 'help' to
    users-requ...@open-mpi.org

You can reach the person managing the list at
    users-ow...@open-mpi.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."


Today's Topics:

  1. Re: Firewall settings for MPI communication
      (Jeff Squyres (jsquyres))
  2. Re: users Digest, Vol 3514, Issue 1 (Jeff Squyres (jsquyres))


----------------------------------------------------------------------

Message: 1
Date: Wed, 1 Jun 2016 13:02:22 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: "Open MPI User's List" <us...@open-mpi.org>
Subject: Re: [OMPI users] Firewall settings for MPI communication
Message-ID: <ae97b273-c16b-4bc3-bc30-924d25697...@cisco.com>
Content-Type: text/plain; charset="utf-8"

In addition, you might want to consider upgrading to Open MPI v1.10.x (v1.6.x 
is fairly ancient).

> On Jun 1, 2016, at 7:46 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Which network are your VMs using for communication?
> If this is TCP, then you also have to specify a restricted set of allowed
> ports for the tcp BTL.
> 
> that would be something like
> mpirun --mca btl_tcp_dynamic_ports 49990-50010 ...
> 
> please double check the Open MPI 1.6.5 parameter and syntax with
> ompi_info --all
> (or check the archives, I think I posted the correct command line a few weeks 
> ago)
> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, June 1, 2016, Ping Wang <ping.w...@asc-s.de> wrote:
> I'm using Open MPI 1.6.5 to run OpenFOAM in parallel on several VMs in a 
> cloud. mpirun hangs without any error messages. I think this is a firewall 
> issue, because when I open all the TCP ports (1-65535) in the security group 
> of the VMs, mpirun works well. However, I was advised to open as few ports as 
> possible, so I have to limit MPI to a range of ports. I opened the 
> port range 49990-50010 for MPI communication and use the command
> 
>  
> 
> mpirun --mca oob_tcp_dynamic_ports 49990-50010 -np 4 --hostfile machines 
> simpleFoam -parallel. 
> 
>  
> 
> But it still hangs. How can I specify a port range that OpenMPI will use? I 
> appreciate any help you can provide.
> 
>  
> 
> Best,
> 
> Ping Wang
> 
>  
> 
> <image001.png>
> 
> ------------------------------------------------------
> 
> Ping Wang
> 
> Automotive Simulation Center Stuttgart e.V.
> 
> Nobelstraße 15
> 
> D-70569 Stuttgart
> 
> Telefon: +49 711 699659-14
> 
> Fax: +49 711 699659-29
> 
> E-Mail: ping.w...@asc-s.de
> 
> Web: http://www.asc-s.de
> 
> Social Media: <image002.gif>/asc.stuttgart
> 
> ------------------------------------------------------
> 
>  
> 
>  
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29340.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


------------------------------

Message: 2
Date: Wed, 1 Jun 2016 13:51:55 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Megdich Islem <megdich_is...@yahoo.fr>, "Open MPI User's List"
    <us...@open-mpi.org>
Subject: Re: [OMPI users] users Digest, Vol 3514, Issue 1
Message-ID: <c44110cb-7576-4272-bc53-9c274ef1e...@cisco.com>
Content-Type: text/plain; charset="utf-8"

The example you list below has all MPICH paths -- I don't see any Open MPI 
setups in there.

What I was suggesting was that if you absolutely need to have both Open MPI and 
MPICH installed and in your PATH / LD_LIBRARY_PATH / MANPATH, then you can use 
the full, absolute path name to each of the Open MPI executables -- e.g., 
/path/to/openmpi/install/bin/mpicc, etc.  That way, you can use Open MPI's 
mpicc without having it in your path.

Additionally, per https://www.open-mpi.org/faq/?category=running#mpirun-prefix, 
if you specify the absolute path name to mpirun (or mpiexec -- they're 
identical in Open MPI) and you're using the rsh/ssh launcher in Open MPI, then 
Open MPI will set the right PATH / LD_LIBRARY_PATH on remote servers for you.  
See the FAQ link for more detail.



> On Jun 1, 2016, at 8:41 AM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> 
> Hi!
> 
> Thank you Jeff for your suggestion. But I am still not able to understand 
> what you mean by using absolute path names for 
> mpicc/mpifort/mpirun/mpiexec.
> 
> This is what my .bashrc looks like:
> 
> source /opt/openfoam30/etc/bashrc
> 
> export PATH=/home/Desktop/mpich/bin:$PATH
> export LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib/:$LD_LIBRARY_PATH"
> export MPICH_F90=gfortran
> export MPICH_CC=/opt/intel/bin/icc
> export MPICH_CXX=/opt/intel/bin/icpc
> export MPICH_LINK_CXX="-L/home/Desktop/mpich/lib/ -Wl,-rpath 
> -Wl,/home/islem/Desktop/mpich/lib -lmpichcxx -lmpich -lopa -lmpl -lrt 
> -lpthread"
> 
> export PATH=$PATH:/opt/intel/bin/
> LD_LIBRARY_PATH="/opt/intel/lib/intel64:$LD_LIBRARY_PATH"
> export LD_LIBRARY_PATH
> source 
> /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/mpivars.sh 
> intel64
> 
> alias startEMPIRE=". /home/islem/software/empire/EMPIRE-Core/etc/bashrc.sh 
> ICC"
> 
> mpirun --version gives mpich 3.0.4
> 
> This is how I run one example that couples two clients through the EMPIRE 
> server (Emperor).
> I use three terminals, and in each I run one of these command lines:
> 
> mpiexec -np 1 Emperor emperorInput.xml   (I get a message in the terminal 
> saying that Empire started)
> 
> mpiexec -np 1 dummyCSM dummyCSMInput   (I get a message that Emperor 
> acknowledged the connection)
> mpiexec -np 1 pimpleDyMFoam -case OF   (I get no message in the terminal, 
> which suggests there is no connection)
> 
> How can I use mpirun in this way, and where do I need to write any modifications?
> 
> Regards,
> Islem
> 
> 
> On Friday, May 27, 2016 at 5:00 PM, "users-requ...@open-mpi.org" 
> <users-requ...@open-mpi.org> wrote:
> 
> 
> Send users mailing list submissions to
>    us...@open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>    https://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>    users-requ...@open-mpi.org
> 
> You can reach the person managing the list at
>    users-ow...@open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
> 
> 
> Today's Topics:
> 
>  1. Re: users Digest, Vol 3510, Issue 2 (Jeff Squyres (jsquyres))
>  2. Re: segmentation fault for slot-list and openmpi-1.10.3rc2
>      (Siegmar Gross)
>  3. OpenMPI virtualization aware (Marco D'Amico)
>  4. Re: OpenMPI virtualization aware (Ralph Castain)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 26 May 2016 23:28:17 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Megdich Islem <megdich_is...@yahoo.fr>, "Open MPI User's List"
>    <us...@open-mpi.org>
> Cc: Dave Love <d.l...@liverpool.ac.uk>
> Subject: Re: [OMPI users] users Digest, Vol 3510, Issue 2
> Message-ID: <441f803d-fdbb-443d-82aa-74ff3845a...@cisco.com>
> Content-Type: text/plain; charset="utf-8"
> 
> You're still intermingling your Open MPI and MPICH installations.
> 
> You need to ensure to use the wrapper compilers and mpirun/mpiexec from the 
> same MPI implementation.
> 
> For example, if you use mpicc/mpifort from Open MPI to build your program, 
> then you must use Open MPI's mpirun/mpiexec.
> 
> If you absolutely need to have both MPI implementations in your PATH / 
> LD_LIBRARY_PATH, you might want to use absolute path names for 
> mpicc/mpifort/mpirun/mpiexec.
> 
> 
> 
> > On May 26, 2016, at 3:46 PM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> > 
> > Thank you all for your suggestions !!
> > 
> > I found an answer to a similar case in Open MPI FAQ (Question 15)
> > (FAQ: Running MPI jobs -- www.open-mpi.org)
> > which suggests to use mpirun's  prefix command line option or to use the 
> > mpirun wrapper.
> > 
> > I modified my command to the following:
> > mpirun --prefix /opt/openfoam30/platforms/linux64GccDPInt32Opt/lib/Openmpi-system -np 1 pimpleDyMFoam -case OF
> > 
> > But I got an error (see the attached picture). Is the syntax correct? How can 
> > I solve the problem? That first method seems to be easier than using the 
> > mpirun wrapper.
> > 
> > Otherwise, how can I use the mpirun wrapper?
> > 
> > Regards,
> > islem
> > 
> > 
> > On Wednesday, May 25, 2016 at 4:40 PM, Dave Love <d.l...@liverpool.ac.uk> wrote:
> > 
> > 
> > I wrote:
> > 
> > 
> > > You could wrap one (set of) program(s) in a script to set the
> > > appropriate environment before invoking the real program. 
> > 
> > 
> > I realize I should have said something like "program invocations",
> > i.e. if you have no control over something invoking mpirun for programs
> > using different MPIs, then an mpirun wrapper needs to check what it's
> > being asked to run.
> > 
> > 
> > 
> > <mpirun-error.png><path-to-open-mpi.png>_______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2016/05/29317.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 27 May 2016 08:16:41 +0200
> From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] segmentation fault for slot-list and
>    openmpi-1.10.3rc2
> Message-ID:
>    <f5653a5c-174f-4569-c730-082a9db82...@informatik.hs-fulda.de>
> Content-Type: text/plain; charset=windows-1252; format=flowed
> 
> Hi Ralph,
> 
> 
> On 26.05.2016 at 17:38, Ralph Castain wrote:
> > I'm afraid I honestly can't make any sense of it. It seems
> > you at least have a simple workaround (use a hostfile instead
> > of -host), yes?
> 
> Only the combination "--host" and "--slot-list" breaks.
> Everything else works as expected. One more remark: As you
> can see below, this combination worked using gdb and "next"
> after the breakpoint. The process blocks, if I keep the
> enter-key pressed down and I have to kill simple_spawn in
> another window to get control back in gdb (<Ctrl-c> or
> anything else didn't work). I got this error yesterday
> evening.
> 
> ...
> (gdb)
> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbc0c)
>    at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> 738        if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> (gdb)
> 744        if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> (gdb)
> 750        if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> (gdb)
> 758        if (OMPI_SUCCESS != (ret = ompi_proc_complete_init())) {
> (gdb)
> 764        ret = MCA_PML_CALL(enable(true));
> (gdb)
> 765        if( OMPI_SUCCESS != ret ) {
> (gdb)
> 771        if (NULL == (procs = ompi_proc_world(&nprocs))) {
> (gdb)
> 775        ret = MCA_PML_CALL(add_procs(procs, nprocs));
> (gdb)
> 776        free(procs);
> (gdb)
> 780        if (OMPI_ERR_UNREACH == ret) {
> (gdb)
> 785        } else if (OMPI_SUCCESS != ret) {
> (gdb)
> 790        MCA_PML_CALL(add_comm(&ompi_mpi_comm_world.comm));
> (gdb)
> 791        MCA_PML_CALL(add_comm(&ompi_mpi_comm_self.comm));
> (gdb)
> 796        if (ompi_mpi_show_mca_params) {
> (gdb)
> 803        ompi_rte_wait_for_debugger();
> (gdb)
> 807        if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> (gdb)
> 817        coll = OBJ_NEW(ompi_rte_collective_t);
> (gdb)
> 818        coll->id = ompi_process_info.peer_init_barrier;
> (gdb)
> 819        coll->active = true;
> (gdb)
> 820        if (OMPI_SUCCESS != (ret = ompi_rte_barrier(coll))) {
> (gdb)
> 825        OMPI_WAIT_FOR_COMPLETION(coll->active);
> (gdb)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Program received signal SIGTERM, Terminated.
> 0x00007ffff7a7acd0 in opal_progress@plt ()
>    from /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12
> (gdb)
> Single stepping until exit from function opal_progress@plt,
> which has no line number information.
> [Thread 0x7ffff491b700 (LWP 19602) exited]
> 
> Program terminated with signal SIGTERM, Terminated.
> The program no longer exists.
> (gdb)
> The program is not being run.
> (gdb)
> ...
> 
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> >> On May 26, 2016, at 5:48 AM, Siegmar Gross 
> >> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>
> >> Hi Ralph and Gilles,
> >>
> >> it's strange that the program works with "--host" and "--slot-list"
> >> in your environment and not in mine. I get the following output, if
> >> I run the program in gdb without a breakpoint.
> >>
> >>
> >> loki spawn 142 gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> >> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.9.1
> >> ...
> >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> (gdb) run
> >> Starting program: /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 
> >> --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> Detaching after fork from child process 18031.
> >> [pid 18031] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 18031] about to spawn!
> >> Detaching after fork from child process 18033.
> >> Detaching after fork from child process 18034.
> >> [pid 18033] starting up!
> >> [pid 18034] starting up!
> >> [loki:18034] *** Process received signal ***
> >> [loki:18034] Signal: Segmentation fault (11)
> >> ...
> >>
> >>
> >>
> >> I get a different output, if I run the program in gdb with
> >> a breakpoint.
> >>
> >> gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> (gbd) set follow-fork-mode child
> >> (gdb) break ompi_proc_self
> >> (gdb) run
> >> (gdb) next
> >>
> >> Repeating "next" very often results in the following output.
> >>
> >> ...
> >> Starting program: 
> >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [pid 13277] starting up!
> >> [New Thread 0x7ffff42ef700 (LWP 13289)]
> >>
> >> Breakpoint 1, ompi_proc_self (size=0x7fffffffc060)
> >>    at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> >> 413        ompi_proc_t **procs = (ompi_proc_t**) 
> >> malloc(sizeof(ompi_proc_t*));
> >> (gdb) n
> >> 414        if (NULL == procs) {
> >> (gdb)
> >> 423        OBJ_RETAIN(ompi_proc_local_proc);
> >> (gdb)
> >> 424        *procs = ompi_proc_local_proc;
> >> (gdb)
> >> 425        *size = 1;
> >> (gdb)
> >> 426        return procs;
> >> (gdb)
> >> 427    }
> >> (gdb)
> >> ompi_comm_init () at 
> >> ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> >> 138        group->grp_my_rank      = 0;
> >> (gdb)
> >> 139        group->grp_proc_count    = (int)size;
> >> ...
> >> 193        ompi_comm_reg_init();
> >> (gdb)
> >> 196        ompi_comm_request_init ();
> >> (gdb)
> >> 198        return OMPI_SUCCESS;
> >> (gdb)
> >> 199    }
> >> (gdb)
> >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffc21c)
> >>    at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> >> 738        if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> >> (gdb)
> >> 744        if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> >> (gdb)
> >> 750        if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> >> ...
> >> 988        ompi_mpi_initialized = true;
> >> (gdb)
> >> 991        if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> >> (gdb)
> >> 999        return MPI_SUCCESS;
> >> (gdb)
> >> 1000    }
> >> (gdb)
> >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> >> 94          if (MPI_SUCCESS != err) {
> >> (gdb)
> >> 104        return MPI_SUCCESS;
> >> (gdb)
> >> 105    }
> >> (gdb)
> >> 0x0000000000400d0c in main ()
> >> (gdb)
> >> Single stepping until exit from function main,
> >> which has no line number information.
> >> 0 completed MPI_Init
> >> Parent [pid 13277] about to spawn!
> >> [New process 13472]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> process 13472 is executing new program: 
> >> /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [New process 13474]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> process 13474 is executing new program: 
> >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> >> [pid 13475] starting up!
> >> [pid 13476] starting up!
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [pid 13474] starting up!
> >> [New Thread 0x7ffff491b700 (LWP 13480)]
> >> [Switching to Thread 0x7ffff7ff1740 (LWP 13474)]
> >>
> >> Breakpoint 1, ompi_proc_self (size=0x7fffffffba30)
> >>    at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> >> 413        ompi_proc_t **procs = (ompi_proc_t**) 
> >> malloc(sizeof(ompi_proc_t*));
> >> (gdb)
> >> 414        if (NULL == procs) {
> >> ...
> >> 426        return procs;
> >> (gdb)
> >> 427    }
> >> (gdb)
> >> ompi_comm_init () at 
> >> ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> >> 138        group->grp_my_rank      = 0;
> >> (gdb)
> >> 139        group->grp_proc_count    = (int)size;
> >> (gdb)
> >> 140        OMPI_GROUP_SET_INTRINSIC (group);
> >> ...
> >> 193        ompi_comm_reg_init();
> >> (gdb)
> >> 196        ompi_comm_request_init ();
> >> (gdb)
> >> 198        return OMPI_SUCCESS;
> >> (gdb)
> >> 199    }
> >> (gdb)
> >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbbec)
> >>    at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> >> 738        if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> >> (gdb)
> >> 744        if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> >> (gdb)
> >> 750        if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> >> ...
> >> 863        if (OMPI_SUCCESS != (ret = ompi_pubsub_base_select())) {
> >> (gdb)
> >> 869        if (OMPI_SUCCESS != (ret = 
> >> mca_base_framework_open(&ompi_dpm_base_framework, 0))) {
> >> (gdb)
> >> 873        if (OMPI_SUCCESS != (ret = ompi_dpm_base_select())) {
> >> (gdb)
> >> 884        if ( OMPI_SUCCESS !=
> >> (gdb)
> >> 894        if (OMPI_SUCCESS !=
> >> (gdb)
> >> 900        if (OMPI_SUCCESS !=
> >> (gdb)
> >> 911        if (OMPI_SUCCESS != (ret = ompi_dpm.dyn_init())) {
> >> (gdb)
> >> Parent done with spawn
> >> Parent sending message to child
> >> 2 completed MPI_Init
> >> Hello from the child 2 of 3 on host loki pid 13476
> >> 1 completed MPI_Init
> >> Hello from the child 1 of 3 on host loki pid 13475
> >> 921        if (OMPI_SUCCESS != (ret = ompi_cr_init())) {
> >> (gdb)
> >> 931        opal_progress_event_users_decrement();
> >> (gdb)
> >> 934        opal_progress_set_yield_when_idle(ompi_mpi_yield_when_idle);
> >> (gdb)
> >> 937        if (ompi_mpi_event_tick_rate >= 0) {
> >> (gdb)
> >> 946        if (OMPI_SUCCESS != (ret = ompi_mpiext_init())) {
> >> (gdb)
> >> 953        if (ret != OMPI_SUCCESS) {
> >> (gdb)
> >> 972        OBJ_CONSTRUCT(&ompi_registered_datareps, opal_list_t);
> >> (gdb)
> >> 977        OBJ_CONSTRUCT( &ompi_mpi_f90_integer_hashtable, 
> >> opal_hash_table_t);
> >> (gdb)
> >> 978        opal_hash_table_init(&ompi_mpi_f90_integer_hashtable, 16 /* why 
> >> not? */);
> >> (gdb)
> >> 980        OBJ_CONSTRUCT( &ompi_mpi_f90_real_hashtable, opal_hash_table_t);
> >> (gdb)
> >> 981        opal_hash_table_init(&ompi_mpi_f90_real_hashtable, 
> >> FLT_MAX_10_EXP);
> >> (gdb)
> >> 983        OBJ_CONSTRUCT( &ompi_mpi_f90_complex_hashtable, 
> >> opal_hash_table_t);
> >> (gdb)
> >> 984        opal_hash_table_init(&ompi_mpi_f90_complex_hashtable, 
> >> FLT_MAX_10_EXP);
> >> (gdb)
> >> 988        ompi_mpi_initialized = true;
> >> (gdb)
> >> 991        if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> >> (gdb)
> >> 999        return MPI_SUCCESS;
> >> (gdb)
> >> 1000    }
> >> (gdb)
> >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> >> 94          if (MPI_SUCCESS != err) {
> >> (gdb)
> >> 104        return MPI_SUCCESS;
> >> (gdb)
> >> 105    }
> >> (gdb)
> >> 0x0000000000400d0c in main ()
> >> (gdb)
> >> Single stepping until exit from function main,
> >> which has no line number information.
> >> 0 completed MPI_Init
> >> Hello from the child 0 of 3 on host loki pid 13474
> >>
> >> Child 2 disconnected
> >> Child 1 disconnected
> >> Child 0 received msg: 38
> >> Parent disconnected
> >> 13277: exiting
> >>
> >> Program received signal SIGTERM, Terminated.
> >> 0x0000000000400f0a in main ()
> >> (gdb)
> >> Single stepping until exit from function main,
> >> which has no line number information.
> >> [tcsetpgrp failed in terminal_inferior: No such process]
> >> [Thread 0x7ffff491b700 (LWP 13480) exited]
> >>
> >> Program terminated with signal SIGTERM, Terminated.
> >> The program no longer exists.
> >> (gdb)
> >> The program is not being run.
> >> (gdb)
> >> The program is not being run.
> >> (gdb) info break
> >> Num    Type          Disp Enb Address            What
> >> 1      breakpoint    keep y  0x00007ffff7aa35c7 in ompi_proc_self
> >>                                                  at 
> >>../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 inf 8, 7, 6, 5, 4, 3, 2, 1
> >>        breakpoint already hit 2 times
> >> (gdb) delete 1
> >> (gdb) r
> >> Starting program: 
> >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [pid 16708] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 16708] about to spawn!
> >> [New process 16720]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> process 16720 is executing new program: 
> >> /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [New process 16722]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> process 16722 is executing new program: 
> >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> >> [pid 16723] starting up!
> >> [pid 16724] starting up!
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [pid 16722] starting up!
> >> Parent done with spawn
> >> Parent sending message to child
> >> 1 completed MPI_Init
> >> Hello from the child 1 of 3 on host loki pid 16723
> >> 2 completed MPI_Init
> >> Hello from the child 2 of 3 on host loki pid 16724
> >> 0 completed MPI_Init
> >> Hello from the child 0 of 3 on host loki pid 16722
> >> Child 0 received msg: 38
> >> Child 0 disconnected
> >> Parent disconnected
> >> Child 1 disconnected
> >> Child 2 disconnected
> >> 16708: exiting
> >> 16724: exiting
> >> 16723: exiting
> >> [New Thread 0x7ffff491b700 (LWP 16729)]
> >>
> >> Program received signal SIGTERM, Terminated.
> >> [Switching to Thread 0x7ffff7ff1740 (LWP 16722)]
> >> __GI__dl_debug_state () at dl-debug.c:74
> >> 74      dl-debug.c: No such file or directory.
> >> (gdb) 
> >> --------------------------------------------------------------------------
> >> WARNING: A process refused to die despite all the efforts!
> >> This process may still be running and/or consuming resources.
> >>
> >> Host: loki
> >> PID:  16722
> >>
> >> --------------------------------------------------------------------------
> >>
> >>
> >> The following simple_spawn processes exist now.
> >>
> >> loki spawn 171 ps -aef | grep simple_spawn
> >> fd1026  11079 11053  0 14:00 pts/0    00:00:00 
> >> /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host loki --slot-list 
> >> 0:0-1,1:0-1 simple_spawn
> >> fd1026  11095 11079 29 14:01 pts/0    00:09:37 [simple_spawn] <defunct>
> >> fd1026  16722    1  0 14:31 ?        00:00:00 [simple_spawn] <defunct>
> >> fd1026  17271 29963  0 14:33 pts/2    00:00:00 grep simple_spawn
> >> loki spawn 172
> >>
> >>
> >> Is it possible that there is a race condition? How can I help
> >> to get a solution for my problem?
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >> On 24.05.2016 at 16:54, Ralph Castain wrote:
> >>> Works perfectly for me, so I believe this must be an environment issue - 
> >>> I am using gcc 6.0.0 on CentOS7 with x86:
> >>>
> >>> $ mpirun -n 1 -host bend001 --slot-list 0:0-1,1:0-1 --report-bindings 
> >>> ./simple_spawn
> >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> >>> [BB/BB/../../../..][BB/BB/../../../..]
> >>> [pid 17601] starting up!
> >>> 0 completed MPI_Init
> >>> Parent [pid 17601] about to spawn!
> >>> [pid 17603] starting up!
> >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> >>> [BB/BB/../../../..][BB/BB/../../../..]
> >>> [bend001:17599] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 
> >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> >>> [BB/BB/../../../..][BB/BB/../../../..]
> >>> [bend001:17599] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 
> >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> >>> [BB/BB/../../../..][BB/BB/../../../..]
> >>> [pid 17604] starting up!
> >>> [pid 17605] starting up!
> >>> Parent done with spawn
> >>> Parent sending message to child
> >>> 0 completed MPI_Init
> >>> Hello from the child 0 of 3 on host bend001 pid 17603
> >>> Child 0 received msg: 38
> >>> 1 completed MPI_Init
> >>> Hello from the child 1 of 3 on host bend001 pid 17604
> >>> 2 completed MPI_Init
> >>> Hello from the child 2 of 3 on host bend001 pid 17605
> >>> Child 0 disconnected
> >>> Child 2 disconnected
> >>> Parent disconnected
> >>> Child 1 disconnected
> >>> 17603: exiting
> >>> 17605: exiting
> >>> 17601: exiting
> >>> 17604: exiting
> >>> $
> >>>
> >>>> On May 24, 2016, at 7:18 AM, Siegmar Gross 
> >>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>>>
> >>>> Hi Ralph and Gilles,
> >>>>
> >>>> The program breaks only if I combine "--host" and "--slot-list". 
> >>>> Perhaps this
> >>>> information is helpful. I am using a different machine now, so that you can 
> >>>> see that
> >>>> the problem is not restricted to "loki".
> >>>>
> >>>>
> >>>> pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
> >>>> absolute:"
> >>>>    OPAL repo revision: v1.10.2-201-gd23dda8
> >>>>    C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >>>>
> >>>>
> >>>> pc03 spawn 116 uname -a
> >>>> Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 
> >>>> (4354e1d) x86_64 x86_64 x86_64 GNU/Linux
> >>>>
> >>>>
> >>>> pc03 spawn 117 cat host_pc03.openmpi
> >>>> pc03.informatik.hs-fulda.de slots=12 max_slots=12
> >>>>
> >>>>
> >>>> pc03 spawn 118 mpicc simple_spawn.c
> >>>>
> >>>>
> >>>> pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
> >>>> [pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
> >>>> [BB/../../../../..][../../../../../..]
> >>>> [pid 3713] starting up!
> >>>> 0 completed MPI_Init
> >>>> Parent [pid 3713] about to spawn!
> >>>> [pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 
> >>>> 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
> >>>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
> >>>> [../../../../../..][BB/BB/BB/BB/BB/BB]
> >>>> [pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
> >>>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
> >>>> [BB/BB/BB/BB/BB/BB][../../../../../..]
> >>>> [pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 
> >>>> 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
> >>>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
> >>>> [../../../../../..][BB/BB/BB/BB/BB/BB]
> >>>> [pid 3715] starting up!
> >>>> [pid 3716] starting up!
> >>>> [pid 3717] starting up!
> >>>> Parent done with spawn
> >>>> Parent sending message to child
> >>>> 0 completed MPI_Init
> >>>> Hello from the child 0 of 3 on host pc03 pid 3715
> >>>> 1 completed MPI_Init
> >>>> Hello from the child 1 of 3 on host pc03 pid 3716
> >>>> 2 completed MPI_Init
> >>>> Hello from the child 2 of 3 on host pc03 pid 3717
> >>>> Child 0 received msg: 38
> >>>> Child 0 disconnected
> >>>> Child 2 disconnected
> >>>> Parent disconnected
> >>>> Child 1 disconnected
> >>>> 3713: exiting
> >>>> 3715: exiting
> >>>> 3716: exiting
> >>>> 3717: exiting
> >>>>
> >>>>
> >>>> pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list 
> >>>> 0:0-1,1:0-1 --report-bindings a.out
> >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pid 3731] starting up!
> >>>> 0 completed MPI_Init
> >>>> Parent [pid 3731] about to spawn!
> >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pid 3733] starting up!
> >>>> [pid 3734] starting up!
> >>>> [pid 3735] starting up!
> >>>> Parent done with spawn
> >>>> Parent sending message to child
> >>>> 2 completed MPI_Init
> >>>> Hello from the child 2 of 3 on host pc03 pid 3735
> >>>> 1 completed MPI_Init
> >>>> Hello from the child 1 of 3 on host pc03 pid 3734
> >>>> 0 completed MPI_Init
> >>>> Hello from the child 0 of 3 on host pc03 pid 3733
> >>>> Child 0 received msg: 38
> >>>> Child 0 disconnected
> >>>> Child 2 disconnected
> >>>> Child 1 disconnected
> >>>> Parent disconnected
> >>>> 3731: exiting
> >>>> 3734: exiting
> >>>> 3733: exiting
> >>>> 3735: exiting
> >>>>
> >>>>
> >>>> pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1 
> >>>> --report-bindings a.out
> >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pid 3746] starting up!
> >>>> 0 completed MPI_Init
> >>>> Parent [pid 3746] about to spawn!
> >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 
> >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
> >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> >>>> [pid 3748] starting up!
> >>>> [pid 3749] starting up!
> >>>> [pc03:03749] *** Process received signal ***
> >>>> [pc03:03749] Signal: Segmentation fault (11)
> >>>> [pc03:03749] Signal code: Address not mapped (1)
> >>>> [pc03:03749] Failing at address: 0x8
> >>>> [pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870]
> >>>> [pc03:03749] [ 1] 
> >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0]
> >>>> [pc03:03749] [ 2] 
> >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08]
> >>>> [pc03:03749] [ 3] 
> >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a]
> >>>> [pc03:03749] [ 4] 
> >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae]
> >>>> [pc03:03749] [ 5] a.out[0x400d0c]
> >>>> [pc03:03749] [ 6] 
> >>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe6f0989b05]
> >>>> [pc03:03749] [ 7] a.out[0x400bf9]
> >>>> [pc03:03749] *** End of error message ***
> >>>> --------------------------------------------------------------------------
> >>>> mpiexec noticed that process rank 2 with PID 3749 on node pc03 exited on 
> >>>> signal 11 (Segmentation fault).
> >>>> --------------------------------------------------------------------------
> >>>> pc03 spawn 122
> >>>>
> >>>>
> >>>>
> >>>> Kind regards
> >>>>
> >>>> Siegmar
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 05/24/16 15:44, Ralph Castain wrote:
> >>>>>
> >>>>>> On May 24, 2016, at 6:21 AM, Siegmar Gross 
> >>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>>>>>
> >>>>>> Hi Ralph,
> >>>>>>
> >>>>>> I copy the relevant lines to this place, so that it is easier to see 
> >>>>>> what
> >>>>>> happens. "a.out" is your program, which I compiled with mpicc.
> >>>>>>
> >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C 
> >>>>>>>> compiler
> >>>>>>>> absolute:"
> >>>>>>>>    OPAL repo revision: v1.10.2-201-gd23dda8
> >>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >>>>>>>> loki spawn 154 mpicc simple_spawn.c
> >>>>>>
> >>>>>>>> loki spawn 155 mpiexec -np 1 a.out
> >>>>>>>> [pid 24008] starting up!
> >>>>>>>> 0 completed MPI_Init
> >>>>>> ...
> >>>>>>
> >>>>>> "mpiexec -np 1 a.out" works.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> I don't know what "a.out" is, but it looks like there is some memory
> >>>>>>> corruption there.
> >>>>>>
> >>>>>> "a.out" is still your program. I get the same error on different
> >>>>>> machines, so it is not very likely that the (hardware) memory
> >>>>>> is corrupted.
> >>>>>>
> >>>>>>
> >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> >>>>>>>> [pid 24102] starting up!
> >>>>>>>> 0 completed MPI_Init
> >>>>>>>> Parent [pid 24102] about to spawn!
> >>>>>>>> [pid 24104] starting up!
> >>>>>>>> [pid 24105] starting up!
> >>>>>>>> [loki:24105] *** Process received signal ***
> >>>>>>>> [loki:24105] Signal: Segmentation fault (11)
> >>>>>>>> [loki:24105] Signal code: Address not mapped (1)
> >>>>>> ...
> >>>>>>
> >>>>>> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a 
> >>>>>> segmentation
> >>>>>> fault. Can I do something so that you can find out what happens?
> >>>>>
> >>>>> I honestly have no idea - perhaps Gilles can help, as I have no access 
> >>>>> to that kind of environment. We aren't seeing such problems elsewhere, 
> >>>>> so it is likely something local.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Kind regards
> >>>>>>
> >>>>>> Siegmar
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 05/24/16 15:07, Ralph Castain wrote:
> >>>>>>>
> >>>>>>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
> >>>>>>>> <siegmar.gr...@informatik.hs-fulda.de
> >>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Ralph,
> >>>>>>>>
> >>>>>>>> thank you very much for your answer and your example program.
> >>>>>>>>
> >>>>>>>> On 05/23/16 17:45, Ralph Castain wrote:
> >>>>>>>>> I cannot replicate the problem - both scenarios work fine for me. 
> >>>>>>>>> I'm not
> >>>>>>>>> convinced your test code is correct, however, as you call Comm_free 
> >>>>>>>>> the
> >>>>>>>>> inter-communicator but didn't call Comm_disconnect. Check out the 
> >>>>>>>>> attached
> >>>>>>>>> for a correct code and see if it works for you.
> >>>>>>>>
> >>>>>>>> I thought that I only needed MPI_Comm_disconnect if I had 
> >>>>>>>> established a
> >>>>>>>> connection with MPI_Comm_connect before. The man page for 
> >>>>>>>> MPI_Comm_free states
> >>>>>>>>
> >>>>>>>> "This  operation marks the communicator object for deallocation. The
> >>>>>>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
> >>>>>>>> communicator will complete normally; the object is actually 
> >>>>>>>> deallocated only
> >>>>>>>> if there are no other active references to it.".
> >>>>>>>>
> >>>>>>>> The man page for MPI_Comm_disconnect states
> >>>>>>>>
> >>>>>>>> "MPI_Comm_disconnect waits for all pending communication on comm to 
> >>>>>>>> complete
> >>>>>>>> internally, deallocates the communicator object, and sets the handle 
> >>>>>>>> to
> >>>>>>>> MPI_COMM_NULL. It is  a  collective operation.".
> >>>>>>>>
> >>>>>>>> I don't see a difference for my spawned processes, because both 
> >>>>>>>> functions will
> >>>>>>>> "wait" until all pending operations have finished, before the object 
> >>>>>>>> will be
> >>>>>>>> destroyed. Nevertheless, perhaps my small example program worked all 
> >>>>>>>> the years
> >>>>>>>> by chance.
> >>>>>>>>
> >>>>>>>> However, I don't understand why my program works with
> >>>>>>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and 
> >>>>>>>> breaks with
> >>>>>>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". 
> >>>>>>>> You are right,
> >>>>>>>> my slot-list is equivalent to "-bind-to none". I could also have used
> >>>>>>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master" which works 
> >>>>>>>> as well.
> >>>>>>>
> >>>>>>> Well, you are only giving us one slot when you specify "-host loki", 
> >>>>>>> and then
> >>>>>>> you are trying to launch multiple processes into it. The "slot-list" 
> >>>>>>> option only
> >>>>>>> tells us what cpus to bind each process to - it doesn't allocate 
> >>>>>>> process slots.
> >>>>>>> So you have to tell us how many processes are allowed to run on this 
> >>>>>>> node.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The program breaks with "There are not enough slots available in the 
> >>>>>>>> system
> >>>>>>>> to satisfy ...", if I only use "--host loki" or different host names,
> >>>>>>>> without mentioning five host names, using "slot-list", or 
> >>>>>>>> "oversubscribe",
> >>>>>>>> Unfortunately "--host <host name>:<number of slots>" isn't available 
> >>>>>>>> for
> >>>>>>>> openmpi-1.10.3rc2 to specify the number of available slots.
> >>>>>>>
> >>>>>>> Correct - we did not backport the new syntax
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Your program behaves the same way as mine, so that 
> >>>>>>>> MPI_Comm_disconnect
> >>>>>>>> will not solve my problem. I had to modify your program in a 
> >>>>>>>> negligible way
> >>>>>>>> to get it compiled.
> >>>>>>>>
> >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C 
> >>>>>>>> compiler absolute:"
> >>>>>>>>  OPAL repo revision: v1.10.2-201-gd23dda8
> >>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >>>>>>>> loki spawn 154 mpicc simple_spawn.c
> >>>>>>>> loki spawn 155 mpiexec -np 1 a.out
> >>>>>>>> [pid 24008] starting up!
> >>>>>>>> 0 completed MPI_Init
> >>>>>>>> Parent [pid 24008] about to spawn!
> >>>>>>>> [pid 24010] starting up!
> >>>>>>>> [pid 24011] starting up!
> >>>>>>>> [pid 24012] starting up!
> >>>>>>>> Parent done with spawn
> >>>>>>>> Parent sending message to child
> >>>>>>>> 0 completed MPI_Init
> >>>>>>>> Hello from the child 0 of 3 on host loki pid 24010
> >>>>>>>> 1 completed MPI_Init
> >>>>>>>> Hello from the child 1 of 3 on host loki pid 24011
> >>>>>>>> 2 completed MPI_Init
> >>>>>>>> Hello from the child 2 of 3 on host loki pid 24012
> >>>>>>>> Child 0 received msg: 38
> >>>>>>>> Child 0 disconnected
> >>>>>>>> Child 1 disconnected
> >>>>>>>> Child 2 disconnected
> >>>>>>>> Parent disconnected
> >>>>>>>> 24012: exiting
> >>>>>>>> 24010: exiting
> >>>>>>>> 24008: exiting
> >>>>>>>> 24011: exiting
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Is something wrong with my command line? I didn't use slot-list 
> >>>>>>>> before, so
> >>>>>>>> I'm not sure if I am using it in the intended way.
> >>>>>>>
> >>>>>>> I don't know what "a.out" is, but it looks like there is some memory 
> >>>>>>> corruption
> >>>>>>> there.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> >>>>>>>> [pid 24102] starting up!
> >>>>>>>> 0 completed MPI_Init
> >>>>>>>> Parent [pid 24102] about to spawn!
> >>>>>>>> [pid 24104] starting up!
> >>>>>>>> [pid 24105] starting up!
> >>>>>>>> [loki:24105] *** Process received signal ***
> >>>>>>>> [loki:24105] Signal: Segmentation fault (11)
> >>>>>>>> [loki:24105] Signal code: Address not mapped (1)
> >>>>>>>> [loki:24105] Failing at address: 0x8
> >>>>>>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
> >>>>>>>> [loki:24105] [ 1]
> >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
> >>>>>>>> [loki:24105] [ 2]
> >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
> >>>>>>>> [loki:24105] [ 3] *** An error occurred in MPI_Init
> >>>>>>>> *** on a NULL communicator
> >>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
> >>>>>>>> abort,
> >>>>>>>> ***    and potentially your MPI job)
> >>>>>>>> [loki:24104] Local abort before MPI_INIT completed successfully; not 
> >>>>>>>> able to
> >>>>>>>> aggregate error messages, and not able to guarantee that all other 
> >>>>>>>> processes
> >>>>>>>> were killed!
> >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
> >>>>>>>> [loki:24105] [ 4]
> >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
> >>>>>>>> [loki:24105] [ 5] a.out[0x400d0c]
> >>>>>>>> [loki:24105] [ 6] 
> >>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
> >>>>>>>> [loki:24105] [ 7] a.out[0x400bf9]
> >>>>>>>> [loki:24105] *** End of error message ***
> >>>>>>>> -------------------------------------------------------
> >>>>>>>> Child job 2 terminated normally, but 1 process returned
> >>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
> >>>>>>>> -------------------------------------------------------
> >>>>>>>> --------------------------------------------------------------------------
> >>>>>>>> mpiexec detected that one or more processes exited with non-zero 
> >>>>>>>> status, thus
> >>>>>>>> causing
> >>>>>>>> the job to be terminated. The first process to do so was:
> >>>>>>>>
> >>>>>>>> Process name: [[49560,2],0]
> >>>>>>>> Exit code:    1
> >>>>>>>> --------------------------------------------------------------------------
> >>>>>>>> loki spawn 157
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hopefully you will find out what happens. Please let me know if I 
> >>>>>>>> can
> >>>>>>>> help you in any way.
> >>>>>>>>
> >>>>>>>> Kind regards
> >>>>>>>>
> >>>>>>>> Siegmar
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> FWIW: I don't know how many cores you have on your sockets, but if 
> >>>>>>>>> you
> >>>>>>>>> have 6 cores/socket, then your slot-list is equivalent to "--bind-to 
> >>>>>>>>> none"
> >>>>>>>>> as the slot-list applies to every process being launched
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross
> >>>>>>>>>> <siegmar.gr...@informatik.hs-fulda.de
> >>>>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
> >>>>>>>>>> 12 (x86_64)" with Sun C 5.13  and gcc-6.1.0. Unfortunately I get
> >>>>>>>>>> a segmentation fault for "--slot-list" for one of my small 
> >>>>>>>>>> programs.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C 
> >>>>>>>>>> compiler
> >>>>>>>>>> absolute:"
> >>>>>>>>>>  OPAL repo revision: v1.10.2-201-gd23dda8
> >>>>>>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki 
> >>>>>>>>>> spawn_master
> >>>>>>>>>>
> >>>>>>>>>> Parent process 0 running on loki
> >>>>>>>>>> I create 4 slave processes
> >>>>>>>>>>
> >>>>>>>>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
> >>>>>>>>>>              tasks in COMM_CHILD_PROCESSES local group:  1
> >>>>>>>>>>              tasks in COMM_CHILD_PROCESSES remote group: 4
> >>>>>>>>>>
> >>>>>>>>>> Slave process 0 of 4 running on loki
> >>>>>>>>>> Slave process 1 of 4 running on loki
> >>>>>>>>>> Slave process 2 of 4 running on loki
> >>>>>>>>>> spawn_slave 2: argv[0]: spawn_slave
> >>>>>>>>>> Slave process 3 of 4 running on loki
> >>>>>>>>>> spawn_slave 0: argv[0]: spawn_slave
> >>>>>>>>>> spawn_slave 1: argv[0]: spawn_slave
> >>>>>>>>>> spawn_slave 3: argv[0]: spawn_slave
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
> >>>>>>>>>> spawn_master
> >>>>>>>>>>
> >>>>>>>>>> Parent process 0 running on loki
> >>>>>>>>>> I create 4 slave processes
> >>>>>>>>>>
> >>>>>>>>>> [loki:17326] *** Process received signal ***
> >>>>>>>>>> [loki:17326] Signal: Segmentation fault (11)
> >>>>>>>>>> [loki:17326] Signal code: Address not mapped (1)
> >>>>>>>>>> [loki:17326] Failing at address: 0x8
> >>>>>>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
> >>>>>>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
> >>>>>>>>>> *** on a NULL communicator
> >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
> >>>>>>>>>> abort,
> >>>>>>>>>> ***    and potentially your MPI job)
> >>>>>>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; 
> >>>>>>>>>> not able to
> >>>>>>>>>> aggregate error messages, and not able to guarantee that all other 
> >>>>>>>>>> processes
> >>>>>>>>>> were killed!
> >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
> >>>>>>>>>> [loki:17326] [ 2]
> >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
> >>>>>>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
> >>>>>>>>>> *** on a NULL communicator
> >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
> >>>>>>>>>> abort,
> >>>>>>>>>> ***    and potentially your MPI job)
> >>>>>>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; 
> >>>>>>>>>> not able to
> >>>>>>>>>> aggregate error messages, and not able to guarantee that all other 
> >>>>>>>>>> processes
> >>>>>>>>>> were killed!
> >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
> >>>>>>>>>> [loki:17326] [ 4]
> >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
> >>>>>>>>>> [loki:17326] [ 5] spawn_slave[0x40097e]
> >>>>>>>>>> [loki:17326] [ 6] 
> >>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
> >>>>>>>>>> [loki:17326] [ 7] spawn_slave[0x400a54]
> >>>>>>>>>> [loki:17326] *** End of error message ***
> >>>>>>>>>> -------------------------------------------------------
> >>>>>>>>>> Child job 2 terminated normally, but 1 process returned
> >>>>>>>>>> a non-zero exit code.. Per user-direction, the job has been 
> >>>>>>>>>> aborted.
> >>>>>>>>>> -------------------------------------------------------
> >>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>> mpiexec detected that one or more processes exited with non-zero 
> >>>>>>>>>> status,
> >>>>>>>>>> thus causing
> >>>>>>>>>> the job to be terminated. The first process to do so was:
> >>>>>>>>>>
> >>>>>>>>>> Process name: [[56340,2],0]
> >>>>>>>>>> Exit code:    1
> >>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>> loki spawn 122
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I would be grateful if somebody could fix the problem. Thank you
> >>>>>>>>>> very much in advance for any help.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Kind regards
> >>>>>>>>>>
> >>>>>>>>>> Siegmar
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> users mailing list
> >>>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
> >>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>> Link to this post:
> >>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29281.php
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
> >>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>> Link to this
> >>>>>>>>> post: 
> >>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29284.php
> >>>>>>>>>
> >>>>>>>> <simple_spawn_modified.c>_______________________________________________
> >>>>>>>> users mailing list
> >>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
> >>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>> Link to this post: 
> >>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29300.php
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> users mailing list
> >>>>>>> us...@open-mpi.org
> >>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>> Link to this post: 
> >>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29301.php
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> us...@open-mpi.org
> >>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>> Link to this post: 
> >>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29304.php
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>> Link to this post: 
> >>>>> http://www.open-mpi.org/community/lists/users/2016/05/29307.php
> >>>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> us...@open-mpi.org
> >>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>> Link to this post: 
> >>>> http://www.open-mpi.org/community/lists/users/2016/05/29308.php
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >>> Link to this post: 
> >>> http://www.open-mpi.org/community/lists/users/2016/05/29309.php
> >>>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/users/2016/05/29315.php
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2016/05/29316.php
> >
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 27 May 2016 09:14:42 +0000
> From: "Marco D'Amico" <marco.damic...@gmail.com>
> To: us...@open-mpi.org
> Subject: [OMPI users] OpenMPI virtualization aware
> Message-ID:
>    <CABi-01XH+vdi2egBD=knen_cyxpecg0j-+3rtvnfnc6mtd+...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi, I am currently investigating virtualization in the HPC field, and I
> found out that MVAPICH has a "virtualization aware" version that makes it
> possible to overcome the large latency penalties of running HPC workloads
> in a virtualized environment.
> 
> My question is whether there is any similar effort in Open MPI, since I
> would eventually like to contribute to it.
> 
> Best regards,
> Marco D'Amico
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> Message: 4
> Date: Fri, 27 May 2016 06:45:05 -0700
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] OpenMPI virtualization aware
> Message-ID: <bbeb8e66-40b0-4688-8284-2113252e1...@open-mpi.org>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Marco
> 
> OMPI has integrated support for the Singularity container:
> 
> http://singularity.lbl.gov/index.html
> 
> https://groups.google.com/a/lbl.gov/forum/#!forum/singularity
> 
> It is in OMPI master now, and an early version is in 2.0 - the full 
> integration will be in 2.1. Singularity is undergoing changes for its 2.0 
> release (so we'll need to do some updating of the OMPI integration), and 
> there is still plenty that can be done to further optimize its integration - 
> so contributions would be welcome!
> 
> Ralph
> 
> 
> 
> > On May 27, 2016, at 2:14 AM, Marco D'Amico <marco.damic...@gmail.com> wrote:
> > 
> > Hi, I am currently investigating virtualization in the HPC field, and I 
> > found out that MVAPICH has a "virtualization aware" version that makes it 
> > possible to overcome the large latency penalties of running HPC workloads 
> > in a virtualized environment.
> > 
> > My question is whether there is any similar effort in Open MPI, since I 
> > would eventually like to contribute to it.
> > 
> > Best regards,
> > Marco D'Amico
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2016/05/29320.php
> 
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> https://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ------------------------------
> 
> End of users Digest, Vol 3514, Issue 1
> **************************************
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29341.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


------------------------------

Subject: Digest Footer

_______________________________________________
users mailing list
us...@open-mpi.org
https://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

End of users Digest, Vol 3518, Issue 2
**************************************

