That worked! 

i.e. with the changes you proposed, the code gives the right result. 

That was efficient work, thank you Gilles :) 

Best wishes, 
Peter 

----- Original Message -----

> Thanks Peter,

> That is quite unexpected ...

> Let's try another workaround. Can you replace

> integer            :: comm_group
> with

> integer            :: comm_group, comm_tmp

> and
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)

> with

> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
> if (irank < (num_procs/2)) then
>    comm_group = comm_tmp
> else
>    call MPI_Comm_dup(comm_tmp, comm_group, ierr)
> endif
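
For reference, a stripped-down sketch of how the suggested change fits
together (the names comm, irank, num_procs, comm_group, comm_tmp and ierr
are taken from the snippets above; the surrounding program is assumed,
this is not the original test code):

  ! Sketch only: shows where the comm_tmp / MPI_Comm_dup workaround
  ! goes relative to the split.
  program split_workaround
    use mpi
    implicit none
    integer :: comm, comm_group, comm_tmp, irank, num_procs, ierr

    call MPI_Init(ierr)
    comm = MPI_COMM_WORLD
    call MPI_Comm_rank(comm, irank, ierr)
    call MPI_Comm_size(comm, num_procs, ierr)

    ! Split into a lower-half and an upper-half group, but keep the
    ! result in a temporary communicator first ...
    call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)

    ! ... then let only the lower-half group use it directly; the
    ! upper-half group works on a duplicate, so the two groups hold
    ! distinct communicator objects.
    if (irank < (num_procs/2)) then
       comm_group = comm_tmp
    else
       call MPI_Comm_dup(comm_tmp, comm_group, ierr)
    endif

    ! ... MPI_Win_allocate_shared(..., comm_group, ...) as before ...

    call MPI_Finalize(ierr)
  end program split_workaround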

> If it works, I will make a fix tomorrow when I can access my workstation.
> If not, can you please run
> mpirun --mca osc_base_verbose 100 ...
> and post the output?

> I will then try to reproduce the issue and investigate it.

> Cheers,

> Gilles

> On Tuesday, February 2, 2016, Peter Wind < peter.w...@met.no > wrote:

> > Thanks Gilles,

> > I get the following output (I guess it is not what you wanted?).

> > Peter

> > $ mpirun --mca osc pt2pt -np 4 a.out
> > --------------------------------------------------------------------------
> > A requested component was not found, or was unable to be opened. This
> > means that this component is either not installed or is unable to be
> > used on your system (e.g., sometimes this means that shared libraries
> > that the component requires are unable to be found/loaded). Note that
> > Open MPI stopped checking at the first component that it did not find.

> > Host: stallo-2.local
> > Framework: osc
> > Component: pt2pt
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems. This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):

> > ompi_osc_base_open() failed
> > --> Returned "Not found" (-13) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38415] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38418] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38416] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38417] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > -------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun detected that one or more processes exited with non-zero status, thus causing
> > the job to be terminated. The first process to do so was:

> > Process name: [[52507,1],0]
> > Exit code: 1
> > --------------------------------------------------------------------------
> > [stallo-2.local:38410] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
> > [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > [stallo-2.local:38410] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure

> > > Peter,

> > > At first glance, your test program looks correct.

> > > Can you please try to run
> > > mpirun --mca osc pt2pt -np 4 ...

> > > I might have identified a bug with the sm osc component.

> > > Cheers,

> > > Gilles

> > > On Tuesday, February 2, 2016, Peter Wind < peter.w...@met.no > wrote:

> > > > Enclosed is a short (< 100 lines) fortran code example that uses shared memory.
> > > > It seems to me it behaves wrongly if openmpi is used.
> > > > Compiled with SGI/mpt, it gives the right result.

> > > > To fail, the code must be run on a single node.
> > > > It creates two groups of 2 processes each. Within each group memory is shared.
> > > > The error is that the two groups get the same memory allocated, but they should not.

> > > > Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel 14.0
> > > > all fail.

> > > > The call:
> > > > call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, comm_group, cp1, win, ierr)

> > > > Should allocate memory only within the group. But when the other group allocates memory,
> > > > the pointers from the two groups point to the same address in memory.

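A minimal sketch along the lines of that description (a reconstruction for
illustration, not the attached example; win_size, disp_unit, cp1, win and
comm_group follow the names used in the thread, everything else is assumed):

  program shared_groups_sketch
    use mpi
    use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer, c_intptr_t
    implicit none
    integer :: comm_group, irank, num_procs, ierr, win, disp_unit
    integer(kind=MPI_ADDRESS_KIND) :: win_size
    type(c_ptr) :: cp1
    integer, pointer :: buf(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

    ! Two groups: lower half and upper half of the ranks; memory is
    ! meant to be shared only inside each group.
    call MPI_COMM_SPLIT(MPI_COMM_WORLD, irank*2/num_procs, irank, comm_group, ierr)

    ! Each rank contributes one default integer to its group's window.
    disp_unit = 4          ! assumed size of a default integer
    win_size = disp_unit
    call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
                                 comm_group, cp1, win, ierr)
    call c_f_pointer(cp1, buf, [1])

    ! Each rank writes only into its own element; after the barrier it
    ! should read its own value back, and the base addresses reported
    ! by the two groups should differ.  If the two groups were wrongly
    ! handed the same memory, ranks from different groups would step
    ! on each other here.
    buf(1) = irank
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
    print *, 'rank', irank, 'base', transfer(cp1, 1_c_intptr_t), 'reads', buf(1)

    call MPI_Win_free(win, ierr)
    call MPI_Finalize(ierr)
  end program shared_groups_sketch
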
> > > > Could you please confirm that this is the wrong behaviour?

> > > > Best regards,
> > > > Peter Wind

> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post: http://www.open-mpi.org/community/lists/users/2016/02/28429.php

> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/02/28431.php
