That worked! That is, with the changes you proposed, the code gives the right result.
That was efficient work, thank you Gilles :)

Best wishes,
Peter

----- Original Message -----
> Thanks Peter,
>
> that is quite unexpected ...
>
> Let's try another workaround: can you replace
>
>   integer :: comm_group
>
> with
>
>   integer :: comm_group, comm_tmp
>
> and
>
>   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
>
> with
>
>   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
>   if (irank < (num_procs/2)) then
>     comm_group = comm_tmp
>   else
>     call MPI_Comm_dup(comm_tmp, comm_group, ierr)
>   endif
>
> If it works, I will make a fix tomorrow when I can access my workstation.
> If not, can you please run
>
>   mpirun --mca osc_base_verbose 100 ...
>
> and post the output? I will then try to reproduce the issue and
> investigate it.
>
> Cheers,
>
> Gilles
>
> On Tuesday, February 2, 2016, Peter Wind < peter.w...@met.no > wrote:
> > Thanks Gilles,
> >
> > I get the following output (I guess it is not what you wanted?).
> >
> > Peter
> >
> > $ mpirun --mca osc pt2pt -np 4 a.out
> > --------------------------------------------------------------------------
> > A requested component was not found, or was unable to be opened. This
> > means that this component is either not installed or is unable to be
> > used on your system (e.g., sometimes this means that shared libraries
> > that the component requires are unable to be found/loaded). Note that
> > Open MPI stopped checking at the first component that it did not find.
> >
> > Host:      stallo-2.local
> > Framework: osc
> > Component: pt2pt
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems. This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_osc_base_open() failed
> >   --> Returned "Not found" (-13) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38415] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38418] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38416] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [stallo-2.local:38417] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that all
> > other processes were killed!
> > -------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun detected that one or more processes exited with non-zero status,
> > thus causing the job to be terminated. The first process to do so was:
> >
> >   Process name: [[52507,1],0]
> >   Exit code:    1
> > --------------------------------------------------------------------------
> > [stallo-2.local:38410] 3 more processes have sent help message
> > help-mca-base.txt / find-available:not-valid
> > [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0 to
> > see all help / error messages
> > [stallo-2.local:38410] 2 more processes have sent help message
> > help-mpi-runtime / mpi_init:startup:internal-failure
> >
> > > Peter,
> > >
> > > at first glance, your test program looks correct.
> > >
> > > Can you please try to run
> > >
> > >   mpirun --mca osc pt2pt -np 4 ...
> > >
> > > I might have identified a bug with the sm osc component.
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > On Tuesday, February 2, 2016, Peter Wind < peter.w...@met.no > wrote:
> > > > Enclosed is a short (< 100 lines) Fortran code example that uses
> > > > shared memory.
> > > >
> > > > It seems to me that it behaves wrongly if Open MPI is used.
> > > > Compiled with SGI/MPT, it gives the right result.
> > > >
> > > > To fail, the code must be run on a single node.
> > > >
> > > > It creates two groups of 2 processes each. Within each group memory
> > > > is shared. The error is that the two groups get the same memory
> > > > allocated, but they should not.
> > > >
> > > > Tested with Open MPI 1.8.4, 1.8.5 and 1.10.2, and with gfortran,
> > > > Intel 13.0 and Intel 14.0; all fail.
> > > >
> > > > The call:
> > > >
> > > >   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
> > > >                                comm_group, cp1, win, ierr)
> > > >
> > > > should allocate memory only within the group. But when the other
> > > > group allocates memory, the pointers from the two groups point to
> > > > the same address in memory.
> > > >
> > > > Could you please confirm that this is the wrong behaviour?
> > > >
> > > > Best regards,
> > > > Peter Wind
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post:
> > > http://www.open-mpi.org/community/lists/users/2016/02/28429.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28431.php
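
The Fortran reproducer attached to the original post is not included in the
archived text. A minimal sketch of the kind of program described above might
look as follows; the names win_size, disp_unit, comm_group, cp1 and win come
from the snippets quoted in the thread, while the integer payload, the
fence/barrier synchronization and the 4-byte disp_unit are assumptions:

    ! Minimal sketch (a reconstruction, not the original attachment):
    ! split MPI_COMM_WORLD into two groups, allocate a shared window per
    ! group, and check whether the two groups were handed the same memory.
    program shared_groups
      use mpi
      use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
      implicit none

      integer :: comm, comm_group, irank, num_procs, win, disp_unit, ierr
      integer :: colour
      integer(kind=MPI_ADDRESS_KIND) :: win_size
      type(c_ptr) :: cp1
      integer, pointer :: shared(:)

      call MPI_Init(ierr)
      comm = MPI_COMM_WORLD
      call MPI_Comm_rank(comm, irank, ierr)
      call MPI_Comm_size(comm, num_procs, ierr)

      ! First half of the ranks gets colour 0, second half colour 1.
      colour = irank*2/num_procs
      call MPI_COMM_SPLIT(comm, colour, irank, comm_group, ierr)

      disp_unit = 4          ! bytes per default integer (assumption)
      win_size = disp_unit   ! one integer per process
      call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
                                   comm_group, cp1, win, ierr)
      call c_f_pointer(cp1, shared, [1])

      ! Every process writes its group colour into its own segment.
      call MPI_Win_fence(0, win, ierr)
      shared(1) = colour
      call MPI_Win_fence(0, win, ierr)
      call MPI_Barrier(comm, ierr)

      ! With the bug described above, the two groups' windows alias the
      ! same memory, so some ranks read back the other group's colour.
      if (shared(1) /= colour) then
        print *, 'rank', irank, ': read', shared(1), 'expected', colour
      else
        print *, 'rank', irank, ': OK'
      end if

      call MPI_Win_free(win, ierr)
      call MPI_Finalize(ierr)
    end program shared_groups

Run with 4 ranks on a single node, a correct implementation should print
four OKs; with the aliasing behaviour reported above, ranks in one group
would read back the other group's colour.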
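Gilles's workaround, quoted above, drops into such a program by replacing the
MPI_COMM_SPLIT call; comm_tmp is the only new name:

      integer :: comm_group, comm_tmp

      ! Split into a temporary communicator, then give one of the two
      ! groups a duplicate instead of the split result. Per the thread,
      ! this sidesteps the suspected bug in the sm osc component.
      call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
      if (irank < (num_procs/2)) then
        comm_group = comm_tmp
      else
        call MPI_Comm_dup(comm_tmp, comm_group, ierr)
      endif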