Thanks Peter, this is just a workaround for a bug we just identified; the fix will come soon.

Cheers,

Gilles
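[For reference, the pieces of the workaround quoted below, assembled into one fragment. This is a minimal sketch, not the exact code from Peter's attachment: only the replacement lines appear in the thread, so the declarations of comm, irank and num_procs (the parent communicator, this process's rank, and the total process count) are assumed here.]

    ! Sketch only: workaround fragments from the message below, assembled.
    integer :: comm, comm_group, comm_tmp, irank, num_procs, ierr

    call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
    if (irank < (num_procs/2)) then
       comm_group = comm_tmp
    else
       ! Handing MPI_Win_allocate_shared a duplicated communicator for the
       ! second group sidesteps the bug in which both groups were mapped
       ! to the same shared-memory segment.
       call MPI_Comm_dup(comm_tmp, comm_group, ierr)
    endif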
On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:

> That worked!
>
> i.e. with the change you proposed, the code gives the right result.
>
> That was efficient work, thank you Gilles :)
>
> Best wishes,
> Peter
>
> ------------------------------
>
> Thanks Peter,
>
> That is quite unexpected ...
>
> Let's try another workaround: can you replace
>
>   integer :: comm_group
>
> with
>
>   integer :: comm_group, comm_tmp
>
> and
>
>   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
>
> with
>
>   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
>   if (irank < (num_procs/2)) then
>     comm_group = comm_tmp
>   else
>     call MPI_Comm_dup(comm_tmp, comm_group, ierr)
>   endif
>
> If it works, I will make a fix tomorrow when I can access my workstation.
> If not, can you please run
>
>   mpirun --mca osc_base_verbose 100 ...
>
> and post the output? I will then try to reproduce the issue and
> investigate it.
>
> Cheers,
>
> Gilles
>
> On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:
>
>> Thanks Gilles,
>>
>> I get the following output (I guess it is not what you wanted?).
>>
>> Peter
>>
>> $ mpirun --mca osc pt2pt -np 4 a.out
>> --------------------------------------------------------------------------
>> A requested component was not found, or was unable to be opened. This
>> means that this component is either not installed or is unable to be
>> used on your system (e.g., sometimes this means that shared libraries
>> that the component requires are unable to be found/loaded). Note that
>> Open MPI stopped checking at the first component that it did not find.
>>
>>   Host:      stallo-2.local
>>   Framework: osc
>>   Component: pt2pt
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   ompi_osc_base_open() failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38415] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38418] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38416] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38417] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[52507,1],0]
>>   Exit code:    1
>> --------------------------------------------------------------------------
>> [stallo-2.local:38410] 3 more processes have sent help message
>> help-mca-base.txt / find-available:not-valid
>> [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> [stallo-2.local:38410] 2 more processes have sent help message
>> help-mpi-runtime / mpi_init:startup:internal-failure
>>
>> ------------------------------
>>
>> Peter,
>>
>> At first glance, your test program looks correct.
>>
>> Can you please try to run
>>
>>   mpirun --mca osc pt2pt -np 4 ...
>>
>> I might have identified a bug with the sm osc component.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:
>>
>>> Enclosed is a short (< 100 lines) Fortran code example that uses
>>> shared memory. It seems to me it behaves wrongly if Open MPI is used.
>>> Compiled with SGI/MPT, it gives the right result.
>>>
>>> To fail, the code must be run on a single node.
>>> It creates two groups of 2 processes each. Within each group memory
>>> is shared. The error is that the two groups get the same memory
>>> allocated, but they should not.
>>>
>>> Tested with Open MPI 1.8.4, 1.8.5, and 1.10.2, with gfortran, Intel
>>> 13.0, and Intel 14.0: all fail.
>>>
>>> The call
>>>
>>>   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, comm_group, cp1, win, ierr)
>>>
>>> should allocate memory only within the group. But when the other
>>> group allocates memory, the pointers from the two groups point to the
>>> same address in memory.
>>>
>>> Could you please confirm that this is the wrong behaviour?
>>>
>>> Best regards,
>>> Peter Wind
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/02/28429.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28431.php
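[Since the < 100-line attachment itself is not preserved in this archive, the following is a minimal sketch of a test along the lines Peter describes: four ranks split into two groups of two, each group allocating a shared window, then a cross-group check. Only the split expression and the MPI_Win_allocate_shared argument list are taken from the thread; the program structure, sizes and sentinel values are assumptions.]

    ! Hypothetical reconstruction, NOT Peter's original attachment.
    ! Run on a single node with: mpirun -np 4 ./a.out
    program shared_group_test
       use mpi
       use, intrinsic :: iso_c_binding
       implicit none
       integer :: comm, comm_group, irank, num_procs, win, ierr
       integer :: disp_unit
       integer(kind=MPI_ADDRESS_KIND) :: win_size
       type(c_ptr) :: cp1
       integer, pointer :: buf(:)

       call MPI_Init(ierr)
       comm = MPI_COMM_WORLD
       call MPI_Comm_rank(comm, irank, ierr)
       call MPI_Comm_size(comm, num_procs, ierr)

       ! Two groups: ranks 0..num_procs/2-1 get color 0, the rest color 1.
       call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)

       disp_unit = 4                 ! bytes per default integer (assumed)
       win_size = 10 * disp_unit
       call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
                                    comm_group, cp1, win, ierr)
       call c_f_pointer(cp1, buf, [10])

       buf(1) = 0                    ! every rank zeroes its own slot
       call MPI_Barrier(comm, ierr)
       if (irank == 0) buf(1) = 42   ! only group 0's leader writes
       call MPI_Barrier(comm, ierr)
       if (irank == num_procs/2) then
          ! With correctly separated segments this prints 0; the bug
          ! reported above would make group 1's leader see 42.
          print *, 'group 1 leader sees buf(1) =', buf(1)
       end if

       call MPI_Win_free(win, ierr)
       call MPI_Finalize(ierr)
    end program shared_group_test

[Strictly speaking, direct loads and stores into an MPI-3 shared-memory window call for passive-target synchronization, e.g. MPI_Win_lock_all plus MPI_Win_sync; the barriers above are enough for a quick illustration, not a rigorous test.]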