Thanks Peter,

that is quite unexpected ...

let's try another workaround: can you replace

integer            :: comm_group

with

integer            :: comm_group, comm_tmp


and

call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)


with


call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)

if (irank < (num_procs/2)) then
    comm_group = comm_tmp
else
    call MPI_Comm_dup(comm_tmp, comm_group, ierr)
endif
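
Just to be explicit about what this changes: the lower half of the ranks keeps the communicator returned by MPI_COMM_SPLIT, while the upper half now uses a duplicate of it instead. comm_group is then passed to MPI_Win_allocate_shared exactly as in your test program (same call, same variables):

call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, comm_group, cp1, win, ierr)

and the rest of the program should stay as it is.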



if it works, I will make a fix tomorrow when I can access my workstation.
if not, can you please run
mpirun --mca osc_base_verbose 100 ...
and post the output?
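
(for example, reusing the invocation from your run above, that would be something like
mpirun --mca osc_base_verbose 100 -np 4 a.out
plus whatever other options you normally use)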

I will then try to reproduce the issue and investigate it.

Cheers,

Gilles

On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:

> Thanks Gilles,
>
> I get the following output (I guess it is not what you wanted?).
>
> Peter
>
>
> $ mpirun --mca osc pt2pt -np 4 a.out
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      stallo-2.local
> Framework: osc
> Component: pt2pt
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_osc_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [stallo-2.local:38415] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [stallo-2.local:38418] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [stallo-2.local:38416] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [stallo-2.local:38417] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[52507,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [stallo-2.local:38410] 3 more processes have sent help message
> help-mca-base.txt / find-available:not-valid
> [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [stallo-2.local:38410] 2 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
>
>
> ------------------------------
>
> Peter,
>
> at first glance, your test program looks correct.
>
> can you please try to run
> mpirun --mca osc pt2pt -np 4 ...
>
> I  might have identified a bug with the sm osc component.
>
> Cheers,
>
> Gilles
>
> On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:
>
>> Enclosed is a short (< 100 lines) fortran code example that uses shared
>> memory.
>> It seems to me it behaves wrongly if openmpi is used.
>> Compiled with SGI/mpt , it gives the right result.
>>
>> To fail, the code must be run on a single node.
>> It creates two groups of 2 processes each. Within each group memory is
>> shared.
>> The error is that the two groups get the same memory allocated, but they
>> should not.
>>
>> Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel
>> 14.0
>> all fail.
>>
>> The call:
>>    call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
>> comm_group, cp1, win, ierr)
>>
>> Should allocate memory only within the group. But when the other group
>> allocates memory, the pointers from the two groups point to the same
>> address in memory.
>>
>> Could you please confirm that this is the wrong behaviour?
>>
>> Best regards,
>> Peter Wind
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28429.php
>
>
>
