Thanks Peter,

this is just a workaround for a bug we just identified; the fix will come
soon.

Cheers,

Gilles

On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:

> That worked!
>
> i.e. with the change you proposed, the code gives the right result.
>
> That was efficient work, thank you Gilles :)
>
> Best wishes,
> Peter
>
>
> ------------------------------
>
> Thanks Peter,
>
> that is quite unexpected ...
>
> let's try another workaround; can you replace
>
> integer            :: comm_group
>
> with
>
> integer            :: comm_group, comm_tmp
>
>
> and
>
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
>
> with
>
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
> if (irank < (num_procs/2)) then
>     comm_group = comm_tmp
> else
>     call MPI_Comm_dup(comm_tmp, comm_group, ierr)
> endif
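>
> Putting the two changes together, the relevant section would then read
> roughly as follows (a sketch only; irank, num_procs, comm and ierr are
> assumed to be the names already used in your program):
>
> integer            :: comm_group, comm_tmp
>
> ! split into the two halves as before, but into a temporary communicator
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
>
> ! workaround: the lower half keeps the split communicator,
> ! the upper half uses a duplicate of it instead
> if (irank < (num_procs/2)) then
>     comm_group = comm_tmp
> else
>     call MPI_Comm_dup(comm_tmp, comm_group, ierr)
> endif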
>
>
>
> if it works, I will make a fix tomorrow when I can access my workstation.
> if not, can you please run
> mpirun --mca osc_base_verbose 100 ...
> and post the output?
>
> I will then try to reproduce the issue and investigate it.
>
> Cheers,
>
> Gilles
>
> On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:
>
>> Thanks Gilles,
>>
>> I get the following output (I guess it is not what you wanted?).
>>
>> Peter
>>
>>
>> $ mpirun --mca osc pt2pt -np 4 a.out
>> --------------------------------------------------------------------------
>> A requested component was not found, or was unable to be opened.  This
>> means that this component is either not installed or is unable to be
>> used on your system (e.g., sometimes this means that shared libraries
>> that the component requires are unable to be found/loaded).  Note that
>> Open MPI stopped checking at the first component that it did not find.
>>
>> Host:      stallo-2.local
>> Framework: osc
>> Component: pt2pt
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_osc_base_open() failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38415] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38418] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38416] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [stallo-2.local:38417] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[52507,1],0]
>>   Exit code:    1
>> --------------------------------------------------------------------------
>> [stallo-2.local:38410] 3 more processes have sent help message
>> help-mca-base.txt / find-available:not-valid
>> [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> [stallo-2.local:38410] 2 more processes have sent help message
>> help-mpi-runtime / mpi_init:startup:internal-failure
>>
>>
>> ------------------------------
>>
>> Peter,
>>
>> at first glance, your test program looks correct.
>>
>> can you please try to run
>> mpirun --mca osc pt2pt -np 4 ...
>>
>> I might have identified a bug with the sm osc component.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, February 2, 2016, Peter Wind <peter.w...@met.no> wrote:
>>
>>> Enclosed is a short (< 100 lines) Fortran code example that uses shared
>>> memory.
>>> It seems to me that it behaves incorrectly when Open MPI is used.
>>> Compiled with SGI/MPT, it gives the right result.
>>>
>>> To trigger the failure, the code must be run on a single node.
>>> It creates two groups of 2 processes each. Within each group memory is
>>> shared.
>>> The error is that the two groups get the same memory allocated, but they
>>> should not.
>>>
>>> Tested with Open MPI 1.8.4, 1.8.5, and 1.10.2, using gfortran, Intel 13.0,
>>> and Intel 14.0;
>>> all combinations fail.
>>>
>>> The call
>>>    call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
>>> comm_group, cp1, win, ierr)
>>>
>>> should allocate memory only within the group. But when the other group
>>> allocates memory, the pointers from the two groups point to the same
>>> address in memory.
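>>>
>>> The attached program is not reproduced here, but a minimal sketch of the
>>> pattern described above could look like the following (win_size, disp_unit,
>>> comm_group, cp1 and win follow the call above; every other name, and the
>>> choice of one default integer per rank, is only an illustrative assumption):
>>>
>>> program win_shared_sketch
>>>   use mpi
>>>   use, intrinsic :: iso_c_binding
>>>   implicit none
>>>   integer :: ierr, irank, num_procs, comm_group, win, disp_unit
>>>   integer(MPI_ADDRESS_KIND) :: win_size, qsize
>>>   type(c_ptr) :: cp1, cp0
>>>   integer, pointer :: mine(:), first(:)
>>>
>>>   call MPI_Init(ierr)
>>>   call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
>>>   call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)
>>>
>>>   ! split the ranks into two halves; each half gets its own shared window
>>>   call MPI_Comm_split(MPI_COMM_WORLD, irank*2/num_procs, irank, comm_group, ierr)
>>>
>>>   ! every rank contributes one (4-byte) integer to its group's window
>>>   disp_unit = 4
>>>   win_size = disp_unit
>>>   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
>>>                                comm_group, cp1, win, ierr)
>>>   call c_f_pointer(cp1, mine, [1])
>>>
>>>   ! each rank stores its world rank in its own slot
>>>   mine(1) = irank
>>>   call MPI_Win_fence(0, win, ierr)
>>>
>>>   ! read the slot of rank 0 of *this* group through the shared window
>>>   call MPI_Win_shared_query(win, 0, qsize, disp_unit, cp0, ierr)
>>>   call c_f_pointer(cp0, first, [1])
>>>   print *, 'world rank', irank, ' sees group leader value ', first(1)
>>>
>>>   call MPI_Win_free(win, ierr)
>>>   call MPI_Finalize(ierr)
>>> end program win_shared_sketch
>>>
>>> With correct behaviour the lower half prints 0 and the upper half prints
>>> num_procs/2; if both halves print the same value, the two groups' windows
>>> alias the same memory, which is the error described above.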
>>>
>>> Could you please confirm that this is the wrong behaviour?
>>>
>>> Best regards,
>>> Peter Wind
>>
>>
>
