I really need to update that wording. It has been a while and the code seems
to have stabilized. It's quite safe to use and supports some of the latest
kernel versions.

-Nathan

> On Nov 13, 2018, at 11:06 PM, Bert Wesarg via users 
> <users@lists.open-mpi.org> wrote:
> 
> Dear Takahiro,
> On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro
> <t-kawash...@jp.fujitsu.com> wrote:
>> 
>> XPMEM moved to GitLab.
>> 
>> https://gitlab.com/hjelmn/xpmem
> 
> the first words from the README aren't very pleasant to read:
> 
> This is an experimental version of XPMEM based on a version provided by
> Cray and uploaded to https://code.google.com/p/xpmem. This version supports
> any kernel 3.12 and newer. *Keep in mind there may be bugs and this version
> may cause kernel panics, code crashes, eat your cat, etc.*
> 
> Installing this on my laptop, where I just want to develop with SHMEM,
> it would be a pity to lose work just because of that.
> 
> Best,
> Bert
> 
>> 
>> Thanks,
>> Takahiro Kawashima,
>> Fujitsu
>> 
>>> Hello Bert,
>>> 
>>> What OS are you running on your notebook?
>>> 
>>> If you are running Linux and have root access to your system, then you
>>> should be able to resolve the Open SHMEM support issue by installing the
>>> XPMEM device driver and rebuilding UCX so it picks up XPMEM support.
>>> 
>>> The source code is on GitHub:
>>> 
>>> https://github.com/hjelmn/xpmem
>>> 
>>> Some instructions on how to build the xpmem device driver are at:
>>> 
>>> https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM
>>> 
>>> You will need to install the kernel source and symbols RPMs on your
>>> system before building the xpmem device driver.
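
For anyone who wants to try this route, the whole sequence looks roughly like
the sketch below. Treat it as a sketch only: the install prefixes, the
source-tree names, the xpmem.ko module path, and the chmod on /dev/xpmem are
my assumptions (check them against the wiki above); the UCX --with-xpmem and
Open MPI --with-ucx configure options are the relevant knobs.

$ git clone https://github.com/hjelmn/xpmem.git
$ cd xpmem
$ ./autogen.sh && ./configure --prefix=/opt/xpmem && make && sudo make install
$ sudo insmod kernel/xpmem.ko   # module location is an assumption; see the wiki
$ sudo chmod 666 /dev/xpmem     # let non-root processes open the xpmem device
$ cd ../ucx-1.4.0
$ ./configure --prefix=/opt/ucx --with-xpmem=/opt/xpmem
$ make -j && sudo make install
$ cd ../openmpi-4.0.0
$ ./configure --prefix=/opt/ompi --with-ucx=/opt/ucx
$ make -j && sudo make install

After that, ucx_info -d should list an xpmem transport, and oshrun should no
longer complain that the destination is unreachable.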
>>> 
>>> Hope this helps,
>>> 
>>> Howard
>>> 
>>> 
>>> On Tue, Nov 13, 2018 at 15:00, Bert Wesarg via users <
>>> users@lists.open-mpi.org> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
>>>> <annou...@lists.open-mpi.org> wrote:
>>>>> 
>>>>> The Open MPI Team, representing a consortium of research, academic, and
>>>>> industry partners, is pleased to announce the release of Open MPI version
>>>>> 4.0.0.
>>>>> 
>>>>> v4.0.0 is the start of a new release series for Open MPI.  Starting with
>>>>> this release, the OpenIB BTL supports only iWarp and RoCE by default,
>>>>> and UCX is the preferred transport protocol for InfiniBand
>>>>> interconnects.  The embedded PMIx runtime has been updated to 3.0.2,
>>>>> and the embedded ROMIO has been updated to 3.2.1.  This release is ABI
>>>>> compatible with the 3.x release streams.  There have been numerous
>>>>> other bug fixes and performance improvements.
>>>>> 
>>>>> Note that starting with Open MPI v4.0.0, prototypes for several
>>>>> MPI-1 symbols that were deleted in the MPI-3.0 specification
>>>>> (which was published in 2012) are no longer available by default in
>>>>> mpi.h. See the README for further details.
>>>>> 
>>>>> Version 4.0.0 can be downloaded from the main Open MPI web site:
>>>>> 
>>>>>  https://www.open-mpi.org/software/ompi/v4.0/
>>>>> 
>>>>> 
>>>>> 4.0.0 -- September, 2018
>>>>> ------------------------
>>>>> 
>>>>> - OSHMEM updated to the OpenSHMEM 1.4 API.
>>>>> - Do not build the OpenSHMEM layer when there are no SPMLs available.
>>>>>  Currently, this means the OpenSHMEM layer will only build if
>>>>>  an MXM or UCX library is found.
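
In practice, that last item means pointing Open MPI's configure at a UCX
installation when you build it; a minimal sketch, with the prefix paths below
being assumptions rather than anything from this thread:

$ ./configure --prefix=$HOME/opt/openmpi-4.0.0 --with-ucx=/opt/ucx
$ make -j && make install
$ ompi_info | grep -i spml   # should list the ucx SPML if OSHMEM was built

Note that having the ucx SPML built is not enough by itself: as the log below
shows, UCX also needs a transport that can do remote memory access between the
local processes, which is where XPMEM comes in.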
>>>> 
>>>> so what is the most convenient way to get SHMEM working on a single
>>>> shared-memory node (i.e. a notebook)? I just realized that I haven't had
>>>> a working SHMEM since Open MPI 3.0. But building with UCX does not help
>>>> either. I tried with UCX 1.4, but Open MPI SHMEM still does not work:
>>>> 
>>>> $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
>>>> $ oshrun -np 2 ./shmem_hello_world-4.0.0
>>>> [1542109710.217344] [tudtug:27715:0]         select.c:406  UCX  ERROR
>>>> no remote registered memory access transport to tudtug:27716:
>>>> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
>>>> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
>>>> mm/posix - Destination is unreachable, cma/cma - no put short
>>>> [1542109710.217344] [tudtug:27716:0]         select.c:406  UCX  ERROR
>>>> no remote registered memory access transport to tudtug:27715:
>>>> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
>>>> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
>>>> mm/posix - Destination is unreachable, cma/cma - no put short
>>>> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
>>>> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
>>>> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
>>>> Error: add procs FAILED rc=-2
>>>> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
>>>> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
>>>> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
>>>> Error: add procs FAILED rc=-2
>>>> --------------------------------------------------------------------------
>>>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during SHMEM_INIT; some of which are due to configuration or
>>>> environment
>>>> problems.  This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open SHMEM
>>>> developer):
>>>> 
>>>>  SPML add procs failed
>>>>  --> Returned "Out of resource" (-2) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> [tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
>>>> initialize - aborting
>>>> [tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
>>>> initialize - aborting
>>>> --------------------------------------------------------------------------
>>>> SHMEM_ABORT was invoked on rank 0 (pid 27715, host=tudtug) with errorcode
>>>> -1.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> A SHMEM process is aborting at a time when it cannot guarantee that all
>>>> of its peer processes in the job will be killed properly.  You should
>>>> double check that everything has shut down cleanly.
>>>> 
>>>> Local host: tudtug
>>>> PID:        27715
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> oshrun detected that one or more processes exited with non-zero
>>>> status, thus causing
>>>> the job to be terminated. The first process to do so was:
>>>> 
>>>>  Process name: [[2212,1],1]
>>>>  Exit code:    255
>>>> --------------------------------------------------------------------------
>>>> [tudtug:27710] 1 more process has sent help message
>>>> help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>> [tudtug:27710] Set MCA parameter "orte_base_help_aggregate" to 0 to
>>>> see all help / error messages
>>>> [tudtug:27710] 1 more process has sent help message help-shmem-api.txt
>>>> / shmem-abort
>>>> [tudtug:27710] 1 more process has sent help message
>>>> help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
>>>> killed
>>>> 
>>>> MPI works as expected:
>>>> 
>>>> $ mpicc -o mpi_hello_world-4.0.0 openmpi-4.0.0/examples/hello_c.c
>>>> $ mpirun -np 2 ./mpi_hello_world-4.0.0
>>>> Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI
>>>> wesarg@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
>>>> 2018, 108)
>>>> Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI
>>>> wesarg@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
>>>> 2018, 108)
>>>> 
>>>> I'm attaching the output from 'ompi_info -a' and also from 'ucx_info
>>>> -b -d -c -s'.
>>>> 
>>>> Thanks for the help.
>> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
