Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Andy Riebs

Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:(  plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32137,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtim
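
The source of mic.out was not posted; a minimal OpenSHMEM program exercising the
same shmem_init() startup path would look roughly like this (a sketch assuming
the usual OpenSHMEM entry points; the older SGI-style start_pes(0) / _my_pe() /
_num_pes() calls would be equivalent):

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        /* This is the call that aborts above when
         * mca_memheap_base_select() returns an error. */
        shmem_init();
        printf("PE %d of %d\n", shmem_my_pe(), shmem_n_pes());
        shmem_finalize();
        return 0;
    }

Built with the oshcc wrapper compiler and launched with shmemrun as shown above.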

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Ralph Castain
Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so 
now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the 
memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line so we 
can see why none of the memheap components are being selected.
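
For example, merged into the earlier invocation (illustrative; same program and
options as above):

    $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca memheap_base_verbose 100 \
          --mca plm_base_verbose 5 $PWD/mic.out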


> On Apr 12, 2015, at 11:30 AM, Andy Riebs  wrote:
> 
> Hi Ralph,
> 
> Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
> [...]

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Riebs, Andy
My fault, I thought the tar ball name looked funny :-)

Will try again tomorrow

Andy

--
Andy Riebs
andy.ri...@hp.com


 Original message 
From: Ralph Castain
Date: 04/12/2015 3:10 PM (GMT-05:00)
To: Open MPI Users
Subject: Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon 
Phi/MIC

Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so 
now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the 
memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line so we 
can see why none of the memheap components are being selected.


On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
[...]

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-12 Thread Subhra Mazumdar
Hi,

I used the mxm MTL as follows but am getting a segfault. It says the mxm
component was not found, but I compiled Open MPI with MXM. Any idea what I
might be missing?

[root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun
--allow-run-as-root --mca pml cm --mca mtl mxm -n 1 -x
LD_PRELOAD=./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./backend
localhosst : -n 1 -x LD_PRELOAD="./libci.so
./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1" ./app2
 i am backend
[JARVICE:08398] *** Process received signal ***
[JARVICE:08398] Signal: Segmentation fault (11)
[JARVICE:08398] Signal code: Address not mapped (1)
[JARVICE:08398] Failing at address: 0x10
[JARVICE:08398] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ff8d0ddb710]
[JARVICE:08398] [ 1]
/root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_components_close+0x21)[0x7ff8cf9ae491]
[JARVICE:08398] [ 2]
/root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_close+0x6a)[0x7ff8cf9b799a]
[JARVICE:08398] [ 3]
/root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_component_close+0x21)[0x7ff8cf9ae431]
[JARVICE:08398] [ 4]
/root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_components_open+0x11c)[0x7ff8cf9ae15c]
[JARVICE:08398] [ 5]
./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(+0xa0de9)[0x7ff8d1089de9]
[JARVICE:08398] [ 6]
/root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7ff8cf9b7b1c]
[JARVICE:08398] [ 7] [JARVICE:08398] mca: base: components_open: component
pml / cm open function failed
./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(ompi_mpi_init+0x4b3)[0x7ff8d102ceb3]
[JARVICE:08398] [ 8]
./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(PMPI_Init_thread+0x100)[0x7ff8d1050cb0]
[JARVICE:08398] [ 9] ./backend[0x404fdf]
[JARVICE:08398] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff8cfeded1d]
[JARVICE:08398] [11] ./backend[0x402db9]
[JARVICE:08398] *** End of error message ***
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  JARVICE
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8398 on node JARVICE exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------


Subhra.
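
One quick sanity check (a diagnostic sketch, not from the thread): ompi_info
from the same install tree should list the mxm MTL if it was really built in,
e.g.

    $ ./openmpi-1.8.4/openmpinstall/bin/ompi_info | grep -i mxm
    # expect a line like "MCA mtl: mxm (...)"; no output means the
    # component was never built or its shared libraries fail to load.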


On Fri, Apr 10, 2015 at 12:12 AM, Mike Dubman  wrote:

> No need for IPoIB; mxm uses native IB.
>
> Please see the HPCX (pre-compiled OMPI, integrated with MXM and FCA) README
> file for details on how to compile/select.
>
> The default transport is UD for inter-node communication and shared memory
> for intra-node.
>
> http://bgate.mellanox.com/products/hpcx/
>
> Also, mxm is included in the Mellanox OFED.
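
If the installed MXM honors it (an assumption; check the MXM/HPCX README for
the supported values), the transport can also be selected per run through the
MXM_TLS environment variable, e.g.:

    $ mpirun -n 2 --mca pml cm --mca mtl mxm -x MXM_TLS=rc,shm,self ./a.out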
>
> On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar <
> subhramazumd...@gmail.com> wrote:
>
>> Hi,
>>
>> Does IPoIB need to be configured on the IB cards for mxm (I have a
>> separate Ethernet connection too)? Also, are there special flags in mpirun
>> to select from UD/RC/DC? What is the default?
>>
>> Thanks,
>> Subhra.
>>
>>
>> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman  wrote:
>>
>>> Hi,
>>> mxm uses IB RDMA/RoCE technologies. One can select the UD/RC/DC transports
>>> to be used in mxm.
>>>
>>> By selecting mxm, all MPI p2p routines will be mapped to appropriate mxm
>>> functions.
>>>
>>> M
>>>
>>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar <
>>> subhramazumd...@gmail.com> wrote:
>>>
 Hi Mike,

 Does the mxm mtl use InfiniBand RDMA? Also, from a programming
 perspective, do I need to use anything other than MPI_Send/MPI_Recv?

 Thanks,
 Subhra.


 On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman  wrote:
> Hi,
> The openib btl does not support this thread model.
> You can use OMPI w/ mxm (-mca mtl mxm) and the multiple-thread model in the
> 1.8.x series, or (-mca pml yalla) in the master branch.
>
> M
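
A minimal way to check what thread level the library actually grants (a
sketch using only standard MPI calls; not code from this thread):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request full multi-threading and see what was granted. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            printf("MPI_THREAD_MULTIPLE not granted; got level %d\n",
                   provided);
        MPI_Finalize();
        return 0;
    }

Run it with the MTL suggested above, e.g. mpirun -n 2 --mca pml cm --mca mtl
mxm ./a.out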
>
> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar <
> subhramazumd...@gmail.com> wrote:
>
>> Hi,
>>
>> Can MPI_THREAD_MULTIPLE and the openib btl work together in Open MPI
>> 1.8.4? If so, are there any command-line options needed at run time?
>>
>> Thanks,
>> Subhra.
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php
>>
>
>
>
> --
>
> Kind Regards,
>