[OMPI users] Failed to register memory (openmpi 2.0.2)

2017-10-18 Thread Mark Dixon

Hi,

We're intermittently seeing messages (below) about failing to register
memory with Open MPI 2.0.2 on CentOS 7 with Mellanox FDR ConnectX-3 and
the vanilla IB stack as shipped by CentOS.


We're not using any mlx4_core module tweaks at the moment. On earlier
machines we used to set registered memory as per the FAQ, but neither
log_num_mtt nor num_mtt seems to exist these days (according to
/sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to
follow the FAQ.
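
The sysfs check described above can be scripted to show which tunables the loaded mlx4 modules actually expose. This is only a minimal sketch, assuming the usual /sys/module/<module>/parameters layout; the helper name list_module_params is illustrative and not part of any tool.

```shell
#!/bin/sh
# list_module_params DIR: print "name=value" for every readable file in DIR.
# DIR is expected to be a /sys/module/<module>/parameters directory.
list_module_params() {
    for f in "$1"/*; do
        # Skip unmatched globs and parameters that are not readable.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Walk every loaded mlx4_* module (prints nothing if none is loaded).
for dir in /sys/module/mlx4_*/parameters; do
    [ -d "$dir" ] || continue
    echo "== $dir"
    list_module_params "$dir"
done
```

Running it on a node with the driver loaded lists whatever parameters that kernel's mlx4 modules provide, which can then be compared against the names used in the FAQ.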


The output of 'ulimit -l' shows as unlimited for every rank.
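
One way to confirm the locked-memory limit as each rank sees it (rather than as the login shell sees it) is to launch a tiny reporter script under mpirun. This is a minimal sketch: the function name is illustrative, and OMPI_COMM_WORLD_RANK is the rank variable Open MPI exports to the processes it launches (shown as "n/a" when the script runs outside mpirun).

```shell
#!/bin/sh
# Print node name, rank, and the max locked-memory limit of this process.
report_memlock() {
    printf '%s rank=%s memlock=%s\n' \
        "$(uname -n)" "${OMPI_COMM_WORLD_RANK:-n/a}" "$(ulimit -l)"
}

report_memlock
```

Saved as, say, report_memlock.sh (a hypothetical file name), it could be run with something like "mpirun -hostfile <hosts> sh ./report_memlock.sh" so that every rank prints its own limit.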

Does anyone have any advice, please?

Thanks,

Mark

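--------------------------------------------------------------------------
Failed to register memory region (MR):

Hostname: dc1s0b1c
Address:  ec5000
Length:   20480
Error:    Cannot allocate memory
--------------------------------------------------------------------------
--------------------------------------------------------------------------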
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] IpV6 Openmpi mpirun failed

2017-10-18 Thread Mukkie
Hi,

I have two IPv6-only machines, and I configured and built Open MPI
version 3.0 with --enable-ipv6.

I want to verify a simple MPI communication call over TCP/IP between
these two machines. I am using the ring_c and connectivity_c examples.

Issuing the following from one of the host machines:

[mselvam@ipv-rhel73 examples]$ mpirun -hostfile host --mca btl tcp,self --mca oob_base_verbose 100 ring_c
.
.
[ipv-rhel71a.locallab.local:10822] [[5331,0],1] tcp_peer_send_blocking: send() to socket 20 failed: Broken pipe (32)

where "host" contains the IPv6 address of the remote machine (namely
'ipv-rhel71a'). I also have passwordless ssh set up to the remote machine.

I will attach verbose output in a follow-up post.

Thanks.

Cordially,

Mukundhan Selvam
Development Engineer, HPC
MSC Software
4675 MacArthur Court, Newport Beach, CA 92660
714-540-8900 ext. 4166

Re: [OMPI users] Failed to register memory (openmpi 2.0.2)

2017-10-18 Thread r...@open-mpi.org
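Put "oob=tcp" in your default MCA param file. That limits the out-of-band
(oob) framework to its TCP component, so the "ud" component named in the
warning is not tried.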
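
As a sketch of what that could look like: Open MPI reads per-user MCA defaults from $HOME/.openmpi/mca-params.conf, and a system-wide openmpi-mca-params.conf under the installation's etc directory works the same way. The helper name set_oob_tcp below is illustrative; it appends the setting only if the file does not already set oob.

```shell
#!/bin/sh
# set_oob_tcp FILE: make sure FILE contains an "oob = tcp" line.
set_oob_tcp() {
    mkdir -p "$(dirname "$1")"
    # Leave the file alone if some oob setting is already present.
    if ! grep -q '^[[:space:]]*oob[[:space:]]*=' "$1" 2>/dev/null; then
        echo 'oob = tcp' >> "$1"
    fi
}
```

For the per-user file this would be called as set_oob_tcp "$HOME/.openmpi/mca-params.conf". For a single run, the same selection can be made on the command line with "mpirun --mca oob tcp ...".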

> On Oct 18, 2017, at 9:00 AM, Mark Dixon  wrote:
> 
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.

Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-18 Thread Mukkie
Adding verbose output. Please check for the failure and advise. Thank you.

[mselvam@ipv-rhel73 examples]$ mpirun -hostfile host --mca oob_base_verbose 100 --mca btl tcp,self ring_c
[ipv-rhel73:10575] mca_base_component_repository_open: unable to open mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
[ipv-rhel73:10575] mca: base: components_register: registering framework oob components
[ipv-rhel73:10575] mca: base: components_register: found loaded component tcp
[ipv-rhel73:10575] mca: base: components_register: component tcp register function successful
[ipv-rhel73:10575] mca: base: components_open: opening oob components
[ipv-rhel73:10575] mca: base: components_open: found loaded component tcp
[ipv-rhel73:10575] mca: base: components_open: component tcp open function successful
[ipv-rhel73:10575] mca:oob:select: checking available component tcp
[ipv-rhel73:10575] mca:oob:select: Querying component [tcp]
[ipv-rhel73:10575] oob:tcp: component_available called
[ipv-rhel73:10575] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
[ipv-rhel73:10575] [[20058,0],0] oob:tcp:init adding fe80::b9b:ac5d:9cf0:b858 to our list of V6 connections
[ipv-rhel73:10575] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[ipv-rhel73:10575] [[20058,0],0] oob:tcp:init rejecting loopback interface lo
[ipv-rhel73:10575] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[ipv-rhel73:10575] [[20058,0],0] TCP STARTUP
[ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv4 port 0
[ipv-rhel73:10575] [[20058,0],0] assigned IPv4 port 53438
[ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv6 port 0
[ipv-rhel73:10575] [[20058,0],0] assigned IPv6 port 43370
[ipv-rhel73:10575] mca:oob:select: Adding component to end
[ipv-rhel73:10575] mca:oob:select: Found 1 active transports
[ipv-rhel73:10575] [[20058,0],0]: get transports
[ipv-rhel73:10575] [[20058,0],0]:get transports for component tcp
[ipv-rhel73:10575] mca_base_component_repository_open: unable to open mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
[ipv-rhel71a.locallab.local:12299] mca: base: components_register: registering framework oob components
[ipv-rhel71a.locallab.local:12299] mca: base: components_register: found loaded component tcp
[ipv-rhel71a.locallab.local:12299] mca: base: components_register: component tcp register function successful
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: opening oob components
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: found loaded component tcp
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: component tcp open function successful
[ipv-rhel71a.locallab.local:12299] mca:oob:select: checking available component tcp
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Querying component [tcp]
[ipv-rhel71a.locallab.local:12299] oob:tcp: component_available called
[ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init adding fe80::226:b9ff:fe85:6a28 to our list of V6 connections
[ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init rejecting loopback interface lo
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] TCP STARTUP
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv4 port 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv4 port 50782
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv6 port 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv6 port 59268
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Adding component to end
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Found 1 active transports
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: get transports
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:get transports for component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: set_addr to uri 1314521088.0;tcp6://[fe80::b9b:ac5d:9cf0:b858]:43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:set_addr checking if peer [[20058,0],0] is reachable via component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp: working peer [[20058,0],0] address tcp6://[fe80::b9b:ac5d:9cf0:b858]:43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SET_PEER ADDING PEER [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] set_peer: peer [[20058,0],0] is listening on net fe80::b9b:ac5d:9cf0:b858 port 43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: peer [[20058,0],0] is reachable via component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] OOB_SEND: rml_oob_send.c:265
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:base:send to target [[20058,0],0] - attempt 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:send_nb to peer [[20058,0],0]:10 seq = -1
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:204] pr

Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-18 Thread r...@open-mpi.org
Looks like there is a firewall or something blocking communication between 
those nodes?
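
A quick way to test basic reachability between the two nodes, independent of MPI, might look like the sketch below. The addresses in the verbose log are IPv6 link-local (fe80::/10), and such addresses need an interface zone (addr%iface) to be usable. The helper names are illustrative, the interface name passed in is a placeholder for whatever the node actually uses, and ping6 is assumed to be installed.

```shell
#!/bin/sh
# is_link_local ADDR: succeed if ADDR is an IPv6 link-local address (fe80::/10).
is_link_local() {
    case "$1" in
        [Ff][Ee][89AaBb][0-9A-Fa-f]:*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_peer ADDR IFACE: ping ADDR over IPv6, adding the %IFACE zone that
# link-local addresses need. IFACE (for example eth0) is a placeholder.
check_peer() {
    if is_link_local "$1"; then
        ping6 -c 3 "$1%$2"
    else
        ping6 -c 3 "$1"
    fi
}
```

From ipv-rhel71a, for example, "check_peer fe80::b9b:ac5d:9cf0:b858 eth0" would ping the address that the mpirun node advertised in the log. A successful ping only shows that ICMPv6 gets through; it does not prove that the TCP ports chosen by the oob component are reachable.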

> On Oct 18, 2017, at 1:29 PM, Mukkie  wrote:
> 
> Adding a verbose output. Please check for failed and advise. Thank you.

Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-18 Thread Mukkie
Thanks for your suggestion. However, the firewalls are already disabled on
both machines.

Cordially,
Muku.

On Wed, Oct 18, 2017 at 2:38 PM, r...@open-mpi.org  wrote:

> Looks like there is a firewall or something blocking communication between
> those nodes?